Speaker-Machine Interaction in Automatic Speech Recognition. Technical Report.
ERIC Educational Resources Information Center
Makhoul, John I.
The feasibility and limitations of speaker adaptation in improving the performance of a "fixed" (speaker-independent) automatic speech recognition system were examined. The recognition system uses a fixed vocabulary of 55 syllables containing 11 stops and fricatives and five tense vowels. The results of an experiment on speaker…
Statistical Evaluation of Biometric Evidence in Forensic Automatic Speaker Recognition
NASA Astrophysics Data System (ADS)
Drygajlo, Andrzej
Forensic speaker recognition is the process of determining whether a specific individual (suspected speaker) is the source of a questioned voice recording (trace). This paper aims at presenting forensic automatic speaker recognition (FASR) methods that provide a coherent way of quantifying and presenting recorded voice as biometric evidence. In such methods, the biometric evidence consists of the quantified degree of similarity between speaker-dependent features extracted from the trace and speaker-dependent features extracted from recorded speech of a suspect. The interpretation of recorded voice as evidence in the forensic context presents particular challenges, including within-speaker (within-source) variability and between-speakers (between-sources) variability. Consequently, FASR methods must provide a statistical evaluation which gives the court an indication of the strength of the evidence given the estimated within-source and between-sources variabilities. This paper reports on the first ENFSI evaluation campaign through a fake case, organized by the Netherlands Forensic Institute (NFI), as an example, where an automatic method using Gaussian mixture models (GMMs) and the Bayesian interpretation (BI) framework was implemented for the forensic speaker recognition task.
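A minimal sketch of the kind of likelihood-ratio scoring a GMM-based Bayesian interpretation framework relies on, assuming MFCC-like feature matrices are already extracted and using scikit-learn's GaussianMixture; this is an illustration, not the NFI/ENFSI implementation.

```python
# Likelihood-ratio scoring of a questioned recording against a suspect model
# and a background (potential population) model. Feature extraction is assumed
# to happen elsewhere; arrays are frames x feature dimensions.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(features, n_components=64, seed=0):
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=seed)
    gmm.fit(features)
    return gmm

def likelihood_ratio(trace_feats, suspect_gmm, population_gmm):
    # score() returns the mean per-frame log-likelihood; take the ratio in log domain.
    ll_suspect = suspect_gmm.score(trace_feats)
    ll_population = population_gmm.score(trace_feats)
    return np.exp(ll_suspect - ll_population)   # LR > 1 supports the same-source hypothesis

# Usage with hypothetical feature arrays:
# suspect_gmm = train_gmm(suspect_reference_feats)
# population_gmm = train_gmm(np.vstack(potential_population_feats))
# lr = likelihood_ratio(questioned_feats, suspect_gmm, population_gmm)
```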
Automatic Intention Recognition in Conversation Processing
ERIC Educational Resources Information Center
Holtgraves, Thomas
2008-01-01
A fundamental assumption of many theories of conversation is that comprehension of a speaker's utterance involves recognition of the speaker's intention in producing that remark. However, the nature of intention recognition is not clear. One approach is to conceptualize a speaker's intention in terms of speech acts [Searle, J. (1969). "Speech…
Botti, F; Alexander, A; Drygajlo, A
2004-12-02
This paper deals with a procedure to compensate for mismatched recording conditions in forensic speaker recognition, using a statistical score normalization. Bayesian interpretation of the evidence in forensic automatic speaker recognition depends on three sets of recordings in order to perform forensic casework: reference (R) and control (C) recordings of the suspect, and a potential population database (P), as well as a questioned recording (QR). The requirement of similar recording conditions between the suspect control database (C) and the questioned recording (QR) is often not satisfied in real forensic cases. The aim of this paper is to investigate a procedure of normalization of scores, which is based on an adaptation of the Test-normalization (T-norm) [2] technique used in the speaker verification domain, to compensate for the mismatch. The Polyphone IPSC-02 database and ASPIC (an automatic speaker recognition system developed by EPFL and IPS-UNIL in Lausanne, Switzerland) were used in order to test the normalization procedure. Experimental results for three different recording condition scenarios are presented using Tippett plots and the effect of the compensation on the evaluation of the strength of the evidence is discussed.
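For reference, a sketch of standard T-norm as used in speaker verification, which the paper adapts (the adaptation details are not reproduced here): the questioned recording is also scored against a cohort of impostor models, and the suspect score is standardized by the cohort statistics.

```python
# Test-normalization (T-norm) of a raw score using impostor-cohort statistics.
import numpy as np

def t_norm(raw_score, cohort_scores):
    """raw_score: score of the questioned recording against the suspect model.
    cohort_scores: scores of the same recording against impostor (cohort) models."""
    cohort_scores = np.asarray(cohort_scores, dtype=float)
    return (raw_score - cohort_scores.mean()) / (cohort_scores.std() + 1e-12)

# Example: t_norm(2.3, [0.1, -0.4, 0.7, 0.2]) -> mismatch-compensated score
```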
Modelling Errors in Automatic Speech Recognition for Dysarthric Speakers
NASA Astrophysics Data System (ADS)
Caballero Morales, Santiago Omar; Cox, Stephen J.
2009-12-01
Dysarthria is a motor speech disorder characterized by weakness, paralysis, or poor coordination of the muscles responsible for speech. Although automatic speech recognition (ASR) systems have been developed for disordered speech, factors such as low intelligibility and limited phonemic repertoire decrease speech recognition accuracy, making conventional speaker adaptation algorithms perform poorly on dysarthric speakers. In this work, rather than adapting the acoustic models, we model the errors made by the speaker and attempt to correct them. For this task, two techniques have been developed: (1) a set of "metamodels" that incorporate a model of the speaker's phonetic confusion matrix into the ASR process; (2) a cascade of weighted finite-state transducers at the confusion matrix, word, and language levels. Both techniques attempt to correct the errors made at the phonetic level and make use of a language model to find the best estimate of the correct word sequence. Our experiments show that both techniques outperform standard adaptation techniques.
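An illustrative sketch of the underlying idea of correcting errors at the phonetic level, not the paper's metamodels or WFST cascade: candidate words are re-scored by combining a speaker-specific phone confusion matrix with a word prior from a language model. For simplicity it assumes same-length phone strings (no insertions or deletions), which the full WFST approach would handle properly.

```python
# Re-score candidate words given a decoded phone sequence, a phone confusion
# matrix estimated for the speaker, and a language-model prior.
import math

def rescore(decoded_phones, lexicon, confusion, lm_prior):
    """decoded_phones: phone sequence output by the ASR system.
    lexicon: dict word -> intended phone sequence.
    confusion: dict (intended_phone, recognized_phone) -> probability.
    lm_prior: dict word -> prior probability from the language model."""
    best_word, best_score = None, -math.inf
    for word, intended in lexicon.items():
        if len(intended) != len(decoded_phones):
            continue  # simplification: only same-length hypotheses
        score = math.log(lm_prior.get(word, 1e-9))
        for i_ph, d_ph in zip(intended, decoded_phones):
            score += math.log(confusion.get((i_ph, d_ph), 1e-6))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Usage (hypothetical entries):
# rescore(["b", "ih", "t"],
#         {"bit": ["b", "ih", "t"], "pit": ["p", "ih", "t"]},
#         confusion, {"bit": 0.5, "pit": 0.5})
```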
Automatic speech recognition technology development at ITT Defense Communications Division
NASA Technical Reports Server (NTRS)
White, George M.
1977-01-01
An assessment of the applications of automatic speech recognition to defense communication systems is presented. Future research efforts include investigations into the following areas: (1) dynamic programming; (2) recognition of speech degraded by noise; (3) speaker independent recognition; (4) large vocabulary recognition; (5) word spotting and continuous speech recognition; and (6) isolated word recognition.
ERIC Educational Resources Information Center
Young, Victoria; Mihailidis, Alex
2010-01-01
Despite their growing presence in home computer applications and various telephony services, commercial automatic speech recognition technologies are still not easily employed by everyone; especially individuals with speech disorders. In addition, relatively little research has been conducted on automatic speech recognition performance with older…
Early Detection of Severe Apnoea through Voice Analysis and Automatic Speaker Recognition Techniques
NASA Astrophysics Data System (ADS)
Fernández, Ruben; Blanco, Jose Luis; Díaz, David; Hernández, Luis A.; López, Eduardo; Alcázar, José
This study is part of an on-going collaborative effort between the medical and the signal processing communities to promote research on applying voice analysis and Automatic Speaker Recognition techniques (ASR) for the automatic diagnosis of patients with severe obstructive sleep apnoea (OSA). Early detection of severe apnoea cases is important so that patients can receive early treatment. Effective ASR-based diagnosis could dramatically cut medical testing time. Working with a carefully designed speech database of healthy and apnoea subjects, we present and discuss the possibilities of using generative Gaussian Mixture Models (GMMs), generally used in ASR systems, to model distinctive apnoea voice characteristics (i.e. abnormal nasalization). Finally, we present experimental findings regarding the discriminative power of speaker recognition techniques applied to severe apnoea detection. We have achieved an 81.25 % correct classification rate, which is very promising and underpins the interest in this line of inquiry.
A Limited-Vocabulary, Multi-Speaker Automatic Isolated Word Recognition System.
ERIC Educational Resources Information Center
Paul, James E., Jr.
Techniques for automatic recognition of isolated words are investigated, and a computer simulation of a word recognition system is effected. Considered in detail are data acquisition and digitizing, word detection, amplitude and time normalization, short-time spectral estimation including spectral windowing, spectral envelope approximation,…
Robust Recognition of Loud and Lombard speech in the Fighter Cockpit Environment
1988-08-01
the latter as inter-speaker variability. According to Zue [Z85], inter-speaker variabilities can be attributed to sociolinguistic background, dialect...Journal of the Acoustical Society of America, Vol. 50, 1971. [At74] B. S. Atal, "Linear prediction for speaker identification," Journal of the Acoustical Society of America, Vol. 55, 1974. [B77] B. Beek, E. P. Neuberg, and D. C. Hodge, "An Assessment of the Technology of Automatic Speech Recognition for
Parametric Representation of the Speaker's Lips for Multimodal Sign Language and Speech Recognition
NASA Astrophysics Data System (ADS)
Ryumin, D.; Karpov, A. A.
2017-05-01
In this article, we propose a new method for parametric representation of the human lip region. The functional diagram of the method is described, and implementation details with an explanation of its key stages and features are given. The results of automatic detection of the regions of interest are illustrated. The processing speed of the method on several computers with different performance levels is reported. This universal method allows applying the parametric representation of the speaker's lips to the tasks of biometrics, computer vision, machine learning, and automatic recognition of faces, elements of sign languages, and audio-visual speech, including lip-reading.
Using Automatic Speech Recognition Technology with Elicited Oral Response Testing
ERIC Educational Resources Information Center
Cox, Troy L.; Davies, Randall S.
2012-01-01
This study examined the use of automatic speech recognition (ASR) scored elicited oral response (EOR) tests to assess the speaking ability of English language learners. It also examined the relationship between ASR-scored EOR and other language proficiency measures and the ability of the ASR to rate speakers without bias to gender or native…
NASA Astrophysics Data System (ADS)
Kayasith, Prakasith; Theeramunkong, Thanaruk
It is a tedious and subjective task to measure the severity of dysarthria by manually evaluating a speaker's speech using available standard assessment methods based on human perception. This paper presents an automated approach to assessing the speech quality of a dysarthric speaker with cerebral palsy. Considering two complementary factors, speech consistency and speech distinction, a speech quality indicator called the speech clarity index (Ψ) is proposed as a measure of the speaker's ability to produce a consistent speech signal for a given word and distinct speech signals for different words. As an application, it can be used to assess speech quality and forecast the speech recognition rate for an individual dysarthric speaker before the exhaustive implementation of an automatic speech recognition system for that speaker. The effectiveness of Ψ as a speech recognition rate predictor is evaluated by rank-order inconsistency, correlation coefficient, and root-mean-square difference. The evaluations were done by comparing its predicted recognition rates with those predicted by the standard articulatory and intelligibility tests, based on two recognition systems (HMM and ANN). The results show that Ψ is a promising indicator for predicting the recognition rate of dysarthric speech. All experiments were done on a speech corpus composed of speech data from eight normal speakers and eight dysarthric speakers.
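A hedged sketch of the two ingredients such a clarity-style indicator combines; the paper's exact formula for Ψ is not reproduced here, and the distance-based proxies below are assumptions for illustration only.

```python
# Within-word consistency vs. between-word distinction as a clarity-style proxy.
import numpy as np

def clarity_index(words):
    """words: dict word -> list of fixed-length feature vectors (one per repetition)."""
    centroids = {w: np.mean(v, axis=0) for w, v in words.items()}
    # Consistency: how close repetitions of the same word stay to their centroid.
    within = np.mean([np.linalg.norm(x - centroids[w])
                      for w, vecs in words.items() for x in vecs])
    # Distinction: how far apart the centroids of different words are.
    cs = list(centroids.values())
    between = np.mean([np.linalg.norm(a - b)
                       for i, a in enumerate(cs) for b in cs[i + 1:]])
    return between / (within + 1e-12)   # larger = clearer, by this proxy
```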
The Development of the Speaker Independent ARM Continuous Speech Recognition System
1992-01-01
spoken airborne reconnaissance reports using a speech recognition system based on phoneme-level hidden Markov models (HMMs). Previous versions of the ARM...will involve automatic selection from multiple model sets, corresponding to different speaker types, and that the most rudimentary partition of a...The vocabulary size for the ARM task is 497 words. These words are related to the phoneme-level symbols corresponding to the models in the model set
Hantke, Simone; Weninger, Felix; Kurle, Richard; Ringeval, Fabien; Batliner, Anton; Mousa, Amr El-Desoky; Schuller, Björn
2016-01-01
We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i.e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database featuring 1.6k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech, which is made publicly available for research purposes. We start with demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We also propose automatic classification both by brute-forcing of low-level acoustic features as well as higher-level features related to intelligibility, obtained from an Automatic Speech Recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier employed in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of eating condition (i.e., eating or not eating) can be easily solved independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, which reaches up to 62.3% average recall for multi-way classification of the eating condition, i.e., discriminating the six types of food, as well as not eating. The early fusion of features related to intelligibility with the brute-forced acoustic feature set improves the performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with up to 56.2% determination coefficient. PMID:27176486
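A minimal sketch of the evaluation protocol described above: an SVM classifier scored with leave-one-speaker-out cross-validation and unweighted average recall. The scikit-learn names are an assumption; the paper's own toolchain and feature extraction are not reproduced here.

```python
# Leave-one-speaker-out SVM evaluation with unweighted average recall (UAR).
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.metrics import recall_score

def loso_uar(features, labels, speaker_ids):
    clf = SVC(kernel="linear", C=1.0)
    preds = cross_val_predict(clf, features, labels,
                              groups=speaker_ids, cv=LeaveOneGroupOut())
    return recall_score(labels, preds, average="macro")  # UAR over classes

# Usage with hypothetical arrays:
# uar = loso_uar(X, y, speakers)
```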
Analysis of human scream and its impact on text-independent speaker verification.
Hansen, John H L; Nandwana, Mahesh Kumar; Shokouhi, Navid
2017-04-01
Screams are defined as sustained, high-energy vocalizations that lack phonological structure; this lack of phonological structure is what distinguishes screams from other forms of loud vocalization, such as yells. This study investigates the acoustic aspects of screams and addresses those that are known to prevent standard speaker identification systems from recognizing the identity of screaming speakers. It is well established that speaker variability due to changes in vocal effort and the Lombard effect contributes to degraded performance in automatic speech systems (i.e., speech recognition, speaker identification, diarization, etc.). However, previous research in the general area of speaker variability has concentrated on human speech production, whereas less is known about non-speech vocalizations. The UT-NonSpeech corpus is developed here to investigate speaker verification from scream samples. This study considers a detailed analysis in terms of fundamental frequency, spectral peak shift, frame energy distribution, and spectral tilt. It is shown that traditional speaker recognition based on the Gaussian mixture model-universal background model (GMM-UBM) framework is unreliable when evaluated with screams.
NASA Astrophysics Data System (ADS)
Wang, Hongcui; Kawahara, Tatsuya
CALL (Computer Assisted Language Learning) systems using ASR (Automatic Speech Recognition) for second language learning have received increasing interest recently. However, it still remains a challenge to achieve high speech recognition performance, including accurate detection of erroneous utterances by non-native speakers. Conventionally, possible error patterns, based on linguistic knowledge, are added to the lexicon and language model, or the ASR grammar network. However, this approach quickly runs into a trade-off between error coverage and increased perplexity. To solve the problem, we propose a method based on a decision tree to learn effective prediction of errors made by non-native speakers. An experimental evaluation with a number of foreign students learning Japanese shows that the proposed method can effectively generate an ASR grammar network, given a target sentence, to achieve both better coverage of errors and smaller perplexity, resulting in significant improvement in ASR accuracy.
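An illustrative sketch, not the paper's system: a decision tree trained on learner data predicts which error variants are likely for each item in a target sentence, so that only those variants are added to the grammar network. Feature encoding, the variant inventory, and the actual grammar construction are assumed to happen elsewhere.

```python
# Select likely non-native error variants for one context using a decision tree.
from sklearn.tree import DecisionTreeClassifier

def predicted_variants(tree, context_features, candidate_variants, top_k=2):
    """Return the top-k error variants the tree considers likely for one context."""
    proba = tree.predict_proba([context_features])[0]
    ranked = sorted(zip(tree.classes_, proba), key=lambda p: -p[1])
    return [variant for variant, p in ranked[:top_k] if variant in candidate_variants]

# Hypothetical usage:
# tree = DecisionTreeClassifier(max_depth=6).fit(train_contexts, train_error_labels)
# arcs = [correct_form] + predicted_variants(tree, ctx, all_known_variants)
```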
An automatic speech recognition system with speaker-independent identification support
NASA Astrophysics Data System (ADS)
Caranica, Alexandru; Burileanu, Corneliu
2015-02-01
The novelty of this work lies in the application of an open source research software toolkit (CMU Sphinx) to train, build and evaluate a speech recognition system, with speaker-independent support, for voice-controlled hardware applications. Moreover, we propose to use the trained acoustic model to successfully decode offline voice commands on embedded hardware, such as the Raspberry Pi, a low-cost ARMv6-based SoC. This type of single-board computer, mainly used for educational and research activities, can serve as a proof-of-concept software and hardware stack for low-cost voice automation systems.
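One possible way (an assumption, not the authors' exact setup) to run offline CMU Sphinx decoding of a recorded voice command from Python, for example on a Raspberry Pi, using the speech_recognition wrapper around PocketSphinx.

```python
# Offline decoding of a recorded voice command with PocketSphinx.
import speech_recognition as sr

def decode_command(wav_path):
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)          # read the whole file
    try:
        return recognizer.recognize_sphinx(audio)  # offline decoding, no network needed
    except sr.UnknownValueError:
        return None                                # command not understood

# print(decode_command("turn_on_lights.wav"))
```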
Speaker emotion recognition: from classical classifiers to deep neural networks
NASA Astrophysics Data System (ADS)
Mezghani, Eya; Charfeddine, Maha; Nicolas, Henri; Ben Amar, Chokri
2018-04-01
Speaker emotion recognition has been considered one of the most challenging tasks in recent years. In fact, automatic systems for security, medicine or education can be improved by taking the affective state of speech into account. In this paper, a twofold approach for speech emotion classification is proposed. First, a relevant set of features is adopted; second, numerous supervised training techniques, involving classical methods as well as deep learning, are evaluated. Experimental results indicate that deep architectures can improve classification performance on two affective databases, the Berlin Dataset of Emotional Speech and the SAVEE (Surrey Audio-Visual Expressed Emotion) dataset.
Multilevel Analysis in Analyzing Speech Data
ERIC Educational Resources Information Center
Guddattu, Vasudeva; Krishna, Y.
2011-01-01
The speech produced by human vocal tract is a complex acoustic signal, with diverse applications in phonetics, speech synthesis, automatic speech recognition, speaker identification, communication aids, speech pathology, speech perception, machine translation, hearing research, rehabilitation and assessment of communication disorders and many…
Automatic speech recognition research at NASA-Ames Research Center
NASA Technical Reports Server (NTRS)
Coler, Clayton R.; Plummer, Robert P.; Huff, Edward M.; Hitchcock, Myron H.
1977-01-01
A trainable acoustic pattern recognizer manufactured by Scope Electronics is presented. The voice command system (VCS) encodes speech by sampling 16 bandpass filters with center frequencies in the range from 200 to 5000 Hz. Variations in speaking rate are compensated for by a compression algorithm that subdivides each utterance into eight subintervals in such a way that the amount of spectral change within each subinterval is the same. The recorded filter values within each subinterval are then reduced to a 15-bit representation, giving a 120-bit encoding for each utterance. The VCS incorporates a simple recognition algorithm that utilizes five training samples of each word in a vocabulary of up to 24 words. The recognition rate of approximately 85 percent correct for untrained speakers and 94 percent correct for trained speakers was not considered adequate for flight systems use. Therefore, the built-in recognition algorithm was disabled, and the VCS was modified to transmit 120-bit encodings to an external computer for recognition.
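A sketch of the time-normalization idea described above: split an utterance into eight subintervals of equal cumulative spectral change, then reduce each to 15 bits (8 x 15 = 120 bits). The exact 15-bit coding is not specified in the summary, so the thresholding step below is a stand-in assumption.

```python
# Equal-spectral-change segmentation and a hypothetical 120-bit encoding.
import numpy as np

def encode_utterance(frames):
    """frames: array of shape (n_frames, 16) of bandpass-filter outputs."""
    change = np.abs(np.diff(frames, axis=0)).sum(axis=1)       # frame-to-frame spectral change
    cum = np.concatenate([[0.0], np.cumsum(change)])
    bounds = np.searchsorted(cum, np.linspace(0, cum[-1], 9))   # 8 equal-change subintervals
    bits = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        seg = frames[a:max(b, a + 1)].mean(axis=0)
        bits.append((seg[:15] > np.median(seg)).astype(int))    # stand-in 15-bit reduction
    return np.concatenate(bits)                                  # 120-bit code

# code = encode_utterance(filterbank_frames)
```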
NASA Astrophysics Data System (ADS)
Fernández Pozo, Rubén; Blanco Murillo, Jose Luis; Hernández Gómez, Luis; López Gonzalo, Eduardo; Alcázar Ramírez, José; Toledano, Doroteo T.
2009-12-01
This study is part of an ongoing collaborative effort between the medical and the signal processing communities to promote research on applying standard Automatic Speech Recognition (ASR) techniques for the automatic diagnosis of patients with severe obstructive sleep apnoea (OSA). Early detection of severe apnoea cases is important so that patients can receive early treatment. Effective ASR-based detection could dramatically cut medical testing time. Working with a carefully designed speech database of healthy and apnoea subjects, we describe an acoustic search for distinctive apnoea voice characteristics. We also study abnormal nasalization in OSA patients by modelling vowels in nasal and nonnasal phonetic contexts using Gaussian Mixture Model (GMM) pattern recognition on speech spectra. Finally, we present experimental findings regarding the discriminative power of GMMs applied to severe apnoea detection. We have achieved an 81% correct classification rate, which is very promising and underpins the interest in this line of inquiry.
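A minimal sketch of the GMM pattern-recognition step described above: one Gaussian Mixture Model per class (apnoea / control) is fit on spectral features, and a test recording is assigned to the class with the higher average log-likelihood. Feature extraction and the nasal/non-nasal context modelling are assumed to happen elsewhere.

```python
# Two-class GMM classification by comparing average log-likelihoods.
from sklearn.mixture import GaussianMixture

def fit_class_models(apnoea_feats, control_feats, n_components=16):
    gmm_apnoea = GaussianMixture(n_components, covariance_type="diag").fit(apnoea_feats)
    gmm_control = GaussianMixture(n_components, covariance_type="diag").fit(control_feats)
    return gmm_apnoea, gmm_control

def classify(test_feats, gmm_apnoea, gmm_control):
    return "apnoea" if gmm_apnoea.score(test_feats) > gmm_control.score(test_feats) else "control"
```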
The Influence of Anticipation of Word Misrecognition on the Likelihood of Stuttering
ERIC Educational Resources Information Center
Brocklehurst, Paul H.; Lickley, Robin J.; Corley, Martin
2012-01-01
This study investigates whether the experience of stuttering can result from the speaker's anticipation of his words being misrecognized. Twelve adults who stutter (AWS) repeated single words into what appeared to be an automatic speech-recognition system. Following each iteration of each word, participants provided a self-rating of whether they…
Parker, Mark; Cunningham, Stuart; Enderby, Pam; Hawley, Mark; Green, Phil
2006-01-01
The STARDUST project developed robust computer speech recognizers for use by eight people with severe dysarthria and concomitant physical disability to access assistive technologies. Independent computer speech recognizers trained with normal speech are of limited functional use by those with severe dysarthria due to limited and inconsistent proximity to "normal" articulatory patterns. Severe dysarthric output may also be characterized by a small mass of distinguishable phonetic tokens making the acoustic differentiation of target words difficult. Speaker dependent computer speech recognition using Hidden Markov Models was achieved by the identification of robust phonetic elements within the individual speaker output patterns. A new system of speech training using computer generated visual and auditory feedback reduced the inconsistent production of key phonetic tokens over time.
Speaker normalization for chinese vowel recognition in cochlear implants.
Luo, Xin; Fu, Qian-Jie
2005-07-01
Because of the limited spectro-temporal resolution associated with cochlear implants, implant patients often have greater difficulty with multitalker speech recognition. The present study investigated whether multitalker speech recognition can be improved by applying speaker normalization techniques to cochlear implant speech processing. Multitalker Chinese vowel recognition was tested with normal-hearing Chinese-speaking subjects listening to a 4-channel cochlear implant simulation, with and without speaker normalization. For each subject, speaker normalization was referenced to the speaker that produced the best recognition performance under conditions without speaker normalization. To match the remaining speakers to this "optimal" output pattern, the overall frequency range of the analysis filter bank was adjusted for each speaker according to the ratio of the mean third formant frequency values between the specific speaker and the reference speaker. Results showed that speaker normalization provided a small but significant improvement in subjects' overall recognition performance. After speaker normalization, subjects' patterns of recognition performance across speakers changed, demonstrating the potential for speaker-dependent effects with the proposed normalization technique.
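A sketch of the normalization rule described above: the analysis filter bank's overall frequency range is rescaled for each speaker by the ratio of mean third-formant (F3) values between that speaker and the reference speaker. The numeric values in the example are hypothetical.

```python
# Rescale the analysis filter-bank range by the speaker-to-reference F3 ratio.
def normalized_filterbank_range(base_low_hz, base_high_hz, f3_speaker, f3_reference):
    scale = f3_speaker / f3_reference     # >1 shifts the range up, <1 shifts it down
    return base_low_hz * scale, base_high_hz * scale

# Example: a speaker with mean F3 of 2900 Hz normalized toward a reference of 2600 Hz
# low, high = normalized_filterbank_range(200.0, 7000.0, 2900.0, 2600.0)
```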
Experimental study on GMM-based speaker recognition
NASA Astrophysics Data System (ADS)
Ye, Wenxing; Wu, Dapeng; Nucci, Antonio
2010-04-01
Speaker recognition plays a very important role in the field of biometric security. In order to improve recognition performance, many pattern recognition techniques have been explored in the literature. Among these techniques, the Gaussian Mixture Model (GMM) has proved to be an effective statistical model for speaker recognition and is used in most state-of-the-art speaker recognition systems. The GMM is used to represent the 'voice print' of a speaker by modeling the spectral characteristics of the speaker's speech signals. In this paper, we implement a speaker recognition system, which consists of preprocessing, Mel-Frequency Cepstrum Coefficients (MFCCs) based feature extraction, and GMM based classification. We test our system with the TIDIGITS data set (325 speakers) and our own recordings of more than 200 speakers; our system achieves 100% correct recognition rate. Moreover, we also test our system under the scenario that training samples are from one language but test samples are from a different language; our system also achieves 100% correct recognition rate, which indicates that our system is language independent.
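A minimal sketch of the pipeline described above: MFCC features per utterance, one GMM per enrolled speaker, and identification by maximum log-likelihood. The use of librosa for MFCC extraction is an assumption; the paper's own preprocessing front end may differ.

```python
# MFCC + GMM speaker identification.
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_features(wav_path, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T   # frames x coefficients

def enroll(speaker_wavs, n_components=32):
    """speaker_wavs: dict speaker_id -> list of training wav paths."""
    models = {}
    for spk, paths in speaker_wavs.items():
        feats = np.vstack([mfcc_features(p) for p in paths])
        models[spk] = GaussianMixture(n_components, covariance_type="diag").fit(feats)
    return models

def identify(wav_path, models):
    feats = mfcc_features(wav_path)
    return max(models, key=lambda spk: models[spk].score(feats))
```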
NASA Astrophysics Data System (ADS)
Kamiński, K.; Dobrowolski, A. P.
2017-04-01
The paper presents the architecture and the results of optimization of selected elements of an Automatic Speaker Recognition (ASR) system that uses Gaussian Mixture Models (GMM) in the classification process. Optimization was applied to the selection of individual features, using a genetic algorithm, and to the parameters of the Gaussian distributions used to describe individual voices. The developed system was tested to evaluate the impact on speaker identification effectiveness of different compression methods used, among others, in landline, mobile, and VoIP telephony systems. Results are also presented for the effectiveness of speaker identification at specific noise levels in the speech signal and in the presence of other disturbances that can occur during phone calls, which makes it possible to specify the range of applications of the presented ASR system.
NASA Technical Reports Server (NTRS)
Wolf, Jared J.
1977-01-01
The following research was discussed: (1) speech signal processing; (2) automatic speech recognition; (3) continuous speech understanding; (4) speaker recognition; (5) speech compression; (6) subjective and objective evaluation of speech communication systems; (7) measurement of the intelligibility and quality of speech when degraded by noise or other masking stimuli; (8) speech synthesis; (9) instructional aids for second-language learning and for training of the deaf; and (10) investigation of speech correlates of psychological stress. Experimental psychology, control systems, and human factors engineering, which are often relevant to the proper design and operation of speech systems, are also described.
Video indexing based on image and sound
NASA Astrophysics Data System (ADS)
Faudemay, Pascal; Montacie, Claude; Caraty, Marie-Jose
1997-10-01
Video indexing is a major challenge for both scientific and economic reasons. Information extraction can sometimes be easier from the sound channel than from the image channel. We first present a multi-channel and multi-modal query interface, to query sound, image and script through 'pull' and 'push' queries. We then summarize the segmentation phase, which needs information from the image channel. Detection of critical segments is proposed. It should speed up both automatic and manual indexing. We then present an overview of the information extraction phase. Information can be extracted from the sound channel, through speaker recognition, vocal dictation with unconstrained vocabularies, and script alignment with speech. We present experiment results for these various techniques. Speaker recognition methods were tested on the TIMIT and NTIMIT databases. Vocal dictation was tested on newspaper sentences spoken by several speakers. Script alignment was tested on part of a cartoon movie, 'Ivanhoe'. For good quality sound segments, error rates are low enough for use in indexing applications. Major issues are the processing of sound segments with noise or music, and performance improvement through the use of appropriate, low-cost architectures or networks of workstations.
When speaker identity is unavoidable: Neural processing of speaker identity cues in natural speech.
Tuninetti, Alba; Chládková, Kateřina; Peter, Varghese; Schiller, Niels O; Escudero, Paola
2017-11-01
The acoustic properties of speech sounds vary widely across speakers and accents. When perceiving speech, adult listeners normally disregard non-linguistic variation caused by speaker or accent differences, in order to comprehend the linguistic message, e.g. to correctly identify a speech sound or a word. Here we tested whether the process of normalizing speaker and accent differences, facilitating the recognition of linguistic information, is found at the level of neural processing, and whether it is modulated by the listeners' native language. In a multi-deviant oddball paradigm, native and nonnative speakers of Dutch were exposed to naturally-produced Dutch vowels varying in speaker, sex, accent, and phoneme identity. Unexpectedly, the analysis of mismatch negativity (MMN) amplitudes elicited by each type of change shows a large degree of early perceptual sensitivity to non-linguistic cues. This finding on perception of naturally-produced stimuli contrasts with previous studies examining the perception of synthetic stimuli wherein adult listeners automatically disregard acoustic cues to speaker identity. The present finding bears relevance to speech normalization theories, suggesting that at an unattended level of processing, listeners are indeed sensitive to changes in fundamental frequency in natural speech tokens.
Visual speech influences speech perception immediately but not automatically.
Mitterer, Holger; Reinisch, Eva
2017-02-01
Two experiments examined the time course of the use of auditory and visual speech cues to spoken word recognition using an eye-tracking paradigm. Results of the first experiment showed that the use of visual speech cues from lipreading is reduced if concurrently presented pictures require a division of attentional resources. This reduction was evident even when listeners' eye gaze was on the speaker rather than the (static) pictures. Experiment 2 used a deictic hand gesture to foster attention to the speaker. At the same time, the visual processing load was reduced by keeping the visual display constant over a fixed number of successive trials. Under these conditions, the visual speech cues from lipreading were used. Moreover, the eye-tracking data indicated that visual information was used immediately and even earlier than auditory information. In combination, these data indicate that visual speech cues are not used automatically, but if they are used, they are used immediately.
Shahamiri, Seyed Reza; Salim, Siti Salwah Binti
2014-09-01
Automatic speech recognition (ASR) can be very helpful for speakers who suffer from dysarthria, a neurological disability that damages the control of motor speech articulators. Although a few attempts have been made to apply ASR technologies to sufferers of dysarthria, previous studies show that such ASR systems have not attained an adequate level of performance. In this study, a dysarthric multi-networks speech recognizer (DM-NSR) model is provided using a realization of the multi-views multi-learners approach called multi-net artificial neural networks, which tolerates the variability of dysarthric speech. In particular, the DM-NSR model employs several ANNs (as learners) to approximate the likelihood of ASR vocabulary words and to deal with the complexity of dysarthric speech. The proposed DM-NSR approach was presented in both speaker-dependent and speaker-independent paradigms. In order to highlight the performance of the proposed model over legacy models, multi-views single-learner models of the DM-NSRs were also provided and their efficiencies were compared in detail. Moreover, a comparison among the prominent dysarthric ASR methods and the proposed one is provided. The results show that the DM-NSR model improved the recognition rate by up to 24.67% and reduced the error rate by up to 8.63% over the reference model.
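A hedged sketch in the spirit of the multi-learner idea above, not the authors' exact architecture: several small neural networks are trained on the same vocabulary task and their class probabilities are averaged, so no single network has to cover all the variability of dysarthric speech on its own.

```python
# Ensemble of small MLP learners whose averaged probabilities select the word.
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_multinet(X, y, n_nets=5):
    return [MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=i).fit(X, y)
            for i in range(n_nets)]

def predict_word(nets, x):
    probs = np.mean([net.predict_proba([x])[0] for net in nets], axis=0)
    return nets[0].classes_[int(np.argmax(probs))]
```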
Shin, Young Hoon; Seo, Jiwon
2016-10-29
People with hearing or speaking disabilities are deprived of the benefits of conventional speech recognition technology because it is based on acoustic signals. Recent research has focused on silent speech recognition systems that are based on the motions of a speaker's vocal tract and articulators. Because most silent speech recognition systems use contact sensors that are very inconvenient to users or optical systems that are susceptible to environmental interference, a contactless and robust solution is required. Toward this objective, this paper presents a series of signal processing algorithms for a contactless silent speech recognition system using an impulse radio ultra-wide band (IR-UWB) radar. The IR-UWB radar is used to remotely and wirelessly detect motions of the lips and jaw. In order to extract the necessary features of lip and jaw motions from the received radar signals, we propose a feature extraction algorithm. The proposed algorithm noticeably improved speech recognition performance compared to the existing algorithm during our word recognition test with five speakers. We also propose a speech activity detection algorithm to automatically select speech segments from continuous input signals. Thus, speech recognition processing is performed only when speech segments are detected. Our testbed consists of commercial off-the-shelf radar products, and the proposed algorithms are readily applicable without designing specialized radar hardware for silent speech processing.
Speaker recognition with temporal cues in acoustic and electric hearing
NASA Astrophysics Data System (ADS)
Vongphoe, Michael; Zeng, Fan-Gang
2005-08-01
Natural spoken language processing includes not only speech recognition but also identification of the speaker's gender, age, emotional, and social status. Our purpose in this study is to evaluate whether temporal cues are sufficient to support both speech and speaker recognition. Ten cochlear-implant and six normal-hearing subjects were presented with vowel tokens spoken by three men, three women, two boys, and two girls. In one condition, the subject was asked to recognize the vowel. In the other condition, the subject was asked to identify the speaker. Extensive training was provided for the speaker recognition task. Normal-hearing subjects achieved nearly perfect performance in both tasks. Cochlear-implant subjects achieved good performance in vowel recognition but poor performance in speaker recognition. The level of the cochlear implant performance was functionally equivalent to normal performance with eight spectral bands for vowel recognition but only to one band for speaker recognition. These results show a disassociation between speech and speaker recognition with primarily temporal cues, highlighting the limitation of current speech processing strategies in cochlear implants. Several methods, including explicit encoding of fundamental frequency and frequency modulation, are proposed to improve speaker recognition for current cochlear implant users.
On the Development of Speech Resources for the Mixtec Language
2013-01-01
The Mixtec language is one of the main native languages in Mexico. In general, due to urbanization, discrimination, and limited attempts to promote the culture, the native languages are disappearing. Most of the information available about the Mixtec language is in written form as in dictionaries which, although including examples about how to pronounce the Mixtec words, are not as reliable as listening to the correct pronunciation from a native speaker. Formal acoustic resources, as speech corpora, are almost non-existent for the Mixtec, and no speech technologies are known to have been developed for it. This paper presents the development of the following resources for the Mixtec language: (1) a speech database of traditional narratives of the Mixtec culture spoken by a native speaker (labelled at the phonetic and orthographic levels by means of spectral analysis) and (2) a native speaker-adaptive automatic speech recognition (ASR) system (trained with the speech database) integrated with a Mixtec-to-Spanish/Spanish-to-Mixtec text translator. The speech database, although small and limited to a single variant, was reliable enough to build the multiuser speech application which presented a mean recognition/translation performance up to 94.36% in experiments with non-native speakers (the target users). PMID:23710134
Hybrid Speaker Recognition Using Universal Acoustic Model
NASA Astrophysics Data System (ADS)
Nishimura, Jun; Kuroda, Tadahiro
We propose a novel speaker recognition approach using a speaker-independent universal acoustic model (UAM) for sensornet applications. In sensornet applications such as “Business Microscope”, interactions among knowledge workers in an organization can be visualized by sensing face-to-face communication using wearable sensor nodes. In conventional studies, speakers are detected by comparing the energy of input speech signals among the nodes. However, there are often synchronization errors among the nodes, which degrade the speaker recognition performance. By focusing on properties of the speaker's acoustic channel, the UAM can provide robustness against the synchronization error. The overall speaker recognition accuracy is improved by combining the UAM with the energy-based approach. For 0.1s speech inputs and 4 subjects, a speaker recognition accuracy of 94% is achieved for synchronization errors of less than 100 ms.
Multimodal fusion of polynomial classifiers for automatic person recognition
NASA Astrophysics Data System (ADS)
Broun, Charles C.; Zhang, Xiaozheng
2001-03-01
With the prevalence of the information age, privacy and personalization are forefront in today's society. As such, biometrics are viewed as essential components of current evolving technological systems. Consumers demand unobtrusive and non-invasive approaches. In our previous work, we have demonstrated a speaker verification system that meets these criteria. However, there are additional constraints for fielded systems. The required recognition transactions are often performed in adverse environments and across diverse populations, necessitating robust solutions. There are two significant problem areas in current generation speaker verification systems. The first is the difficulty in acquiring clean audio signals in all environments without encumbering the user with a head-mounted close-talking microphone. Second, unimodal biometric systems do not work with a significant percentage of the population. To combat these issues, multimodal techniques are being investigated to improve system robustness to environmental conditions, as well as improve overall accuracy across the population. We propose a multimodal approach that builds on our current state-of-the-art speaker verification technology. In order to maintain the transparent nature of the speech interface, we focus on optical sensing technology to provide the additional modality, giving us an audio-visual person recognition system. For the audio domain, we use our existing speaker verification system. For the visual domain, we focus on lip motion. This is chosen, rather than static face or iris recognition, because it provides dynamic information about the individual. In addition, the lip dynamics can aid speech recognition to provide liveness testing. The visual processing method makes use of both color and edge information, combined within a Markov random field (MRF) framework, to localize the lips. Geometric features are extracted and input to a polynomial classifier for the person recognition process. A late integration approach, based on a probabilistic model, is employed to combine the two modalities. The system is tested on the XM2VTS database combined with AWGN in the audio domain over a range of signal-to-noise ratios.
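A minimal sketch of late (decision-level) fusion as described above: audio and visual classifiers are run independently and their scores are combined. The log-linear weighting below is a common choice, assumed here rather than taken from the paper.

```python
# Weighted combination of per-identity audio and visual log-scores.
def late_fusion(audio_log_scores, visual_log_scores, audio_weight=0.6):
    """Both inputs: dict identity -> log-likelihood (or log posterior) score."""
    fused = {person: audio_weight * audio_log_scores[person]
                     + (1.0 - audio_weight) * visual_log_scores[person]
             for person in audio_log_scores}
    return max(fused, key=fused.get), fused

# accepted_id, scores = late_fusion({"alice": -1.2, "bob": -3.4},
#                                   {"alice": -0.8, "bob": -2.9})
```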
Schelinski, Stefanie; Riedel, Philipp; von Kriegstein, Katharina
2014-12-01
In auditory-only conditions, for example when we listen to someone on the phone, it is essential to recognize quickly and accurately what is said (speech recognition). Previous studies have shown that speech recognition performance in auditory-only conditions is better if the speaker is known not only by voice, but also by face. Here, we tested the hypothesis that such an improvement in auditory-only speech recognition depends on the ability to lip-read. To test this we recruited a group of adults with autism spectrum disorder (ASD), a condition associated with difficulties in lip-reading, and typically developed controls. All participants were trained to identify six speakers by name and voice. Three speakers were learned by a video showing their face and three others were learned in a matched control condition without face. After training, participants performed an auditory-only speech recognition test that consisted of sentences spoken by the trained speakers. As a control condition, the test also included speaker identity recognition on the same auditory material. The results showed that, in the control group, performance in speech recognition was improved for speakers known by face in comparison to speakers learned in the matched control condition without face. The ASD group lacked such a performance benefit. For the ASD group auditory-only speech recognition was even worse for speakers known by face compared to speakers not known by face. In speaker identity recognition, the ASD group performed worse than the control group independent of whether the speakers were learned with or without face. Two additional visual experiments showed that the ASD group performed worse in lip-reading whereas face identity recognition was within the normal range. The findings support the view that auditory-only communication involves specific visual mechanisms. Further, they indicate that in ASD, speaker-specific dynamic visual information is not available to optimize auditory-only speech recognition.
The 2016 NIST Speaker Recognition Evaluation
2017-08-20
The 2016 NIST Speaker Recognition Evaluation. Seyed Omid Sadjadi, Timothée Kheyrkhah, Audrey Tong, Craig Greenberg, Douglas Reynolds, Elliot...recent in an ongoing series of speaker recognition evaluations (SRE) to foster research in robust text-independent speaker recognition, as well as...online evaluation platform, a fixed training data condition, more variability in test segment duration (uniformly distributed between 10s and 60s
A speech-controlled environmental control system for people with severe dysarthria.
Hawley, Mark S; Enderby, Pam; Green, Phil; Cunningham, Stuart; Brownsell, Simon; Carmichael, James; Parker, Mark; Hatzis, Athanassios; O'Neill, Peter; Palmer, Rebecca
2007-06-01
Automatic speech recognition (ASR) can provide a rapid means of controlling electronic assistive technology. Off-the-shelf ASR systems function poorly for users with severe dysarthria because of the increased variability of their articulations. We have developed a limited vocabulary speaker dependent speech recognition application which has greater tolerance to variability of speech, coupled with a computerised training package which assists dysarthric speakers to improve the consistency of their vocalisations and provides more data for recogniser training. These applications, and their implementation as the interface for a speech-controlled environmental control system (ECS), are described. The results of field trials to evaluate the training program and the speech-controlled ECS are presented. The user-training phase increased the recognition rate from 88.5% to 95.4% (p<0.001). Recognition rates were good for people with even the most severe dysarthria in everyday usage in the home (mean word recognition rate 86.9%). Speech-controlled ECS were less accurate (mean task completion accuracy 78.6% versus 94.8%) but were faster to use than switch-scanning systems, even taking into account the need to repeat unsuccessful operations (mean task completion time 7.7s versus 16.9s, p<0.001). It is concluded that a speech-controlled ECS is a viable alternative to switch-scanning systems for some people with severe dysarthria and would lead, in many cases, to more efficient control of the home.
Tone classification of syllable-segmented Thai speech based on multilayer perceptron
NASA Astrophysics Data System (ADS)
Satravaha, Nuttavudh; Klinkhachorn, Powsiri; Lass, Norman
2002-05-01
Thai is a monosyllabic tonal language that uses tone to convey lexical information about the meaning of a syllable. Thus to completely recognize a spoken Thai syllable, a speech recognition system not only has to recognize a base syllable but also must correctly identify a tone. Hence, tone classification of Thai speech is an essential part of a Thai speech recognition system. Thai has five distinctive tones ("mid," "low," "falling," "high," and "rising") and each tone is represented by a single fundamental frequency (F0) pattern. However, several factors, including tonal coarticulation, stress, intonation, and speaker variability, affect the F0 pattern of a syllable in continuous Thai speech. In this study, an efficient method for tone classification of syllable-segmented Thai speech, which incorporates the effects of tonal coarticulation, stress, and intonation, as well as a method to perform automatic syllable segmentation, were developed. Acoustic parameters were used as the main discriminating parameters. The F0 contour of a segmented syllable was normalized by using a z-score transformation before being presented to a tone classifier. The proposed system was evaluated on 920 test utterances spoken by 8 speakers. A recognition rate of 91.36% was achieved by the proposed system.
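A sketch of the F0 preprocessing described above: the fundamental-frequency contour of a segmented syllable is z-score normalized before being passed to the tone classifier, which reduces speaker-dependent pitch-range differences. The handling of unvoiced frames below is an assumption.

```python
# Z-score normalization of an F0 contour prior to tone classification.
import numpy as np

def zscore_f0(f0_contour):
    f0 = np.asarray(f0_contour, dtype=float)
    voiced = f0[f0 > 0]                       # ignore unvoiced (zero) frames
    mu, sigma = voiced.mean(), voiced.std()
    return np.where(f0 > 0, (f0 - mu) / (sigma + 1e-12), 0.0)

# features = zscore_f0(extracted_f0)  # then fed to the multilayer perceptron
```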
Fifty years of progress in speech and speaker recognition
NASA Astrophysics Data System (ADS)
Furui, Sadaoki
2004-10-01
Speech and speaker recognition technology has made very significant progress in the past 50 years. The progress can be summarized by the following changes: (1) from template matching to corpus-based statistical modeling, e.g., HMM and n-grams, (2) from filter bank/spectral resonance to cepstral features (cepstrum + Δcepstrum + ΔΔcepstrum), (3) from heuristic time-normalization to DTW/DP matching, (4) from "distance"-based to likelihood-based methods, (5) from maximum likelihood to discriminative approaches, e.g., MCE/GPD and MMI, (6) from isolated word to continuous speech recognition, (7) from small vocabulary to large vocabulary recognition, (8) from context-independent units to context-dependent units for recognition, (9) from clean speech to noisy/telephone speech recognition, (10) from single speaker to speaker-independent/adaptive recognition, (11) from monologue to dialogue/conversation recognition, (12) from read speech to spontaneous speech recognition, (13) from recognition to understanding, (14) from single-modality (audio signal only) to multi-modal (audio/visual) speech recognition, (15) from hardware recognizer to software recognizer, and (16) from no commercial application to many practical commercial applications. Most of these advances have taken place in both the fields of speech recognition and speaker recognition. The majority of technological changes have been directed toward the purpose of increasing robustness of recognition, including many other additional important techniques not noted above.
Cost-sensitive learning for emotion robust speaker recognition.
Li, Dongdong; Yang, Yingchun; Dai, Weihui
2014-01-01
In the field of information security, voice is one of the most important parts in biometrics. Especially, with the development of voice communication through the Internet or telephone system, huge voice data resources are accessed. In speaker recognition, voiceprint can be applied as the unique password for the user to prove his/her identity. However, speech with various emotions can cause an unacceptably high error rate and degrade the performance of the speaker recognition system. This paper deals with this problem by introducing a cost-sensitive learning technology to reweight the probability of test affective utterances at the pitch envelope level, which can enhance the robustness of emotion-dependent speaker recognition effectively. Based on that technology, a new architecture of the recognition system as well as its components is proposed in this paper. The experiment conducted on the Mandarin Affective Speech Corpus shows that an 8% improvement in identification rate over traditional speaker recognition is achieved.
Alternative Speech Communication System for Persons with Severe Speech Disorders
NASA Astrophysics Data System (ADS)
Selouani, Sid-Ahmed; Sidi Yakoub, Mohammed; O'Shaughnessy, Douglas
2009-12-01
Assistive speech-enabled systems are proposed to help both French and English speaking persons with various speech disorders. The proposed assistive systems use automatic speech recognition (ASR) and speech synthesis in order to enhance the quality of communication. These systems aim at improving the intelligibility of pathologic speech, making it as natural as possible and close to the original voice of the speaker. The resynthesized utterances use new basic units, a new concatenating algorithm and a grafting technique to correct the poorly pronounced phonemes. The ASR responses are uttered by the new speech synthesis system in order to convey an intelligible message to listeners. Experiments involving four American speakers with severe dysarthria and two Acadian French speakers with sound substitution disorders (SSDs) are carried out to demonstrate the efficiency of the proposed methods. Improvements in the Perceptual Evaluation of Speech Quality (PESQ) value of 5% and of more than 20% are achieved by the speech synthesis systems that deal with SSD and dysarthria, respectively.
MARTI: man-machine animation real-time interface
NASA Astrophysics Data System (ADS)
Jones, Christian M.; Dlay, Satnam S.
1997-05-01
The research introduces MARTI (man-machine animation real-time interface) for the realization of natural human-machine interfacing. The system uses simple vocal sound-tracks of human speakers to provide lip synchronization of computer graphical facial models. We present novel research in a number of engineering disciplines, which include speech recognition, facial modeling, and computer animation. This interdisciplinary research utilizes the latest hybrid connectionist/hidden Markov model speech recognition system to provide very accurate phone recognition and timing for speaker independent continuous speech, and expands on knowledge from the animation industry in the development of accurate facial models and automated animation. The research has many real-world applications which include the provision of a highly accurate and 'natural' man-machine interface to assist user interactions with computer systems and communication with one another using human idiosyncrasies; a complete special effects and animation toolbox providing automatic lip synchronization without the normal constraints of head-sets, joysticks, and skilled animators; compression of video data to well below standard telecommunication channel bandwidth for video communications and multi-media systems; assisting speech training and aids for the handicapped; and facilitating player interaction for 'video gaming' and 'virtual worlds.' MARTI has introduced a new level of realism to man-machine interfacing and special effect animation which has been previously unseen.
Automatic speech recognition (ASR) based approach for speech therapy of aphasic patients: A review
NASA Astrophysics Data System (ADS)
Jamal, Norezmi; Shanta, Shahnoor; Mahmud, Farhanahani; Sha'abani, MNAH
2017-09-01
This paper reviews state-of-the-art automatic speech recognition (ASR) based approaches for speech therapy of aphasic patients. Aphasia is a condition in which the affected person suffers from speech and language disorder resulting from a stroke or brain injury. Since there is a growing body of evidence indicating the possibility of improving the symptoms at an early stage, ASR based solutions are increasingly being researched for speech and language therapy. ASR is a technology that converts human speech into text by matching it against the system's library. This is particularly useful in speech rehabilitation therapies as they provide accurate, real-time evaluation for speech input from an individual with speech disorder. ASR based approaches for speech therapy recognize the speech input from the aphasic patient and provide real-time feedback response to their mistakes. However, the accuracy of ASR is dependent on many factors such as phoneme recognition, speech continuity, speaker and environmental differences as well as our depth of knowledge on human language understanding. Hence, the review examines recent developments in ASR technologies and their performance for individuals with speech and language disorders.
Talker and accent variability effects on spoken word recognition
NASA Astrophysics Data System (ADS)
Nyang, Edna E.; Rogers, Catherine L.; Nishi, Kanae
2003-04-01
A number of studies have shown that words in a list are recognized less accurately in noise and with longer response latencies when they are spoken by multiple talkers, rather than a single talker. These results have been interpreted as support for an exemplar-based model of speech perception, in which it is assumed that detailed information regarding the speaker's voice is preserved in memory and used in recognition, rather than being eliminated via normalization. In the present study, the effects of varying both accent and talker are investigated using lists of words spoken by (a) a single native English speaker, (b) six native English speakers, (c) three native English speakers and three Japanese-accented English speakers. Twelve /hVd/ words were mixed with multi-speaker babble at three signal-to-noise ratios (+10, +5, and 0 dB) to create the word lists. Native English-speaking listeners' percent-correct recognition for words produced by native English speakers across the three talker conditions (single talker native, multi-talker native, and multi-talker mixed native and non-native) and three signal-to-noise ratios will be compared to determine whether sources of speaker variability other than voice alone add to the processing demands imposed by simple (i.e., single accent) speaker variability in spoken word recognition.
Speaker information affects false recognition of unstudied lexical-semantic associates.
Luthra, Sahil; Fox, Neal P; Blumstein, Sheila E
2018-05-01
Recognition of and memory for a spoken word can be facilitated by a prior presentation of that word spoken by the same talker. However, it is less clear whether this speaker congruency advantage generalizes to facilitate recognition of unheard related words. The present investigation employed a false memory paradigm to examine whether information about a speaker's identity in items heard by listeners could influence the recognition of novel items (critical intruders) phonologically or semantically related to the studied items. In Experiment 1, false recognition of semantically associated critical intruders was sensitive to speaker information, though only when subjects attended to talker identity during encoding. Results from Experiment 2 also provide some evidence that talker information affects the false recognition of critical intruders. Taken together, the present findings indicate that indexical information is able to contact the lexical-semantic network to affect the processing of unheard words.
Gardiner, John M; Gregg, Vernon H; Karayianni, Irene
2006-03-01
We report four experiments in which a remember-know paradigm was combined with a response deadline procedure in order to assess memory awareness in fast, as compared with slow, recognition judgments. In the experiments, we also investigated the perceptual effects of study-test congruence, either for picture size or for speaker's voice, following either full or divided attention at study. These perceptual effects occurred in remembering with full attention and in knowing with divided attention, but they were uninfluenced by recognition speed, indicating that their occurrence in remembering or knowing depends more on conscious resources at encoding than on those at retrieval. The results have implications for theoretical accounts of remembering and knowing that assume that remembering is more consciously controlled and effortful, whereas knowing is more automatic and faster.
Artificially intelligent recognition of Arabic speaker using voice print-based local features
NASA Astrophysics Data System (ADS)
Mahmood, Awais; Alsulaiman, Mansour; Muhammad, Ghulam; Akram, Sheeraz
2016-11-01
Local features for any pattern recognition system are based on information extracted locally. In this paper, a local feature extraction technique was developed. The feature was extracted by taking the moving average along the diagonal directions of the time-frequency plane. It captured time-frequency events, producing a unique pattern for each speaker that can be viewed as a voice print of the speaker; hence, we refer to this technique as a voice print-based local feature. The proposed feature was compared to other features, including mel-frequency cepstral coefficients (MFCC), for speaker recognition using two different databases. One of the databases used in the comparison is a subset of an LDC database consisting of two short sentences uttered by 182 speakers. The proposed feature attained a 98.35% recognition rate compared to 96.7% for MFCC on the LDC subset.
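As a rough illustration of the diagonal moving-average idea (not the authors' implementation), the following Python sketch computes a log-magnitude spectrogram and averages it along diagonal directions of the time-frequency plane. The spectrogram settings, window length, and final pooling are illustrative assumptions.

```python
# Hedged sketch: a diagonal moving-average "voice-print" style feature over a
# spectrogram. Parameters are illustrative, not the authors' settings.
import numpy as np
from scipy.signal import spectrogram

def voiceprint_local_feature(signal, fs, win_len=5):
    """Average spectrogram magnitude along diagonals of the time-frequency plane."""
    _, _, S = spectrogram(signal, fs, nperseg=256, noverlap=128)
    S = np.log(np.abs(S) + 1e-10)          # log-magnitude time-frequency plane
    feats = []
    for offset in range(-S.shape[0] + win_len, S.shape[1] - win_len):
        diag = np.diagonal(S, offset=offset)
        if diag.size >= win_len:
            # moving average along this diagonal direction
            smoothed = np.convolve(diag, np.ones(win_len) / win_len, mode="valid")
            feats.append(smoothed.mean())
    return np.asarray(feats)

# Example with synthetic data standing in for speech
x = np.random.randn(16000)                 # 1 s at 16 kHz
print(voiceprint_local_feature(x, 16000).shape)
```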
Real-time speech gisting for ATC applications
NASA Astrophysics Data System (ADS)
Dunkelberger, Kirk A.
1995-06-01
Command and control within the ATC environment remains primarily voice-based. Hence, automatic real time, speaker independent, continuous speech recognition (CSR) has many obvious applications and implied benefits to the ATC community: automated target tagging, aircraft compliance monitoring, controller training, automatic alarm disabling, display management, and many others. However, while current state-of-the-art CSR systems provide upwards of 98% word accuracy in laboratory environments, recent low-intrusion experiments in ATCT environments demonstrated less than 70% word accuracy in spite of significant investments in recognizer tuning. Acoustic channel irregularities and controller/pilot grammar variations impact current CSR algorithms at their weakest points. It will be shown herein, however, that real time context- and environment-sensitive gisting can provide key command phrase recognition rates of greater than 95% using the same low-intrusion approach. The combination of real time inexact syntactic pattern recognition techniques and a tight integration of CSR, gisting, and ATC database accessor system components is the key to these high phrase recognition rates. A system concept for real time gisting in the ATC context is presented herein. After establishing an application context, the discussion presents a minimal CSR technology context and then focuses on the gisting mechanism, desirable interfaces into the ATCT database environment, and data and control flow within the prototype system. Results of recent tests for a subset of the functionality are presented together with suggestions for further research.
DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1
NASA Astrophysics Data System (ADS)
Garofolo, J. S.; Lamel, L. F.; Fisher, W. M.; Fiscus, J. G.; Pallett, D. S.
1993-02-01
The Texas Instruments/Massachusetts Institute of Technology (TIMIT) corpus of read speech has been designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems. TIMIT contains speech from 630 speakers representing 8 major dialect divisions of American English, each speaking 10 phonetically-rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic, and word transcriptions, as well as speech waveform data for each spoken sentence. The release of TIMIT contains several improvements over the Prototype CD-ROM released in December, 1988: (1) full 630-speaker corpus, (2) checked and corrected transcriptions, (3) word-alignment transcriptions, (4) NIST SPHERE-headered waveform files and header manipulation software, (5) phonemic dictionary, (6) new test and training subsets balanced for dialectal and phonetic coverage, and (7) more extensive documentation.
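For readers working with TIMIT, the time-aligned phonetic transcriptions (.PHN files) list a start sample, end sample, and phone label per line at a 16 kHz sampling rate. A minimal reader might look like the following sketch; the file path is a placeholder.

```python
# Hedged sketch: reading a TIMIT .PHN time-aligned phonetic transcription.
# Each line holds "start_sample end_sample phone"; samples are at 16 kHz.
def read_phn(path, fs=16000):
    segments = []
    with open(path) as fh:
        for line in fh:
            start, end, phone = line.split()
            # convert sample indices to seconds
            segments.append((int(start) / fs, int(end) / fs, phone))
    return segments

# Placeholder path following the usual TIMIT directory layout
for t0, t1, phone in read_phn("TIMIT/TRAIN/DR1/FCJF0/SA1.PHN"):
    print(f"{t0:.3f}-{t1:.3f}  {phone}")
```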
Speaker Recognition Using Real vs. Synthetic Parallel Data for DNN Channel Compensation
2016-08-18
Speaker Recognition Using Real vs Synthetic Parallel Data for DNN Channel Compensation Fred Richardson, Michael Brandstein, Jennifer Melot and... denoising DNNs has been demonstrated for several speech technologies such as ASR and speaker recognition. This paper compares the use of real ... AVG and POOL min DCFs). In all cases, the telephone channel performance on SRE10 is improved by the denoising DNNs with the real Mixer 1 and 2
How Psychological Stress Affects Emotional Prosody.
Paulmann, Silke; Furnes, Desire; Bøkenes, Anne Ming; Cozzolino, Philip J
2016-01-01
We explored how experimentally induced psychological stress affects the production and recognition of vocal emotions. In Study 1a, we demonstrate that sentences spoken by stressed speakers are judged by naïve listeners as sounding more stressed than sentences uttered by non-stressed speakers. In Study 1b, negative emotions produced by stressed speakers are generally less well recognized than the same emotions produced by non-stressed speakers. Multiple mediation analyses suggest this poorer recognition of negative stimuli was due to a mismatch between the variation of volume voiced by speakers and the range of volume expected by listeners. Together, this suggests that the stress level of the speaker affects judgments made by the receiver. In Study 2, we demonstrate that participants who were induced with a feeling of stress before carrying out an emotional prosody recognition task performed worse than non-stressed participants. Overall, findings suggest detrimental effects of induced stress on interpersonal sensitivity.
Building Searchable Collections of Enterprise Speech Data.
ERIC Educational Resources Information Center
Cooper, James W.; Viswanathan, Mahesh; Byron, Donna; Chan, Margaret
The study has applied speech recognition and text-mining technologies to a set of recorded outbound marketing calls and analyzed the results. Since speaker-independent speech recognition technology results in a significantly lower recognition rate than that found when the recognizer is trained for a particular speaker, a number of post-processing…
The Development of Word Recognition in a Second Language.
ERIC Educational Resources Information Center
Muljani, D.; Koda, Keiko; Moates, Danny R.
1998-01-01
A study investigated differences in English word recognition between native speakers of Indonesian (an alphabetic language) and Chinese (a logographic language) learning English as a Second Language. Results largely confirmed the hypothesis that an alphabetic first language would predict better word recognition in speakers of an alphabetic language,…
Voice Recognition Software Accuracy with Second Language Speakers of English.
ERIC Educational Resources Information Center
Coniam, D.
1999-01-01
Explores the potential of the use of voice-recognition technology with second-language speakers of English. Involves the analysis of the output produced by a small group of very competent second-language subjects reading a text into the voice recognition software Dragon Systems "Dragon NaturallySpeaking." (Author/VWL)
Simulation of talking faces in the human brain improves auditory speech recognition
von Kriegstein, Katharina; Dogan, Özgür; Grüter, Martina; Giraud, Anne-Lise; Kell, Christian A.; Grüter, Thomas; Kleinschmidt, Andreas; Kiebel, Stefan J.
2008-01-01
Human face-to-face communication is essentially audiovisual. Typically, people talk to us face-to-face, providing concurrent auditory and visual input. Understanding someone is easier when there is visual input, because visual cues like mouth and tongue movements provide complementary information about speech content. Here, we hypothesized that, even in the absence of visual input, the brain optimizes both auditory-only speech and speaker recognition by harvesting speaker-specific predictions and constraints from distinct visual face-processing areas. To test this hypothesis, we performed behavioral and neuroimaging experiments in two groups: subjects with a face recognition deficit (prosopagnosia) and matched controls. The results show that observing a specific person talking for 2 min improves subsequent auditory-only speech and speaker recognition for this person. In both prosopagnosics and controls, behavioral improvement in auditory-only speech recognition was based on an area typically involved in face-movement processing. Improvement in speaker recognition was only present in controls and was based on an area involved in face-identity processing. These findings challenge current unisensory models of speech processing, because they show that, in auditory-only speech, the brain exploits previously encoded audiovisual correlations to optimize communication. We suggest that this optimization is based on speaker-specific audiovisual internal models, which are used to simulate a talking face. PMID:18436648
Bridge Health Monitoring Using a Machine Learning Strategy
DOT National Transportation Integrated Search
2017-01-01
The goal of this project was to cast the SHM problem within a statistical pattern recognition framework. Techniques borrowed from speaker recognition, particularly speaker verification, were used as this discipline deals with problems very similar to...
Automated Intelligibility Assessment of Pathological Speech Using Phonological Features
NASA Astrophysics Data System (ADS)
Middag, Catherine; Martens, Jean-Pierre; Van Nuffelen, Gwen; De Bodt, Marc
2009-12-01
It is commonly acknowledged that word or phoneme intelligibility is an important criterion in the assessment of the communication efficiency of a pathological speaker. People have therefore put a lot of effort in the design of perceptual intelligibility rating tests. These tests usually have the drawback that they employ unnatural speech material (e.g., nonsense words) and that they cannot fully exclude errors due to listener bias. Therefore, there is a growing interest in the application of objective automatic speech recognition technology to automate the intelligibility assessment. Current research is headed towards the design of automated methods which can be shown to produce ratings that correspond well with those emerging from a well-designed and well-performed perceptual test. In this paper, a novel methodology that is built on previous work (Middag et al., 2008) is presented. It utilizes phonological features, automatic speech alignment based on acoustic models that were trained on normal speech, context-dependent speaker feature extraction, and intelligibility prediction based on a small model that can be trained on pathological speech samples. The experimental evaluation of the new system reveals that the root mean squared error of the discrepancies between perceived and computed intelligibilities can be as low as 8 on a scale of 0 to 100.
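The final stage described above, a small model trained on pathological samples to predict intelligibility, can be pictured with a generic sketch like the one below: a ridge regression evaluated by cross-validation and scored with RMSE on a 0-100 scale. The features and scores here are synthetic placeholders, not the phonological speaker features of the cited system.

```python
# Hedged sketch: predicting an intelligibility score (0-100) from per-speaker
# features with a small linear model, reporting RMSE. Data are synthetic.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 12))              # 60 speakers, 12 speaker-level features
y = np.clip(50 + 10 * X[:, 0] + rng.normal(scale=5, size=60), 0, 100)  # intelligibility

model = Ridge(alpha=1.0)
pred = cross_val_predict(model, X, y, cv=5)            # small-data friendly evaluation
rmse = np.sqrt(mean_squared_error(y, pred))
print(f"RMSE on a 0-100 scale: {rmse:.1f}")
```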
Automatic voice recognition using traditional and artificial neural network approaches
NASA Technical Reports Server (NTRS)
Botros, Nazeih M.
1989-01-01
The main objective of this research is to develop an algorithm for isolated-word recognition. This research is focused on digital signal analysis rather than linguistic analysis of speech. Feature extraction is carried out by applying a Linear Predictive Coding (LPC) algorithm of order 10. Continuous-word and speaker independent recognition will be considered in a future study after accomplishing this isolated word research. To examine the similarity between the reference and the training sets, two approaches are explored. The first is implementing traditional pattern recognition techniques where a dynamic time warping algorithm is applied to align the two sets and calculate the probability of matching by measuring the Euclidean distance between the two sets. The second is implementing a backpropagation artificial neural net model with three layers as the pattern classifier. The adaptation rule implemented in this network is the generalized least mean square (LMS) rule. The first approach has been accomplished. A vocabulary of 50 words was selected and tested. The accuracy of the algorithm was found to be around 85 percent. The second approach is in progress at the present time.
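A minimal sketch of the first approach, dynamic time warping with a Euclidean frame distance between feature sequences (for example, frames of order-10 LPC coefficients), is given below. It is a generic DTW recognizer under stated assumptions, not the report's code, and the templates are random placeholders.

```python
# Hedged sketch: DTW with Euclidean local distance for isolated-word matching.
import numpy as np

def dtw_distance(ref, test):
    """ref, test: arrays of shape (n_frames, n_features)."""
    n, m = len(ref), len(test)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(ref[i - 1] - test[j - 1])   # Euclidean frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Recognize by picking the reference word with the smallest warped distance
rng = np.random.default_rng(1)
templates = {w: rng.normal(size=(40, 10)) for w in ("yes", "no", "stop")}
utterance = rng.normal(size=(35, 10))
print(min(templates, key=lambda w: dtw_distance(templates[w], utterance)))
```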
NASA Technical Reports Server (NTRS)
Simpson, C. A.
1985-01-01
In the present study of the responses of pairs of pilots to aircraft warning classification tasks using an isolated word, speaker-dependent speech recognition system, the induced stress was manipulated by means of different scoring procedures for the classification task and by the inclusion of a competitive manual control task. Both speech patterns and recognition accuracy were analyzed, and recognition errors were recorded by type for an isolated word speaker-dependent system and by an offline technique for a connected word speaker-dependent system. While errors increased with task loading for the isolated word system, there was no such effect for task loading in the case of the connected word system.
NASA Astrophysics Data System (ADS)
Poock, G. K.; Martin, B. J.
1984-02-01
This was an applied investigation examining the ability of a speech recognition system to recognize speakers' inputs when the speakers were under different stress levels. Subjects were asked to speak to a voice recognition system under three conditions: (1) normal office environment, (2) emotional stress, and (3) perceptual-motor stress. Results indicate a definite relationship between voice recognition system performance and the type of low stress reference patterns used to achieve recognition.
You had me at "Hello": Rapid extraction of dialect information from spoken words.
Scharinger, Mathias; Monahan, Philip J; Idsardi, William J
2011-06-15
Research on the neuronal underpinnings of speaker identity recognition has identified voice-selective areas in the human brain with evolutionary homologues in non-human primates who have comparable areas for processing species-specific calls. Most studies have focused on estimating the extent and location of these areas. In contrast, relatively few experiments have investigated the time-course of speaker identity, and in particular, dialect processing and identification by electro- or neuromagnetic means. We show here that dialect extraction occurs speaker-independently, pre-attentively and categorically. We used Standard American English and African-American English exemplars of 'Hello' in a magnetoencephalographic (MEG) Mismatch Negativity (MMN) experiment. The MMN as an automatic change detection response of the brain reflected dialect differences that were not entirely reducible to acoustic differences between the pronunciations of 'Hello'. Source analyses of the M100, an auditory evoked response to the vowels suggested additional processing in voice-selective areas whenever a dialect change was detected. These findings are not only relevant for the cognitive neuroscience of language, but also for the social sciences concerned with dialect and race perception. Copyright © 2011 Elsevier Inc. All rights reserved.
Visual face-movement sensitive cortex is relevant for auditory-only speech recognition.
Riedel, Philipp; Ragert, Patrick; Schelinski, Stefanie; Kiebel, Stefan J; von Kriegstein, Katharina
2015-07-01
It is commonly assumed that the recruitment of visual areas during audition is not relevant for performing auditory tasks ('auditory-only view'). According to an alternative view, however, the recruitment of visual cortices is thought to optimize auditory-only task performance ('auditory-visual view'). This alternative view is based on functional magnetic resonance imaging (fMRI) studies. These studies have shown, for example, that even if there is only auditory input available, face-movement sensitive areas within the posterior superior temporal sulcus (pSTS) are involved in understanding what is said (auditory-only speech recognition). This is particularly the case when speakers are known audio-visually, that is, after brief voice-face learning. Here we tested whether the left pSTS involvement is causally related to performance in auditory-only speech recognition when speakers are known by face. To test this hypothesis, we applied cathodal transcranial direct current stimulation (tDCS) to the pSTS during (i) visual-only speech recognition of a speaker known only visually to participants and (ii) auditory-only speech recognition of speakers they learned by voice and face. We defined the cathode as active electrode to down-regulate cortical excitability by hyperpolarization of neurons. tDCS to the pSTS interfered with visual-only speech recognition performance compared to a control group without pSTS stimulation (tDCS to BA6/44 or sham). Critically, compared to controls, pSTS stimulation additionally decreased auditory-only speech recognition performance selectively for voice-face learned speakers. These results are important in two ways. First, they provide direct evidence that the pSTS is causally involved in visual-only speech recognition; this confirms a long-standing prediction of current face-processing models. Secondly, they show that visual face-sensitive pSTS is causally involved in optimizing auditory-only speech recognition. These results are in line with the 'auditory-visual view' of auditory speech perception, which assumes that auditory speech recognition is optimized by using predictions from previously encoded speaker-specific audio-visual internal models. Copyright © 2015 Elsevier Ltd. All rights reserved.
Shin, Young Hoon; Seo, Jiwon
2016-01-01
People with hearing or speaking disabilities are deprived of the benefits of conventional speech recognition technology because it is based on acoustic signals. Recent research has focused on silent speech recognition systems that are based on the motions of a speaker’s vocal tract and articulators. Because most silent speech recognition systems use contact sensors that are very inconvenient to users or optical systems that are susceptible to environmental interference, a contactless and robust solution is hence required. Toward this objective, this paper presents a series of signal processing algorithms for a contactless silent speech recognition system using an impulse radio ultra-wide band (IR-UWB) radar. The IR-UWB radar is used to remotely and wirelessly detect motions of the lips and jaw. In order to extract the necessary features of lip and jaw motions from the received radar signals, we propose a feature extraction algorithm. The proposed algorithm noticeably improved speech recognition performance compared to the existing algorithm during our word recognition test with five speakers. We also propose a speech activity detection algorithm to automatically select speech segments from continuous input signals. Thus, speech recognition processing is performed only when speech segments are detected. Our testbed consists of commercial off-the-shelf radar products, and the proposed algorithms are readily applicable without designing specialized radar hardware for silent speech processing. PMID:27801867
Recognition of speaker-dependent continuous speech with KEAL
NASA Astrophysics Data System (ADS)
Mercier, G.; Bigorgne, D.; Miclet, L.; Le Guennec, L.; Querre, M.
1989-04-01
A description of the speaker-dependent continuous speech recognition system KEAL is given. An unknown utterance is recognized by means of the following procedures: acoustic analysis, phonetic segmentation and identification, and word and sentence analysis. The combination of feature-based, speaker-independent coarse phonetic segmentation with speaker-dependent statistical classification techniques is one of the main design features of the acoustic-phonetic decoder. The lexical access component is essentially based on a statistical dynamic programming technique which aims at matching a phonemic lexical entry, containing various phonological forms, against a phonetic lattice. Sentence recognition is achieved by use of a context-free grammar and a parsing algorithm derived from Earley's parser. A speaker adaptation module allows some of the system parameters to be adjusted by matching known utterances with their acoustical representation. The task to be performed, described by its vocabulary and its grammar, is given as a parameter of the system. Continuously spoken sentences extracted from a 'pseudo-Logo' language are analyzed and results are presented.
Yoneyama, Kiyoko; Munson, Benjamin
2017-02-01
This study examined whether the influence of listeners' language proficiency on L2 speech recognition is affected by the structure of the lexicon. The experiment examined the effect of word frequency (WF) and phonological neighborhood density (PND) on word recognition in native speakers of English and second-language (L2) speakers of English whose first language was Japanese. The stimuli included English words produced by a native speaker of English and English words produced by a native speaker of Japanese (i.e., with Japanese-accented English). The experiment was inspired by the finding of Imai, Flege, and Walley [(2005). J. Acoust. Soc. Am. 117, 896-907] that the influence of talker accent on speech intelligibility for L2 learners of English whose L1 is Spanish varies as a function of words' PND. In the current study, significant interactions between stimulus accentedness and listener group on the accuracy and speed of spoken word recognition were found, as were significant effects of PND and WF on word-recognition accuracy. However, no significant three-way interaction among stimulus talker, listener group, and PND on either measure was found. Results are discussed in light of recent findings on cross-linguistic differences in the nature of the effects of PND on L2 phonological and lexical processing.
NASA Astrophysics Data System (ADS)
Anagnostopoulos, Christos Nikolaos; Vovoli, Eftichia
An emotion recognition framework based on sound processing could improve services in human-computer interaction. Various quantitative speech features obtained from sound processing of acting speech were tested, as to whether they are sufficient or not to discriminate between seven emotions. Multilayered perceptrons were trained to classify gender and emotions on the basis of a 24-input vector, which provide information about the prosody of the speaker over the entire sentence using statistics of sound features. Several experiments were performed and the results were presented analytically. Emotion recognition was successful when speakers and utterances were “known” to the classifier. However, severe misclassifications occurred during the utterance-independent framework. At least, the proposed feature vector achieved promising results for utterance-independent recognition of high- and low-arousal emotions.
Bilingual Computerized Speech Recognition Screening for Depression Symptoms
ERIC Educational Resources Information Center
Gonzalez, Gerardo; Carter, Colby; Blanes, Erika
2007-01-01
The Voice-Interactive Depression Assessment System (VIDAS) is a computerized speech recognition application for screening depression based on the Center for Epidemiological Studies--Depression scale in English and Spanish. Study 1 included 50 English and 47 Spanish speakers. Study 2 involved 108 English and 109 Spanish speakers. Participants…
Kreitewolf, Jens; Friederici, Angela D; von Kriegstein, Katharina
2014-11-15
Hemispheric specialization for linguistic prosody is a controversial issue. While it is commonly assumed that linguistic prosody and emotional prosody are preferentially processed in the right hemisphere, neuropsychological work directly comparing processes of linguistic prosody and emotional prosody suggests a predominant role of the left hemisphere for linguistic prosody processing. Here, we used two functional magnetic resonance imaging (fMRI) experiments to clarify the role of left and right hemispheres in the neural processing of linguistic prosody. In the first experiment, we sought to confirm previous findings showing that linguistic prosody processing compared to other speech-related processes predominantly involves the right hemisphere. Unlike previous studies, we controlled for stimulus influences by employing a prosody and speech task using the same speech material. The second experiment was designed to investigate whether a left-hemispheric involvement in linguistic prosody processing is specific to contrasts between linguistic prosody and emotional prosody or whether it also occurs when linguistic prosody is contrasted against other non-linguistic processes (i.e., speaker recognition). Prosody and speaker tasks were performed on the same stimulus material. In both experiments, linguistic prosody processing was associated with activity in temporal, frontal, parietal and cerebellar regions. Activation in temporo-frontal regions showed differential lateralization depending on whether the control task required recognition of speech or speaker: recognition of linguistic prosody predominantly involved right temporo-frontal areas when it was contrasted against speech recognition; when contrasted against speaker recognition, recognition of linguistic prosody predominantly involved left temporo-frontal areas. The results show that linguistic prosody processing involves functions of both hemispheres and suggest that recognition of linguistic prosody is based on an inter-hemispheric mechanism which exploits both a right-hemispheric sensitivity to pitch information and a left-hemispheric dominance in speech processing. Copyright © 2014 Elsevier Inc. All rights reserved.
NASA Astrophysics Data System (ADS)
Palaniswamy, Sumithra; Duraisamy, Prakash; Alam, Mohammad Showkat; Yuan, Xiaohui
2012-04-01
Automatic speech processing systems are widely used in everyday life such as mobile communication, speech and speaker recognition, and for assisting the hearing impaired. In speech communication systems, the quality and intelligibility of speech is of utmost importance for ease and accuracy of information exchange. To obtain an intelligible speech signal and one that is more pleasant to listen to, noise reduction is essential. In this paper a new Time Adaptive Discrete Bionic Wavelet Thresholding (TADBWT) scheme is proposed. The proposed technique uses the Daubechies mother wavelet to achieve better enhancement of speech from additive non-stationary noises which occur in real life, such as street noise and factory noise. Due to the integration of a human auditory system model into the wavelet transform, the bionic wavelet transform (BWT) has great potential for speech enhancement, which may lead to a new path in speech processing. In the proposed technique, at first, a discrete BWT is applied to noisy speech to derive TADBWT coefficients. Then the adaptive nature of the BWT is captured by introducing a time varying linear factor which updates the coefficients at each scale over time. This approach has shown better performance than existing algorithms at lower input SNR due to modified soft level-dependent thresholding on time adaptive coefficients. The objective and subjective test results confirmed the competency of the TADBWT technique. The effectiveness of the proposed technique is also evaluated for a speaker recognition task under a noisy environment. The recognition results show that the TADBWT technique yields better performance when compared to alternate methods, specifically at lower input SNR.
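To make the thresholding step concrete, the sketch below applies level-dependent soft thresholding to discrete wavelet coefficients obtained with a Daubechies wavelet (PyWavelets). It uses a plain DWT rather than the bionic wavelet transform, and the level-dependent threshold formula is an assumption, not the TADBWT rule.

```python
# Hedged sketch: Daubechies DWT denoising with level-dependent soft thresholds.
# The bionic wavelet transform and the time-adaptive factor are NOT implemented.
import numpy as np
import pywt

def wavelet_denoise(x, wavelet="db8", level=4):
    coeffs = pywt.wavedec(x, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745        # noise estimate from finest scale
    out = [coeffs[0]]                                      # keep approximation coefficients
    for k, detail in enumerate(coeffs[1:], start=1):
        thr = sigma * np.sqrt(2 * np.log(len(x))) / np.sqrt(k)  # level-dependent threshold (assumption)
        out.append(pywt.threshold(detail, thr, mode="soft"))
    return pywt.waverec(out, wavelet)

noisy = np.sin(np.linspace(0, 20, 4000)) + 0.3 * np.random.randn(4000)
clean = wavelet_denoise(noisy)
```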
2015-10-01
Scoring, Gaussian Backend, etc.) as shown in Fig. 39. The methods in this domain also emphasized the ability to perform data purification for both... investigation using the same infrastructure was undertaken to explore Lombard effect "flavor" detection for improved speaker ID. The study... The presence of... dimension selection and compared to a common N-gram frequency based selection. 2.1.2: Exploration on NN/DBN backend: Since Deep Neural Networks (DNN) have
Speaker Linking and Applications using Non-Parametric Hashing Methods
2016-09-08
clustering method based on hashing—canopy-clustering. We apply this method to a large corpus of speaker recordings, demonstrate performance tradeoffs... and compare to other hashing methods. Index Terms: speaker recognition, clustering, hashing, locality sensitive hashing. 1. Introduction We assume... speaker in our corpus. Second, given a QBE method, how can we perform speaker clustering—each clustering should be a single speaker, and a cluster should
Schall, Sonja; von Kriegstein, Katharina
2014-01-01
It has been proposed that internal simulation of the talking face of visually-known speakers facilitates auditory speech recognition. One prediction of this view is that brain areas involved in auditory-only speech comprehension interact with visual face-movement sensitive areas, even under auditory-only listening conditions. Here, we test this hypothesis using connectivity analyses of functional magnetic resonance imaging (fMRI) data. Participants (17 normal participants, 17 developmental prosopagnosics) first learned six speakers via brief voice-face or voice-occupation training (<2 min/speaker). This was followed by an auditory-only speech recognition task and a control task (voice recognition) involving the learned speakers' voices in the MRI scanner. As hypothesized, we found that, during speech recognition, familiarity with the speaker's face increased the functional connectivity between the face-movement sensitive posterior superior temporal sulcus (STS) and an anterior STS region that supports auditory speech intelligibility. There was no difference between normal participants and prosopagnosics. This was expected because previous findings have shown that both groups use the face-movement sensitive STS to optimize auditory-only speech comprehension. Overall, the present findings indicate that learned visual information is integrated into the analysis of auditory-only speech and that this integration results from the interaction of task-relevant face-movement and auditory speech-sensitive areas.
Sensory Intelligence for Extraction of an Abstract Auditory Rule: A Cross-Linguistic Study.
Guo, Xiao-Tao; Wang, Xiao-Dong; Liang, Xiu-Yuan; Wang, Ming; Chen, Lin
2018-02-21
In a complex linguistic environment, while speech sounds can greatly vary, some shared features are often invariant. These invariant features constitute so-called abstract auditory rules. Our previous study has shown that with auditory sensory intelligence, the human brain can automatically extract the abstract auditory rules in the speech sound stream, presumably serving as the neural basis for speech comprehension. However, whether the sensory intelligence for extraction of abstract auditory rules in speech is inherent or experience-dependent remains unclear. To address this issue, we constructed a complex speech sound stream using auditory materials in Mandarin Chinese, in which syllables had a flat lexical tone but differed in other acoustic features to form an abstract auditory rule. This rule was occasionally and randomly violated by the syllables with the rising, dipping or falling tone. We found that both Chinese and foreign speakers detected the violations of the abstract auditory rule in the speech sound stream at a pre-attentive stage, as revealed by the whole-head recordings of mismatch negativity (MMN) in a passive paradigm. However, MMNs peaked earlier in Chinese speakers than in foreign speakers. Furthermore, Chinese speakers showed different MMN peak latencies for the three deviant types, which paralleled recognition points. These findings indicate that the sensory intelligence for extraction of abstract auditory rules in speech sounds is innate but shaped by language experience. Copyright © 2018 IBRO. Published by Elsevier Ltd. All rights reserved.
Speaker Clustering for a Mixture of Singing and Reading (Preprint)
2012-03-01
diarization [2, 3] which answers the question of "who spoke when?" is a combination of speaker segmentation and clustering. Although it is possible to... focuses on speaker clustering, the techniques developed here can be applied to speaker diarization. For the remainder of this paper, the term "speech... and retrieval," Proceedings of the IEEE, vol. 88, 2000. [2] S. Tranter and D. Reynolds, "An overview of automatic speaker diarization systems," IEEE
Intonation and dialog context as constraints for speech recognition.
Taylor, P; King, S; Isard, S; Wright, H
1998-01-01
This paper describes a way of using intonation and dialog context to improve the performance of an automatic speech recognition (ASR) system. Our experiments were run on the DCIEM Maptask corpus, a corpus of spontaneous task-oriented dialog speech. This corpus has been tagged according to a dialog analysis scheme that assigns each utterance to one of 12 "move types," such as "acknowledge," "query-yes/no" or "instruct." Most ASR systems use a bigram language model to constrain the possible sequences of words that might be recognized. Here we use a separate bigram language model for each move type. We show that when the "correct" move-specific language model is used for each utterance in the test set, the word error rate of the recognizer drops. Of course when the recognizer is run on previously unseen data, it cannot know in advance what move type the speaker has just produced. To determine the move type we use an intonation model combined with a dialog model that puts constraints on possible sequences of move types, as well as the speech recognizer likelihoods for the different move-specific models. In the full recognition system, the combination of automatic move type recognition with the move specific language models reduces the overall word error rate by a small but significant amount when compared with a baseline system that does not take intonation or dialog acts into account. Interestingly, the word error improvement is restricted to "initiating" move types, where word recognition is important. In "response" move types, where the important information is conveyed by the move type itself--for example, positive versus negative response--there is no word error improvement, but recognition of the response types themselves is good. The paper discusses the intonation model, the language models, and the dialog model in detail and describes the architecture in which they are combined.
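The move-specific language modelling described above can be sketched as follows: one bigram model per move type, combined with a prior over move types standing in for the dialog/intonation model. The toy sentences, add-one smoothing, and prior values are illustrative assumptions, not the paper's estimators.

```python
# Hedged sketch: scoring a hypothesis with per-move bigram LMs plus a move prior.
import math
from collections import defaultdict

class BigramLM:
    def __init__(self, sentences):
        self.bi = defaultdict(int)
        self.uni = defaultdict(int)
        for words in sentences:
            for w1, w2 in zip(["<s>"] + words, words + ["</s>"]):
                self.bi[(w1, w2)] += 1
                self.uni[w1] += 1
        self.vocab = len(self.uni) + 1

    def logprob(self, words):
        lp = 0.0
        for w1, w2 in zip(["<s>"] + words, words + ["</s>"]):
            # add-one smoothed bigram probability
            lp += math.log((self.bi[(w1, w2)] + 1) / (self.uni[w1] + self.vocab))
        return lp

# One LM per move type, plus a prior over move types from a dialog model
lms = {"instruct": BigramLM([["go", "left"], ["go", "right"]]),
       "acknowledge": BigramLM([["okay"], ["right"]])}
move_prior = {"instruct": math.log(0.6), "acknowledge": math.log(0.4)}

hyp = ["go", "left"]
best = max(lms, key=lambda m: move_prior[m] + lms[m].logprob(hyp))
print(best)
```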
Speaker gender identification based on majority vote classifiers
NASA Astrophysics Data System (ADS)
Mezghani, Eya; Charfeddine, Maha; Nicolas, Henri; Ben Amar, Chokri
2017-03-01
Speaker gender identification is considered among the most important tools in several multimedia applications, namely automatic speech recognition, interactive voice response systems and audio browsing systems. Gender identification system performance is closely linked to the selected feature set and the employed classification model. Typical techniques are based on selecting the best performing classification method or searching for the optimum tuning of one classifier's parameters through experimentation. In this paper, we consider a relevant and rich set of features involving pitch, MFCCs as well as other temporal and frequency-domain descriptors. Five classification models, including decision tree, discriminant analysis, naïve Bayes, support vector machine and k-nearest neighbor, were evaluated. The three best performing classifiers among the five contribute by majority voting over their scores. Experiments were performed on three different datasets spoken in three languages, English, German and Arabic, in order to validate the language independency of the proposed scheme. Results confirm that the presented system reaches a satisfying accuracy rate and promising classification performance thanks to the discriminating abilities and diversity of the used features combined with mid-level statistics.
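A sketch of the voting scheme, under the assumption that the three best of the five classifiers are chosen by cross-validation and then combined by hard majority voting, could look like the following; the random features stand in for the pitch, MFCC and temporal/frequency descriptors used in the paper.

```python
# Hedged sketch: pick the 3 best of 5 classifiers by CV, then hard-vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                  # placeholder acoustic features
y = rng.integers(0, 2, size=200)                # synthetic gender labels

candidates = {
    "tree": DecisionTreeClassifier(),
    "lda": LinearDiscriminantAnalysis(),
    "nb": GaussianNB(),
    "svm": SVC(),
    "knn": KNeighborsClassifier(),
}
scores = {name: cross_val_score(clf, X, y, cv=5).mean() for name, clf in candidates.items()}
top3 = sorted(scores, key=scores.get, reverse=True)[:3]
vote = VotingClassifier([(n, candidates[n]) for n in top3], voting="hard").fit(X, y)
print(top3, vote.score(X, y))
```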
DOE Office of Scientific and Technical Information (OSTI.GOV)
McClanahan, Richard; De Leon, Phillip L.
2014-08-20
The majority of state-of-the-art speaker recognition systems (SR) utilize speaker models that are derived from an adapted universal background model (UBM) in the form of a Gaussian mixture model (GMM). This is true for GMM supervector systems, joint factor analysis systems, and most recently i-vector systems. In all of the identified systems, the posterior probabilities and sufficient statistics calculations represent a computational bottleneck in both enrollment and testing. We propose a multi-layered hash system, employing a tree-structured GMM–UBM which uses Runnalls' Gaussian mixture reduction technique, in order to reduce the number of these calculations. Moreover, with this tree-structured hash, we can trade off reduction in computation against a corresponding degradation of equal error rate (EER). As an example, we also reduce this computation by a factor of 15× while incurring less than 10% relative degradation of EER (or 0.3% absolute EER) when evaluated with NIST 2010 speaker recognition evaluation (SRE) telephone data.
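The idea of shortlisting UBM components before exact scoring can be illustrated with the two-layer sketch below: components are grouped by k-means on their means and only the most promising groups are evaluated exactly. This is an illustrative stand-in, not Runnalls' Gaussian mixture reduction or the authors' hash structure.

```python
# Hedged sketch: two-layer shortlist over GMM-UBM components (diagonal Gaussians).
import numpy as np
from sklearn.cluster import KMeans

def log_gauss_diag(x, means, variances):
    # log N(x; mean, diag(var)) for every component at once
    return -0.5 * (np.sum(np.log(2 * np.pi * variances), axis=1)
                   + np.sum((x - means) ** 2 / variances, axis=1))

rng = np.random.default_rng(0)
M, D = 512, 20
means, variances = rng.normal(size=(M, D)), np.ones((M, D))

groups = KMeans(n_clusters=16, n_init=4, random_state=0).fit(means)
x = rng.normal(size=D)

rep_scores = log_gauss_diag(x, groups.cluster_centers_, np.ones((16, D)))
best_groups = np.argsort(rep_scores)[-4:]                 # keep the 4 best groups
mask = np.isin(groups.labels_, best_groups)
scores = np.full(M, -np.inf)
scores[mask] = log_gauss_diag(x, means[mask], variances[mask])  # exact scoring only for shortlist
print(int(mask.sum()), "of", M, "components scored exactly")
```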
The Use of Voice Cues for Speaker Gender Recognition in Cochlear Implant Recipients
ERIC Educational Resources Information Center
Meister, Hartmut; Fürsen, Katrin; Streicher, Barbara; Lang-Roth, Ruth; Walger, Martin
2016-01-01
Purpose: The focus of this study was to examine the influence of fundamental frequency (F0) and vocal tract length (VTL) modifications on speaker gender recognition in cochlear implant (CI) recipients for different stimulus types. Method: Single words and sentences were manipulated using isolated or combined F0 and VTL cues. Using an 11-point…
"The perceptual bases of speaker identity" revisited
NASA Astrophysics Data System (ADS)
Voiers, William D.
2003-10-01
A series of experiments begun 40 years ago [W. D. Voiers, J. Acoust. Soc. Am. 36, 1065-1073 (1964)] was concerned with identifying the perceived voice traits (PVTs) on which human recognition of voices depends. It culminated with the development of a voice taxonomy based on 20 PVTs and a set of highly reliable rating scales for classifying voices with respect to those PVTs. The development of a perceptual voice taxonomy was motivated by the need for a practical method of evaluating speaker recognizability in voice communication systems. The Diagnostic Speaker Recognition Test (DSRT) evaluates the effects of systems on speaker recognizability as reflected in changes in the inter-listener reliability of voice ratings on the 20 PVTs. The DSRT thus provides a qualitative, as well as quantitative, evaluation of the effects of a system on speaker recognizability. A fringe benefit of this project is PVT rating data for a sample of 680 voices. [Work partially supported by USAFRL.]
Chaspari, Theodora; Soldatos, Constantin; Maragos, Petros
2015-01-01
Ecologically valid procedures are needed for collecting reliable and unbiased emotional data for computer interfaces with social and affective intelligence targeting patients with mental disorders. To this end, the Athens Emotional States Inventory (AESI) proposes the design, recording and validation of an audiovisual database for five emotional states: anger, fear, joy, sadness and neutral. The items of the AESI consist of sentences, each having content indicative of the corresponding emotion. Emotional content was assessed through a survey of 40 young participants with a questionnaire following the Latin square design. The emotional sentences that were correctly identified by 85% of the participants were recorded in a soundproof room with microphones and cameras. A preliminary validation of the AESI is performed through automatic emotion recognition experiments from speech. The resulting database contains 696 recorded utterances in the Greek language by 20 native speakers and has a total duration of approximately 28 min. Speech classification results yield accuracy up to 75.15% for automatically recognizing the emotions in the AESI. These results indicate the usefulness of our approach for collecting emotional data with reliable content, balanced across classes and with reduced environmental variability.
Arruti, Andoni; Cearreta, Idoia; Álvarez, Aitor; Lazkano, Elena; Sierra, Basilio
2014-01-01
Study of emotions in human–computer interaction is a growing research area. This paper shows an attempt to select the most significant features for emotion recognition in spoken Basque and Spanish Languages using different methods for feature selection. RekEmozio database was used as the experimental data set. Several Machine Learning paradigms were used for the emotion classification task. Experiments were executed in three phases, using different sets of features as classification variables in each phase. Moreover, feature subset selection was applied at each phase in order to seek for the most relevant feature subset. The three phases approach was selected to check the validity of the proposed approach. Achieved results show that an instance-based learning algorithm using feature subset selection techniques based on evolutionary algorithms is the best Machine Learning paradigm in automatic emotion recognition, with all different feature sets, obtaining a mean of 80.05% emotion recognition rate in Basque and 74.82% in Spanish. In order to check the goodness of the proposed process, a greedy searching approach (FSS-Forward) has been applied and a comparison between them is provided. Based on achieved results, a set of most relevant non-speaker dependent features is proposed for both languages and new perspectives are suggested. PMID:25279686
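For comparison with the FSS-Forward baseline mentioned above, greedy forward feature subset selection wrapped around an instance-based (k-NN) learner can be sketched with scikit-learn as follows; the data are random stand-ins for the RekEmozio features and the subset size is an arbitrary assumption.

```python
# Hedged sketch: greedy forward feature selection around a k-NN classifier.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 30))                   # placeholder acoustic feature matrix
y = rng.integers(0, 5, size=150)                 # placeholder emotion labels

knn = KNeighborsClassifier(n_neighbors=5)
selector = SequentialFeatureSelector(knn, n_features_to_select=10, direction="forward", cv=5)
selector.fit(X, y)
X_sel = selector.transform(X)
print("selected:", np.flatnonzero(selector.get_support()))
print("CV accuracy:", cross_val_score(knn, X_sel, y, cv=5).mean())
```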
NASA Astrophysics Data System (ADS)
Morishima, Shigeo; Nakamura, Satoshi
2004-12-01
We introduce a multimodal English-to-Japanese and Japanese-to-English translation system that also translates the speaker's speech motion by synchronizing it to the translated speech. This system also introduces both a face synthesis technique that can generate any viseme lip shape and a face tracking technique that can estimate the original position and rotation of a speaker's face in an image sequence. To retain the speaker's facial expression, we substitute only the speech organ's image with the synthesized one, which is made by a 3D wire-frame model that is adaptable to any speaker. Our approach provides translated image synthesis with an extremely small database. The tracking motion of the face from a video image is performed by template matching. In this system, the translation and rotation of the face are detected by using a 3D personal face model whose texture is captured from a video frame. We also propose a method to customize the personal face model by using our GUI tool. By combining these techniques and the translated voice synthesis technique, an automatic multimodal translation can be achieved that is suitable for video mail or automatic dubbing systems into other languages.
Human phoneme recognition depending on speech-intrinsic variability.
Meyer, Bernd T; Jürgens, Tim; Wesker, Thorsten; Brand, Thomas; Kollmeier, Birger
2010-11-01
The influence of different sources of speech-intrinsic variation (speaking rate, effort, style and dialect or accent) on human speech perception was investigated. In listening experiments with 16 listeners, confusions of consonant-vowel-consonant (CVC) and vowel-consonant-vowel (VCV) sounds in speech-weighted noise were analyzed. Experiments were based on the OLLO logatome speech database, which was designed for a man-machine comparison. It contains utterances spoken by 50 speakers from five dialect/accent regions and covers several intrinsic variations. By comparing results depending on intrinsic and extrinsic variations (i.e., different levels of masking noise), the degradation induced by variabilities can be expressed in terms of the SNR. The spectral level distance between the respective speech segment and the long-term spectrum of the masking noise was found to be a good predictor for recognition rates, while phoneme confusions were influenced by the distance to spectrally close phonemes. An analysis based on transmitted information of articulatory features showed that voicing and manner of articulation are comparatively robust cues in the presence of intrinsic variations, whereas the coding of place is more degraded. The database and detailed results have been made available for comparisons between human speech recognition (HSR) and automatic speech recognizers (ASR).
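A simple way to picture the spectral level distance predictor mentioned above is the sketch below, which takes the mean absolute difference between the log power spectra of a speech segment and a long-term noise sample. The exact definition used in the study may differ; this is an illustrative measure only.

```python
# Hedged sketch: a "spectral level distance" between a speech segment and a
# long-term noise spectrum, as mean absolute difference of log power spectra.
import numpy as np
from scipy.signal import welch

def spectral_level_distance(segment, noise, fs=16000):
    _, p_seg = welch(segment, fs, nperseg=512)
    _, p_noise = welch(noise, fs, nperseg=512)
    level_seg = 10 * np.log10(p_seg + 1e-12)
    level_noise = 10 * np.log10(p_noise + 1e-12)
    return np.mean(np.abs(level_seg - level_noise))        # dB distance averaged over frequency

rng = np.random.default_rng(0)
speech_like = rng.normal(size=16000)                       # stand-in speech segment
noise = 0.5 * rng.normal(size=16000 * 10)                  # long-term noise sample
print(f"{spectral_level_distance(speech_like, noise):.1f} dB")
```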
A voice-input voice-output communication aid for people with severe speech impairment.
Hawley, Mark S; Cunningham, Stuart P; Green, Phil D; Enderby, Pam; Palmer, Rebecca; Sehgal, Siddharth; O'Neill, Peter
2013-01-01
A new form of augmentative and alternative communication (AAC) device for people with severe speech impairment-the voice-input voice-output communication aid (VIVOCA)-is described. The VIVOCA recognizes the disordered speech of the user and builds messages, which are converted into synthetic speech. System development was carried out employing user-centered design and development methods, which identified and refined key requirements for the device. A novel methodology for building small vocabulary, speaker-dependent automatic speech recognizers with reduced amounts of training data, was applied. Experiments showed that this method is successful in generating good recognition performance (mean accuracy 96%) on highly disordered speech, even when recognition perplexity is increased. The selected message-building technique traded off various factors including speed of message construction and range of available message outputs. The VIVOCA was evaluated in a field trial by individuals with moderate to severe dysarthria and confirmed that they can make use of the device to produce intelligible speech output from disordered speech input. The trial highlighted some issues which limit the performance and usability of the device when applied in real usage situations, with mean recognition accuracy of 67% in these circumstances. These limitations will be addressed in future work.
DOMAIN MISMATCH COMPENSATION FOR SPEAKER RECOGNITION USING A LIBRARY OF WHITENERS
2015-05-29
DOMAIN MISMATCH COMPENSATION FOR SPEAKER RECOGNITION USING A LIBRARY OF WHITENERS Elliot Singer and Douglas Reynolds Massachusetts Institute of...development data is assumed to be unavailable. The method is based on a generalization of data whitening used in association with i-vector length...normalization and utilizes a library of whitening transforms trained at system development time using strictly out-of-domain data. The approach is
A language-familiarity effect for speaker discrimination without comprehension.
Fleming, David; Giordano, Bruno L; Caldara, Roberto; Belin, Pascal
2014-09-23
The influence of language familiarity upon speaker identification is well established, to such an extent that it has been argued that "Human voice recognition depends on language ability" [Perrachione TK, Del Tufo SN, Gabrieli JDE (2011) Science 333(6042):595]. However, 7-mo-old infants discriminate speakers of their mother tongue better than they do foreign speakers [Johnson EK, Westrek E, Nazzi T, Cutler A (2011) Dev Sci 14(5):1002-1011] despite their limited speech comprehension abilities, suggesting that speaker discrimination may rely on familiarity with the sound structure of one's native language rather than the ability to comprehend speech. To test this hypothesis, we asked Chinese and English adult participants to rate speaker dissimilarity in pairs of sentences in English or Mandarin that were first time-reversed to render them unintelligible. Even in these conditions a language-familiarity effect was observed: Both Chinese and English listeners rated pairs of native-language speakers as more dissimilar than foreign-language speakers, despite their inability to understand the material. Our data indicate that the language familiarity effect is not based on comprehension but rather on familiarity with the phonology of one's native language. This effect may stem from a mechanism analogous to the "other-race" effect in face recognition.
Processing Lexical and Speaker Information in Repetition and Semantic/Associative Priming
ERIC Educational Resources Information Center
Lee, Chao-Yang; Zhang, Yu
2018-01-01
The purpose of this study is to investigate the interaction between processing lexical and speaker-specific information in spoken word recognition. The specific question is whether repetition and semantic/associative priming is reduced when the prime and target are produced by different speakers. In Experiment 1, the prime and target were repeated…
Integrating hidden Markov model and PRAAT: a toolbox for robust automatic speech transcription
NASA Astrophysics Data System (ADS)
Kabir, A.; Barker, J.; Giurgiu, M.
2010-09-01
An automatic time-aligned phone transcription toolbox for English speech corpora has been developed. The toolbox is particularly useful for generating robust automatic transcriptions and can produce phone-level transcriptions using speaker-independent as well as speaker-dependent models without manual intervention. The system is based on the standard Hidden Markov Model (HMM) approach and was successfully tested on a large audiovisual speech corpus, namely the GRID corpus. One of the most powerful features of the toolbox is the increased flexibility in speech processing: the speech community can import the automatic transcription generated by the HMM Toolkit (HTK) into the popular transcription software PRAAT, and vice versa. The toolbox has been evaluated through statistical analysis on GRID data, which shows that the automatic transcription deviates by an average of 20 ms from the manual transcription.
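As an example of the HTK-to-PRAAT hand-off described above, the sketch below converts an HTK label file (lines of start time, end time, and label, with times in 100 ns units) into a minimal single-tier Praat TextGrid; the file paths are placeholders and the writer is deliberately simplistic, not the toolbox's converter.

```python
# Hedged sketch: HTK .lab (100 ns units) -> minimal Praat TextGrid interval tier.
def htk_lab_to_textgrid(lab_path, tg_path, tier="phones"):
    intervals = []
    with open(lab_path) as fh:
        for line in fh:
            start, end, label = line.split()[:3]
            intervals.append((int(start) * 1e-7, int(end) * 1e-7, label))  # 100 ns -> s
    xmax = intervals[-1][1] if intervals else 0.0
    with open(tg_path, "w") as out:
        out.write('File type = "ooTextFile"\nObject class = "TextGrid"\n\n')
        out.write(f"xmin = 0\nxmax = {xmax}\ntiers? <exists>\nsize = 1\nitem []:\n")
        out.write(f'    item [1]:\n        class = "IntervalTier"\n        name = "{tier}"\n')
        out.write(f"        xmin = 0\n        xmax = {xmax}\n")
        out.write(f"        intervals: size = {len(intervals)}\n")
        for i, (t0, t1, lab) in enumerate(intervals, start=1):
            out.write(f"        intervals [{i}]:\n")
            out.write(f"            xmin = {t0}\n            xmax = {t1}\n")
            out.write(f'            text = "{lab}"\n')

htk_lab_to_textgrid("SA1.lab", "SA1.TextGrid")
```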
Bilingual Language Switching: Production vs. Recognition
Mosca, Michela; de Bot, Kees
2017-01-01
This study aims at assessing how bilinguals select words in the appropriate language in production and recognition while minimizing interference from the non-appropriate language. Two prominent models are considered which assume that when one language is in use, the other is suppressed. The Inhibitory Control (IC) model suggests that, in both production and recognition, the amount of inhibition on the non-target language is greater for the stronger compared to the weaker language. In contrast, the Bilingual Interactive Activation (BIA) model proposes that, in language recognition, the amount of inhibition on the weaker language is stronger than otherwise. To investigate whether bilingual language production and recognition can be accounted for by a single model of bilingual processing, we tested a group of native speakers of Dutch (L1), advanced speakers of English (L2) in a bilingual recognition and production task. Specifically, language switching costs were measured while participants performed a lexical decision (recognition) and a picture naming (production) task involving language switching. Results suggest that while in language recognition the amount of inhibition applied to the non-appropriate language increases along with its dominance as predicted by the IC model, in production the amount of inhibition applied to the non-relevant language is not related to language dominance, but rather it may be modulated by speakers' unconscious strategies to foster the weaker language. This difference indicates that bilingual language recognition and production might rely on different processing mechanisms and cannot be accounted within one of the existing models of bilingual language processing. PMID:28638361
Speaker Recognition Using Real vs. Synthetic Parallel Data for DNN Channel Compensation
2016-09-08
Speaker Recognition Using Real vs Synthetic Parallel Data for DNN Channel Compensation Fred Richardson, Michael Brandstein, Jennifer Melot, and...DNNs trained with real Mixer 2 multichannel data perform only slightly better than DNNs trained with synthetic multichannel data for microphone SR on...Mixer 6. Large reductions in pooled error rates of 50% EER and 30% min DCF are achieved using DNNs trained on real Mixer 2 data. Nearly the same
DOE Office of Scientific and Technical Information (OSTI.GOV)
Not Available
1990-09-01
These conference proceedings have been prepared in support of the US Nuclear Regulatory Commission's Security Training Symposium on "Meeting the Challenge -- Firearms and Explosives Recognition and Detection," November 28 through 30, 1989, in Bethesda, Maryland. This document contains the edited transcripts of the guest speakers. It also contains some of the speakers' formal papers that were distributed and some of the slides that were shown at the symposium (Appendix A).
The Search for Common Ground: Part I. Lexical Performance by Linguistically Diverse Learners.
ERIC Educational Resources Information Center
Windsor, Jennifer; Kohnert, Kathryn
2004-01-01
This study examines lexical performance by 3 groups of linguistically diverse school-age learners: English-only speakers with primary language impairment (LI), typical English-only speakers (EO), and typical bilingual Spanish-English speakers (BI). The accuracy and response time (RT) of 100 8- to 13-year-old children in word recognition and…
Phoneme Error Pattern by Heritage Speakers of Spanish on an English Word Recognition Test.
Shi, Lu-Feng
2017-04-01
Heritage speakers acquire their native language from home use in their early childhood. As the native language is typically a minority language in the society, these individuals receive their formal education in the majority language and eventually develop greater competency with the majority than their native language. To date, there have not been specific research attempts to understand word recognition by heritage speakers. It is not clear if and to what degree we may infer from evidence based on bilingual listeners in general. This preliminary study investigated how heritage speakers of Spanish perform on an English word recognition test and analyzed their phoneme errors. A prospective, cross-sectional, observational design was employed. Twelve normal-hearing adult Spanish heritage speakers (four men, eight women, 20-38 yr old) participated in the study. Their language background was obtained through the Language Experience and Proficiency Questionnaire. Nine English monolingual listeners (three men, six women, 20-41 yr old) were also included for comparison purposes. Listeners were presented with 200 Northwestern University Auditory Test No. 6 words in quiet. They repeated each word orally and in writing. Their responses were scored by word, word-initial consonant, vowel, and word-final consonant. Performance was compared between groups with Student's t test or analysis of variance. Group-specific error patterns were primarily descriptive, but intergroup comparisons were made using 95% or 99% confidence intervals for proportional data. The two groups of listeners yielded comparable scores when their responses were examined by word, vowel, and final consonant. However, heritage speakers of Spanish misidentified significantly more word-initial consonants and had significantly more difficulty with initial /p, b, h/ than their monolingual peers. The two groups yielded similar patterns for vowel and word-final consonants, but heritage speakers made significantly fewer errors with /e/ and more errors with word-final /p, k/. Data reported in the present study lead to a twofold conclusion. On the one hand, normal-hearing heritage speakers of Spanish may misidentify English phonemes in patterns different from those of English monolingual listeners. Not all phoneme errors can be readily understood by comparing Spanish and English phonology, suggesting that Spanish heritage speakers differ in performance from other Spanish-English bilingual listeners. On the other hand, the absolute number of errors and the error pattern of most phonemes were comparable between English monolingual listeners and Spanish heritage speakers, suggesting that audiologists may assess word recognition in quiet in the same way for these two groups of listeners, if diagnosis is based on words, not phonemes. American Academy of Audiology
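For illustration of the statistics mentioned above (intergroup comparisons using 95% or 99% confidence intervals for proportional data), a minimal sketch of the Wilson score interval is given below; the error counts are hypothetical and not taken from the study.

```python
import math

def wilson_interval(errors, trials, z=1.96):
    """Wilson score confidence interval for a proportion (z=1.96 ~ 95%, z=2.576 ~ 99%)."""
    p = errors / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return centre - half, centre + half

# Hypothetical counts: word-initial consonant errors out of 200 items per listener group.
heritage_ci = wilson_interval(errors=34, trials=200)
monolingual_ci = wilson_interval(errors=18, trials=200)
print(heritage_ci, monolingual_ci)  # non-overlapping intervals would suggest a group difference
```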
NASA Astrophysics Data System (ADS)
Kuroki, Hayato; Ino, Shuichi; Nakano, Satoko; Hori, Kotaro; Ifukube, Tohru
The authors of this paper have been studying a real-time speech-to-caption system using speech recognition technology with a "repeat-speaking" method. In this system, they used a "repeat-speaker" who listens to a lecturer's voice and then speaks the lecturer's utterances back into a speech recognition computer. The resulting system showed a caption accuracy of about 97% for Japanese-Japanese conversion and a voice-to-caption conversion time of about 4 seconds for English-English conversion in some international conferences, although achieving this level of performance was costly. In human communication, speech understanding depends not only on verbal information but also on non-verbal information such as the speaker's gestures and face and mouth movements. The authors therefore proposed briefly storing the information in a computer and then displaying the captions and the speaker's face movement images in a suitable way to achieve higher comprehension. In this paper, we investigate the relationship between the display sequence and display timing of captions that contain speech recognition errors and the speaker's face movement images. The results show that the sequence "to display the caption before the speaker's face image" improves comprehension of the captions. The sequence "to display both simultaneously" shows an improvement of only a few percent over the question sentence alone, and the sequence "to display the speaker's face image before the caption" shows almost no change. In addition, the sequence "to display the caption 1 second before the speaker's face image" shows the most significant improvement of all the conditions.
NASA Astrophysics Data System (ADS)
Yellen, H. W.
1983-03-01
Literature pertaining to Voice Recognition abounds with information relevant to the assessment of transitory speech recognition devices. In the past, engineering requirements have dictated the path this technology followed. But other factors do exist that influence recognition accuracy. This thesis explores the impact of Human Factors on the successful recognition of speech, principally addressing the differences or variability among users. A Threshold Technology T-600 was used with a 100-utterance vocabulary to test 44 subjects. A statistical analysis was conducted on 5 generic categories of Human Factors: Occupational, Operational, Psychological, Physiological and Personal. How the equipment is trained and the experience level of the speaker were found to be key characteristics influencing recognition accuracy. To a lesser extent, computer experience, time of week, accent, vital capacity and rate of air flow, speaker cooperativeness and anxiety were found to affect overall error rates.
Shattuck-Hufnagel, S.; Choi, J. Y.; Moro-Velázquez, L.; Gómez-García, J. A.
2017-01-01
Although a large number of acoustic indicators have already been proposed in the literature to evaluate the hypokinetic dysarthria of people with Parkinson's Disease, the goal of this work is to identify and interpret new, reliable and complementary articulatory biomarkers that could be applied to predict/evaluate Parkinson's Disease from a diadochokinetic test, contributing to the possibility of a further multidimensional analysis of the speech of parkinsonian patients. The new biomarkers proposed are based on the kinetic behaviour of the envelope trace, which is directly linked with the articulatory dysfunctions introduced by the disease from its early stages. The interest of these new articulatory indicators lies in their ease of identification and interpretation, and their potential to be translated into computer-based automatic methods to screen the disease from the speech. Throughout this paper, the accuracy provided by these acoustic kinetic biomarkers is compared with that obtained with a baseline system based on speaker identification techniques. Results show accuracies around 85% that are in line with those obtained with complex state-of-the-art speaker recognition techniques, but with an easier physical interpretation, which opens the possibility of transfer to a clinical setting. PMID:29240814
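The biomarkers described here are derived from the kinetic behaviour of the speech envelope during a diadochokinetic (e.g., /pa-ta-ka/) task. The following sketch shows one common way to obtain such an envelope and a simple kinetic measure (its derivative); the file name, sampling parameters, and smoothing choices are assumptions, not the authors' exact procedure.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt
from scipy.io import wavfile

fs, x = wavfile.read("ddk_pataka.wav")          # hypothetical recording of a /pa-ta-ka/ task
x = x.astype(float) / np.max(np.abs(x))

# Amplitude envelope via the analytic signal, then low-pass smoothed at ~20 Hz.
envelope = np.abs(hilbert(x))
b, a = butter(4, 20 / (fs / 2), btype="low")
envelope = filtfilt(b, a, envelope)

# A simple kinetic descriptor: the envelope velocity (first derivative).
velocity = np.gradient(envelope) * fs
print("peak envelope velocity:", np.max(np.abs(velocity)))
```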
Speaker Verification Using SVM
2010-11-01
application the required resources are provided by the phone itself. Speaker recognition can be used in many areas, like: • homeland security: airport ... security, strengthening the national borders, in travel documents, visas; • enterprise-wide network security infrastructures; • secure electronic
NASA Astrophysics Data System (ADS)
Karam, Walid; Mokbel, Chafic; Greige, Hanna; Chollet, Gerard
2006-05-01
A GMM-based audio-visual speaker verification system is described, and an Active Appearance Model with a linear speaker transformation system is used to evaluate the robustness of the verification. An Active Appearance Model (AAM) is used to automatically locate and track a speaker's face in a video recording. A Gaussian Mixture Model (GMM) based classifier (BECARS) is used for face verification. GMM training and testing is accomplished on DCT-based features extracted from the detected faces. On the audio side, speech features are extracted and used for speaker verification with the GMM-based classifier. Fusion of both audio and video modalities for audio-visual speaker verification is compared with face verification and speaker verification systems. To improve the robustness of the multimodal biometric identity verification system, an audio-visual imposture system is envisioned. It consists of an automatic voice transformation technique that an impostor may use to assume the identity of an authorized client. Features of the transformed voice are then combined with the corresponding appearance features and fed into the GMM-based system BECARS for training. An attempt is made to increase the acceptance rate of the impostor and to analyze the robustness of the verification system. Experiments are being conducted on the BANCA database, with a prospect of experimenting on the PDAtabase newly developed within the scope of the SecurePhone project.
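As a generic, simplified illustration of GMM-based verification of the kind described above (not the BECARS implementation itself), the sketch below trains a client model and a background model with scikit-learn and accepts a test segment when the average log-likelihood ratio exceeds a threshold; the feature arrays, model sizes, and threshold are placeholder assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder features: rows are frames, columns are DCT or cepstral coefficients.
client_feats = np.random.randn(500, 20)       # enrolment data for the claimed identity
background_feats = np.random.randn(5000, 20)  # data from many other speakers/faces
test_feats = np.random.randn(300, 20)         # features of the segment to verify

client_gmm = GaussianMixture(n_components=16, covariance_type="diag").fit(client_feats)
ubm = GaussianMixture(n_components=16, covariance_type="diag").fit(background_feats)

# Average per-frame log-likelihood ratio between client model and background model.
llr = client_gmm.score(test_feats) - ubm.score(test_feats)
THRESHOLD = 0.0   # would be tuned on development data
print("accept" if llr > THRESHOLD else "reject", llr)
```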
Memory for syntax despite amnesia.
Ferreira, Victor S; Bock, Kathryn; Wilson, Michael P; Cohen, Neal J
2008-09-01
Syntactic persistence is a tendency for speakers to reproduce sentence structures independently of accompanying meanings, words, or sounds. The memory mechanisms behind syntactic persistence are not fully understood. Although some properties of syntactic persistence suggest a role for procedural memory, current evidence suggests that procedural memory (unlike declarative memory) does not maintain the abstract, relational features that are inherent to syntactic structures. In a study evaluating the contribution of procedural memory to syntactic persistence, patients with anterograde amnesia and matched control speakers reproduced prime sentences with different syntactic structures; reproduced 0, 1, 6, or 10 neutral sentences; then spontaneously described pictures that elicited the primed structures; and finally made recognition judgments for the prime sentences. Amnesic and control speakers showed significant and equivalent syntactic persistence, despite the amnesic speakers' profoundly impaired recognition memory for the primes. Thus, syntax is maintained by procedural-memory mechanisms. This result reveals that procedural memory is capable of supporting abstract, relational knowledge.
NASA Astrophysics Data System (ADS)
Ghoraani, Behnaz; Krishnan, Sridhar
2009-12-01
The number of people affected by speech problems is increasing as the modern world places increasing demands on the human voice via mobile telephones, voice recognition software, and interpersonal verbal communications. In this paper, we propose a novel methodology for automatic pattern classification of pathological voices. The main contribution of this paper is extraction of meaningful and unique features using Adaptive time-frequency distribution (TFD) and nonnegative matrix factorization (NMF). We construct Adaptive TFD as an effective signal analysis domain to dynamically track the nonstationarity in the speech and utilize NMF as a matrix decomposition (MD) technique to quantify the constructed TFD. The proposed method extracts meaningful and unique features from the joint TFD of the speech, and automatically identifies and measures the abnormality of the signal. Depending on the abnormality measure of each signal, we classify the signal into normal or pathological. The proposed method is applied on the Massachusetts Eye and Ear Infirmary (MEEI) voice disorders database which consists of 161 pathological and 51 normal speakers, and an overall classification accuracy of 98.6% was achieved.
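A minimal stand-in for the TFD-plus-NMF feature extraction described above, using an ordinary magnitude spectrogram in place of the adaptive TFD, might look like the following; the signal, frame sizes, and number of components are illustrative assumptions, not the authors' settings.

```python
import numpy as np
from scipy.signal import stft
from sklearn.decomposition import NMF

fs = 16000
x = np.random.randn(fs * 2)                 # placeholder for a voice recording

# Nonnegative time-frequency representation (magnitude STFT instead of the adaptive TFD).
_, _, Z = stft(x, fs=fs, nperseg=512, noverlap=256)
V = np.abs(Z)

# Factorize V ~ W @ H: W holds spectral bases, H their activations over time.
model = NMF(n_components=5, init="nndsvd", max_iter=500)
W = model.fit_transform(V)
H = model.components_

# Simple per-recording features: statistics of the activation rows.
features = np.hstack([H.mean(axis=1), H.std(axis=1)])
print(features.shape)   # would feed a normal/pathological classifier
```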
CNN: a speaker recognition system using a cascaded neural network.
Zaki, M; Ghalwash, A; Elkouny, A A
1996-05-01
The main emphasis of this paper is to present an approach for combining supervised and unsupervised neural network models for speaker recognition. To enhance the overall operation and performance of recognition, the proposed strategy integrates the two techniques, forming one global model called the cascaded model. We first present a simple conventional technique based on the distance measured between a test vector and a reference vector for different speakers in the population. This particular distance metric has the property of weighting down the components in those directions along which the intraspeaker variance is large. The reason for presenting this method is to clarify the discrepancy in performance between the conventional and neural network approaches. We then introduce the idea of using an unsupervised learning technique, represented by the winner-take-all model, as a means of recognition. Based on several tests that have been conducted, and in order to enhance the performance of this model when dealing with noisy patterns, we have preceded it with a supervised learning model--the pattern association model--which acts as a filtration stage. This work includes both the design and implementation of the conventional and neural network approaches to recognize the speakers' templates, which are introduced to the system via a voice master card and preprocessed before extracting the features used in recognition. The conclusion indicates that the system performance in the case of the neural network is better than that of the conventional approach, achieving a smooth degradation with respect to noisy patterns and higher performance with respect to noise-free patterns.
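The conventional baseline above uses a distance that weights down components along directions of large intra-speaker variance. A minimal sketch of such a variance-weighted distance is shown below; the reference templates, variances, and test vector are hypothetical.

```python
import numpy as np

def weighted_distance(test_vec, reference_vec, intra_speaker_var):
    """Squared distance with each dimension down-weighted by the within-speaker variance."""
    return np.sum((test_vec - reference_vec) ** 2 / intra_speaker_var)

# Hypothetical data: per-speaker reference templates and a pooled within-speaker variance.
rng = np.random.default_rng(0)
references = {name: rng.normal(size=12) for name in ["spk_a", "spk_b", "spk_c"]}
intra_var = rng.uniform(0.5, 2.0, size=12)
test = references["spk_b"] + rng.normal(scale=0.1, size=12)

scores = {name: weighted_distance(test, ref, intra_var) for name, ref in references.items()}
print(min(scores, key=scores.get))   # nearest template under the weighted metric
```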
Smits-Bandstra, Sarah; De Nil, Luc F
2007-01-01
The basal ganglia and cortico-striato-thalamo-cortical connections are known to play a critical role in sequence skill learning and increasing automaticity over practice. The current paper reviews four studies comparing the sequence skill learning and the transition to automaticity of persons who stutter (PWS) and fluent speakers (PNS) over practice. Studies One and Two found PWS to have poor finger tap sequencing skill and nonsense syllable sequencing skill after practice, and on retention and transfer tests relative to PNS. Studies Three and Four found PWS to be significantly less accurate and/or significantly slower after practice on dual tasks requiring concurrent sequencing and colour recognition over practice relative to PNS. Evidence of PWS' deficits in sequence skill learning and automaticity development support the hypothesis that dysfunction in cortico-striato-thalamo-cortical connections may be one etiological component in the development and maintenance of stuttering. As a result of this activity, the reader will: (1) be able to articulate the research regarding the basal ganglia system relating to sequence skill learning; (2) be able to summarize the research on stuttering with indications of sequence skill learning deficits; and (3) be able to discuss basal ganglia mechanisms with relevance for theory of stuttering.
Second Language Learners and Speech Act Comprehension
ERIC Educational Resources Information Center
Holtgraves, Thomas
2007-01-01
Recognizing the specific speech act ( Searle, 1969) that a speaker performs with an utterance is a fundamental feature of pragmatic competence. Past research has demonstrated that native speakers of English automatically recognize speech acts when they comprehend utterances (Holtgraves & Ashley, 2001). The present research examined whether this…
Advances in audio source separation and multisource audio content retrieval
NASA Astrophysics Data System (ADS)
Vincent, Emmanuel
2012-06-01
Audio source separation aims to extract the signals of individual sound sources from a given recording. In this paper, we review three recent advances which improve the robustness of source separation in real-world challenging scenarios and enable its use for multisource content retrieval tasks, such as automatic speech recognition (ASR) or acoustic event detection (AED) in noisy environments. We present a Flexible Audio Source Separation Toolkit (FASST) and discuss its advantages compared to earlier approaches such as independent component analysis (ICA) and sparse component analysis (SCA). We explain how cues as diverse as harmonicity, spectral envelope, temporal fine structure or spatial location can be jointly exploited by this toolkit. We subsequently present the uncertainty decoding (UD) framework for the integration of audio source separation and audio content retrieval. We show how the uncertainty about the separated source signals can be accurately estimated and propagated to the features. Finally, we explain how this uncertainty can be efficiently exploited by a classifier, both at the training and the decoding stage. We illustrate the resulting performance improvements in terms of speech separation quality and speaker recognition accuracy.
NASA Astrophysics Data System (ADS)
Adhi Pradana, Wisnu; Adiwijaya; Novia Wisesty, Untari
2018-03-01
Support Vector Machine, commonly called SVM, is one method that can be used for data classification. SVM separates data from 2 different classes with a hyperplane. In this study, a system was built using SVM to develop Arabic speech recognition. In the development of the system, 2 kinds of speakers were tested: dependent speakers and independent speakers. The results from this system are an accuracy of 85.32% for speaker-dependent and 61.16% for speaker-independent recognition.
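As a rough sketch of the SVM classification step only (the acoustic features, labels, and kernel settings below are placeholders, not the authors' Arabic speech data or configuration), a scikit-learn version might look like this:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder acoustic feature vectors (e.g., MFCC statistics per utterance) and word labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 39))
y = rng.integers(0, 10, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

For a speaker-dependent setup the same speaker's utterances appear in both splits, whereas a speaker-independent setup would hold out all utterances of the test speakers.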
"Who" is saying "what"? Brain-based decoding of human voice and speech.
Formisano, Elia; De Martino, Federico; Bonte, Milene; Goebel, Rainer
2008-11-07
Can we decipher speech content ("what" is being said) and speaker identity ("who" is saying it) from observations of brain activity of a listener? Here, we combine functional magnetic resonance imaging with a data-mining algorithm and retrieve what and whom a person is listening to from the neural fingerprints that speech and voice signals elicit in the listener's auditory cortex. These cortical fingerprints are spatially distributed and insensitive to acoustic variations of the input so as to permit the brain-based recognition of learned speech from unknown speakers and of learned voices from previously unheard utterances. Our findings unravel the detailed cortical layout and computational properties of the neural populations at the basis of human speech recognition and speaker identification.
Semantic Ambiguity Effects in L2 Word Recognition.
Ishida, Tomomi
2018-06-01
The present study examined the ambiguity effects in second language (L2) word recognition. Previous studies on first language (L1) lexical processing have observed that ambiguous words are recognized faster and more accurately than unambiguous words on lexical decision tasks. In this research, L1 and L2 speakers of English were asked whether a letter string on a computer screen was an English word or not. An ambiguity advantage was found for both groups and greater ambiguity effects were found for the non-native speaker group when compared to the native speaker group. The findings imply that the larger ambiguity advantage for L2 processing is due to their slower response time in producing adequate feedback activation from the semantic level to the orthographic level.
Dmitrieva, E S; Gel'man, V Ia
2011-01-01
The listener-distinctive features of recognition of different emotional intonations (positive, negative and neutral) of male and female speakers in the presence or absence of background noise were studied in 49 adults aged 20-79 years. In all the listeners, noise produced the most pronounced decrease in recognition accuracy for the positive emotional intonation ("joy") as compared to other intonations, whereas it did not influence the recognition accuracy of "anger" in 65-79-year-old listeners. Higher recognition rates for noisy signals were observed for emotional intonations expressed by female speakers. Acoustic characteristics of noisy and clear speech signals underlying the perception of emotional prosody were identified for adult listeners of different age and gender.
Fuhrman, Orly; Boroditsky, Lera
2010-11-01
Across cultures people construct spatial representations of time. However, the particular spatial layouts created to represent time may differ across cultures. This paper examines whether people automatically access and use culturally specific spatial representations when reasoning about time. In Experiment 1, we asked Hebrew and English speakers to arrange pictures depicting temporal sequences of natural events, and to point to the hypothesized location of events relative to a reference point. In both tasks, English speakers (who read left to right) arranged temporal sequences to progress from left to right, whereas Hebrew speakers (who read right to left) arranged them from right to left, replicating previous work. In Experiments 2 and 3, we asked the participants to make rapid temporal order judgments about pairs of pictures presented one after the other (i.e., to decide whether the second picture showed a conceptually earlier or later time-point of an event than the first picture). Participants made responses using two adjacent keyboard keys. English speakers were faster to make "earlier" judgments when the "earlier" response needed to be made with the left response key than with the right response key. Hebrew speakers showed exactly the reverse pattern. Asking participants to use a space-time mapping inconsistent with the one suggested by writing direction in their language created interference, suggesting that participants were automatically creating writing-direction consistent spatial representations in the course of their normal temporal reasoning. It appears that people automatically access culturally specific spatial representations when making temporal judgments even in nonlinguistic tasks. Copyright © 2010 Cognitive Science Society, Inc.
Practical automatic Arabic license plate recognition system
NASA Astrophysics Data System (ADS)
Mohammad, Khader; Agaian, Sos; Saleh, Hani
2011-02-01
Since the 1970s, the need for an automatic license plate recognition system, sometimes referred to as an Automatic License Plate Recognition system, has been increasing. A license plate recognition system is an automatic system that is able to recognize a license plate number extracted from image sensors. Specifically, Automatic License Plate Recognition systems are being used in conjunction with various transportation systems in application areas such as law enforcement (e.g. speed limit enforcement) and commercial usages such as parking enforcement, automatic toll payment, private and public entrances, border control, and theft and vandalism control. Vehicle license plate recognition has been intensively studied in many countries. Due to the different types of license plates being used, the requirements of an automatic license plate recognition system differ for each country. Generally, an automatic license plate localization and recognition system is made up of three modules: license plate localization, character segmentation and optical character recognition. This paper presents an Arabic license plate recognition system that is insensitive to character size, font, shape and orientation, with an extremely high accuracy rate. The proposed system is based on a combination of enhancement, license plate localization, morphological processing, and feature vector extraction using the Haar transform. The performance of the system is fast due to classification of alphabet and numerals based on the license plate organization. Experimental results for license plates of two different Arab countries show an average of 99% successful license plate localization and recognition in a total of more than 20 different images captured from a complex outdoor environment. The run times are shorter compared to conventional and many state-of-the-art methods.
Development of equally intelligible Telugu sentence-lists to test speech recognition in noise.
Tanniru, Kishore; Narne, Vijaya Kumar; Jain, Chandni; Konadath, Sreeraj; Singh, Niraj Kumar; Sreenivas, K J Ramadevi; K, Anusha
2017-09-01
To develop sentence lists in the Telugu language for the assessment of speech recognition threshold (SRT) in the presence of background noise, through identification of the mean signal-to-noise ratio required to attain a 50% sentence recognition score (SRTn). This study was conducted in three phases. The first phase involved the selection and recording of Telugu sentences. In the second phase, 20 lists, each consisting of 10 sentences with equal intelligibility, were formulated using a numerical optimisation procedure. In the third phase, the SRTn of the developed lists was estimated using adaptive procedures on individuals with normal hearing. A total of 68 native Telugu speakers with normal hearing participated in the study. Of these, 18 (including the speakers) performed various subjective measures in the first phase, 20 performed sentence/word recognition in noise in the second phase, and 30 participated in the list equivalency procedures in the third phase. In all, 15 lists of comparable difficulty were formulated as test material. The mean SRTn across these lists corresponded to -2.74 (SD = 0.21). The developed sentence lists provided a valid and reliable tool to measure SRTn in Telugu native speakers.
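The SRTn estimation in the third phase relies on an adaptive procedure converging on the 50% recognition point. A minimal one-down/one-up staircase on SNR, with a simulated listener and illustrative step size and starting level (not the study's exact protocol), is sketched below.

```python
import random

def simulated_listener(snr_db, srt_true=-2.7, slope=0.15):
    """Probability of recognizing a sentence at a given SNR (logistic psychometric function)."""
    p = 1.0 / (1.0 + 10 ** (-slope * (snr_db - srt_true)))
    return random.random() < p

def estimate_srtn(n_sentences=30, start_snr=0.0, step_db=2.0):
    snr, track = start_snr, []
    for _ in range(n_sentences):
        track.append(snr)
        # one-down/one-up rule converges on the 50% correct point
        snr += -step_db if simulated_listener(snr) else step_db
    return sum(track[10:]) / len(track[10:])   # average SNR after an initial run-in

print("estimated SRTn (dB SNR):", round(estimate_srtn(), 2))
```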
Current trends in small vocabulary speech recognition for equipment control
NASA Astrophysics Data System (ADS)
Doukas, Nikolaos; Bardis, Nikolaos G.
2017-09-01
Speech recognition systems allow human - machine communication to acquire an intuitive nature that approaches the simplicity of inter - human communication. Small vocabulary speech recognition is a subset of the overall speech recognition problem, where only a small number of words need to be recognized. Speaker independent small vocabulary recognition can find significant applications in field equipment used by military personnel. Such equipment may typically be controlled by a small number of commands that need to be given quickly and accurately, under conditions where delicate manual operations are difficult to achieve. This type of application could hence significantly benefit by the use of robust voice operated control components, as they would facilitate the interaction with their users and render it much more reliable in times of crisis. This paper presents current challenges involved in attaining efficient and robust small vocabulary speech recognition. These challenges concern feature selection, classification techniques, speaker diversity and noise effects. A state machine approach is presented that facilitates the voice guidance of different equipment in a variety of situations.
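To make the closing idea concrete, a toy version of a state machine for voice-guided equipment control is sketched below; the commands, states, and transitions are hypothetical and not taken from the paper.

```python
# Hypothetical command vocabulary and equipment states; unknown words leave the state
# unchanged, mirroring the small-vocabulary, reject-by-default behaviour described above.
TRANSITIONS = {
    ("idle", "power on"): "standby",
    ("standby", "start"): "running",
    ("running", "stop"): "standby",
    ("standby", "power off"): "idle",
}

def control(state, recognized_word):
    return TRANSITIONS.get((state, recognized_word), state)

state = "idle"
for word in ["power on", "start", "open hatch", "stop", "power off"]:  # recognizer output
    state = control(state, word)
    print(f"heard '{word}' -> state '{state}'")
```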
NASA Astrophysics Data System (ADS)
Mosko, J. D.; Stevens, K. N.; Griffin, G. R.
1983-08-01
Acoustical analyses were conducted of words produced by four speakers in a motion stress-inducing situation. The aim of the analyses was to document the kinds of changes that occur in the vocal utterances of speakers who are exposed to motion stress and to comment on the implications of these results for the design and development of voice interactive systems. The speakers differed markedly in the types and magnitudes of the changes that occurred in their speech. For some speakers, the stress-inducing experimental condition caused an increase in fundamental frequency, changes in the pattern of vocal fold vibration, shifts in vowel production and changes in the relative amplitudes of sounds containing turbulence noise. All speakers showed greater variability in the experimental condition than in a more relaxed control situation. The variability was manifested in the acoustical characteristics of individual phonetic elements, particularly in unstressed syllables. The kinds of changes and variability observed serve to emphasize the limitations of speech recognition systems based on template matching of patterns that are stored in the system during a training phase. There is a need for a better understanding of these phonetic modifications and for developing ways of incorporating knowledge about these changes within a speech recognition system.
Retrieving Tract Variables From Acoustics: A Comparison of Different Machine Learning Strategies.
Mitra, Vikramjit; Nam, Hosung; Espy-Wilson, Carol Y; Saltzman, Elliot; Goldstein, Louis
2010-09-13
Many different studies have claimed that articulatory information can be used to improve the performance of automatic speech recognition systems. Unfortunately, such articulatory information is not readily available in typical speaker-listener situations. Consequently, such information has to be estimated from the acoustic signal in a process which is usually termed "speech-inversion." This study aims to propose and compare various machine learning strategies for speech inversion: Trajectory mixture density networks (TMDNs), feedforward artificial neural networks (FF-ANN), support vector regression (SVR), autoregressive artificial neural network (AR-ANN), and distal supervised learning (DSL). Further, using a database generated by the Haskins Laboratories speech production model, we test the claim that information regarding constrictions produced by the distinct organs of the vocal tract (vocal tract variables) is superior to flesh-point information (articulatory pellet trajectories) for the inversion process.
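One of the strategies compared above, support vector regression, maps acoustic features to tract-variable trajectories. A generic scikit-learn sketch of that mapping is shown below; the feature and trajectory arrays are random placeholders rather than the Haskins production-model data.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import mean_squared_error

# Placeholders: acoustic frames (e.g., MFCCs) and the corresponding tract variables
# (e.g., lip aperture, tongue-tip constriction degree) from a production model.
rng = np.random.default_rng(2)
acoustics = rng.normal(size=(2000, 13))
tract_vars = rng.normal(size=(2000, 8))

train, test = slice(0, 1600), slice(1600, 2000)
inverter = MultiOutputRegressor(SVR(kernel="rbf", C=1.0, epsilon=0.1))
inverter.fit(acoustics[train], tract_vars[train])

pred = inverter.predict(acoustics[test])
print("inversion MSE:", mean_squared_error(tract_vars[test], pred))
```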
2002-06-07
Continue to Develop and Refine Emerging Technology • Some of the emerging biometric devices, such as iris scans and facial recognition systems...such as iris scans, facial recognition systems, and speaker verification systems. (976301)
Uskul, Ayse K; Paulmann, Silke; Weick, Mario
2016-02-01
Listeners have to pay close attention to a speaker's tone of voice (prosody) during daily conversations. This is particularly important when trying to infer the emotional state of the speaker. Although a growing body of research has explored how emotions are processed from speech in general, little is known about how psychosocial factors such as social power can shape the perception of vocal emotional attributes. Thus, the present studies explored how social power affects emotional prosody recognition. In a correlational study (Study 1) and an experimental study (Study 2), we show that high power is associated with lower accuracy in emotional prosody recognition than low power. These results, for the first time, suggest that individuals experiencing high or low power perceive emotional tone of voice differently. (c) 2016 APA, all rights reserved.
Connected word recognition using a cascaded neuro-computational model
NASA Astrophysics Data System (ADS)
Hoya, Tetsuya; van Leeuwen, Cees
2016-10-01
We propose a novel framework for processing a continuous speech stream that contains a varying number of words, as well as non-speech periods. Speech samples are segmented into word-tokens and non-speech periods. An augmented version of an earlier-proposed, cascaded neuro-computational model is used for recognising individual words within the stream. Simulation studies using both a multi-speaker-dependent and speaker-independent digit string database show that the proposed method yields a recognition performance comparable to that obtained by a benchmark approach using hidden Markov models with embedded training.
Zäske, Romi; Awwad Shiekh Hasan, Bashar; Belin, Pascal
2017-09-01
Listeners can recognize newly learned voices from previously unheard utterances, suggesting the acquisition of high-level speech-invariant voice representations during learning. Using functional magnetic resonance imaging (fMRI) we investigated the anatomical basis underlying the acquisition of voice representations for unfamiliar speakers independent of speech, and their subsequent recognition among novel voices. Specifically, listeners studied voices of unfamiliar speakers uttering short sentences and subsequently classified studied and novel voices as "old" or "new" in a recognition test. To investigate "pure" voice learning, i.e., independent of sentence meaning, we presented German sentence stimuli to non-German speaking listeners. To disentangle stimulus-invariant and stimulus-dependent learning, during the test phase we contrasted a "same sentence" condition in which listeners heard speakers repeating the sentences from the preceding study phase, with a "different sentence" condition. Voice recognition performance was above chance in both conditions although, as expected, performance was higher for same than for different sentences. During study phases activity in the left inferior frontal gyrus (IFG) was related to subsequent voice recognition performance and same versus different sentence condition, suggesting an involvement of the left IFG in the interactive processing of speaker and speech information during learning. Importantly, at test reduced activation for voices correctly classified as "old" compared to "new" emerged in a network of brain areas including temporal voice areas (TVAs) of the right posterior superior temporal gyrus (pSTG), as well as the right inferior/middle frontal gyrus (IFG/MFG), the right medial frontal gyrus, and the left caudate. This effect of voice novelty did not interact with sentence condition, suggesting a role of temporal voice-selective areas and extra-temporal areas in the explicit recognition of learned voice identity, independent of speech content. Copyright © 2017 Elsevier Ltd. All rights reserved.
Chun, Audrey; Reinhardt, Joann P; Ramirez, Mildred; Ellis, Julie M; Silver, Stephanie; Burack, Orah; Eimicke, Joseph P; Cimarolli, Verena; Teresi, Jeanne A
2017-12-01
To examine agreement between Minimum Data Set clinician ratings and researcher assessments of depression among ethnically diverse nursing home residents using the 9-item Patient Health Questionnaire. Although depression is common among nursing homes residents, its recognition remains a challenge. Observational baseline data from a longitudinal intervention study. Sample of 155 residents from 12 long-term care units in one US facility; 50 were interviewed in Spanish. Convergence between clinician and researcher ratings was examined for (i) self-report capacity, (ii) suicidal ideation, (iii) at least moderate depression, (iv) Patient Health Questionnaire severity scores. Experiences by clinical raters using the depression assessment were analysed. The intraclass correlation coefficient was used to examine concordance and Cohen's kappa to examine agreement between clinicians and researchers. Moderate agreement (κ = 0.52) was observed in determination of capacity and poor to fair agreement in reporting suicidal ideation (κ = 0.10-0.37) across time intervals. Poor agreement was observed in classification of at least moderate depression (κ = -0.02 to 0.24), lower than the maximum kappa obtainable (0.58-0.85). Eight assessors indicated problems assessing Spanish-speaking residents. Among Spanish speakers, researchers identified 16% with Patient Health Questionnaire scores of 10 or greater, and 14% with thoughts of self-harm whilst clinicians identified 6% and 0%, respectively. This study advances the field of depression recognition in long-term care by identification of possible challenges in assessing Spanish speakers. Use of the Patient Health Questionnaire requires further investigation, particularly among non-English speakers. Depression screening for ethnically diverse nursing home residents is required, as underreporting of depression and suicidal ideation among Spanish speakers may result in lack of depression recognition and referral for evaluation and treatment. Training in depression recognition is imperative to improve the recognition, evaluation and treatment of depression in older people living in nursing homes. © 2017 John Wiley & Sons Ltd.
Gender Differences in the Recognition of Vocal Emotions
Lausen, Adi; Schacht, Annekathrin
2018-01-01
The conflicting findings from the few studies conducted with regard to gender differences in the recognition of vocal expressions of emotion have left the exact nature of these differences unclear. Several investigators have argued that a comprehensive understanding of gender differences in vocal emotion recognition can only be achieved by replicating these studies while accounting for influential factors such as stimulus type, gender-balanced samples, number of encoders, decoders, and emotional categories. This study aimed to account for these factors by investigating whether emotion recognition from vocal expressions differs as a function of both listeners' and speakers' gender. A total of N = 290 participants were randomly and equally allocated to two groups. One group listened to words and pseudo-words, while the other group listened to sentences and affect bursts. Participants were asked to categorize the stimuli with respect to the expressed emotions in a fixed-choice response format. Overall, females were more accurate than males when decoding vocal emotions, however, when testing for specific emotions these differences were small in magnitude. Speakers' gender had a significant impact on how listeners' judged emotions from the voice. The group listening to words and pseudo-words had higher identification rates for emotions spoken by male than by female actors, whereas in the group listening to sentences and affect bursts the identification rates were higher when emotions were uttered by female than male actors. The mixed pattern for emotion-specific effects, however, indicates that, in the vocal channel, the reliability of emotion judgments is not systematically influenced by speakers' gender and the related stereotypes of emotional expressivity. Together, these results extend previous findings by showing effects of listeners' and speakers' gender on the recognition of vocal emotions. They stress the importance of distinguishing these factors to explain recognition ability in the processing of emotional prosody. PMID:29922202
Agarwalla, Swapna; Sarma, Kandarpa Kumar
2016-06-01
Automatic Speaker Recognition (ASR) and related issues are continuously evolving as inseparable elements of Human Computer Interaction (HCI). With the assimilation of emerging concepts like big data and Internet of Things (IoT) as extended elements of HCI, ASR techniques are found to be passing through a paradigm shift. Of late, learning-based techniques have started to receive greater attention from research communities related to ASR, owing to the fact that the former possess a natural ability to mimic biological behavior and in that way aid ASR modeling and processing. The current learning-based ASR techniques are found to be evolving further with the incorporation of big data and IoT-like concepts. Here, in this paper, we report certain approaches based on machine learning (ML) used for extraction of relevant samples from big data space and apply them for ASR using certain soft computing techniques for Assamese speech with dialectal variations. A class of ML techniques comprising the basic Artificial Neural Network (ANN) in feedforward (FF) and Deep Neural Network (DNN) forms using raw speech, extracted features and frequency domain forms are considered. The Multi Layer Perceptron (MLP) is configured with inputs in several forms to learn class information obtained using clustering and manual labeling. DNNs are also used to extract specific sentence types. Initially, from a large storage, relevant samples are selected and assimilated. Next, a few conventional methods are used for feature extraction of a few selected types. The features comprise both spectral and prosodic types. These are applied to Recurrent Neural Network (RNN) and Fully Focused Time Delay Neural Network (FFTDNN) structures to evaluate their performance in recognizing mood, dialect, speaker and gender variations in dialectal Assamese speech. The system is tested under several background noise conditions by considering the recognition rates (obtained using confusion matrices and manually) and computation time. It is found that the proposed ML-based sentence extraction techniques and the composite feature set used with RNN as classifier outperform all other approaches. By using ANN in FF form as feature extractor, the performance of the system is evaluated and a comparison is made. Experimental results show that the application of big data samples has enhanced the learning of the ASR system. Further, the ANN-based sample and feature extraction techniques are found to be efficient enough to enable application of ML techniques in big data aspects as part of ASR systems. Copyright © 2015 Elsevier Ltd. All rights reserved.
Military applications of automatic speech recognition and future requirements
NASA Technical Reports Server (NTRS)
Beek, Bruno; Cupples, Edward J.
1977-01-01
An updated summary of the state-of-the-art of automatic speech recognition and its relevance to military applications is provided. A number of potential systems for military applications are under development. These include: (1) digital narrowband communication systems; (2) automatic speech verification; (3) on-line cartographic processing unit; (4) word recognition for militarized tactical data system; and (5) voice recognition and synthesis for aircraft cockpit.
ICPR-2016 - International Conference on Pattern Recognition
Invited talk "Learning for Scene Understanding"; ICPR2016 paper awards, including the Piero Zamperoni Best Student Paper for "...-Paced Dictionary Learning for Cross-Domain Retrieval and Recognition" (Xu, Dan; Song, Jingkuan; Alameda...); discussions on recent advances in the fields of Pattern Recognition, Machine Learning and Computer Vision.
NASA Astrophysics Data System (ADS)
Tovarek, Jaromir; Partila, Pavol
2017-05-01
This article discusses speaker identification for improving the security of communication between law enforcement units. The main task of this research was to develop a text-independent speaker identification system which can be used for real-time recognition. This system is designed for identification in the open set, meaning the unknown speaker can be anyone. Communication itself is secured, but we have to check the authorization of the communicating parties; we have to decide whether the unknown speaker is authorized for the given action. The calls are recorded by an IP telephony server and these recordings are then evaluated using classification. If the system determines that the speaker is not authorized, it sends a warning message to the administrator. This message can reveal, for example, a stolen phone or another unusual situation. The administrator then performs the appropriate actions. Our proposed system uses a multilayer neural network for classification, consisting of three layers (input layer, hidden layer, and output layer). The number of neurons in the input layer corresponds to the length of the speech features, and the output layer represents the classified speakers. The artificial neural network classifies the speech signal frame by frame, but the final decision is made over the complete recording; this rule substantially increases the accuracy of the classification. Input data for the neural network are thirteen Mel-frequency cepstral coefficients, which describe the behavior of the vocal tract; these parameters are among the most widely used for speaker recognition. Parameters for training, testing and validation were extracted from recordings of authorized users. Recording conditions for the training data correspond with the real traffic of the system (sampling frequency, bit rate). The main benefit of the research is a system for text-independent speaker identification applied to secure communication between law enforcement units.
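A rough sketch of the pipeline described above (13 MFCCs per frame, a small MLP, and a decision over the whole recording by majority vote) is given below; file names, network size, sampling rate, and training details are assumptions, and librosa/scikit-learn stand in for whatever tools the authors actually used.

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def mfcc_frames(path):
    y, sr = librosa.load(path, sr=8000)                    # telephony-like sampling rate
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # frames x 13 coefficients

# Hypothetical enrolment recordings of authorized users.
enrol = {"officer_a": "a_enrol.wav", "officer_b": "b_enrol.wav"}
X_list, y_list = [], []
for speaker, path in enrol.items():
    feats = mfcc_frames(path)
    X_list.append(feats)
    y_list.extend([speaker] * len(feats))
X, y = np.vstack(X_list), np.array(y_list)

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(X, y)

# Decision over a complete call: majority vote across per-frame classifications.
test_frames = mfcc_frames("unknown_call.wav")
frame_labels = clf.predict(test_frames)
speakers, counts = np.unique(frame_labels, return_counts=True)
print("identified speaker:", speakers[np.argmax(counts)])
```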
Unsupervised real-time speaker identification for daily movies
NASA Astrophysics Data System (ADS)
Li, Ying; Kuo, C.-C. Jay
2002-07-01
The problem of identifying speakers for movie content analysis is addressed in this paper. While most previous work on speaker identification was carried out in a supervised mode using pure audio data, more robust results can be obtained in real-time by integrating knowledge from multiple media sources in an unsupervised mode. In this work, both audio and visual cues will be employed and subsequently combined in a probabilistic framework to identify speakers. Particularly, audio information is used to identify speakers with a maximum likelihood (ML)-based approach while visual information is adopted to distinguish speakers by detecting and recognizing their talking faces based on face detection/recognition and mouth tracking techniques. Moreover, to accommodate for speakers' acoustic variations along time, we update their models on the fly by adapting to their newly contributed speech data. Encouraging results have been achieved through extensive experiments, which shows a promising future of the proposed audiovisual-based unsupervised speaker identification system.
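A minimal late-fusion sketch of the probabilistic combination described above, assuming per-speaker log-likelihoods are already available from the audio and visual modules, is shown below; the scores and the fusion weight are hypothetical.

```python
def fuse_and_identify(audio_loglik, visual_loglik, audio_weight=0.6):
    """Weighted sum of per-speaker log-likelihoods from the two modalities."""
    fused = {spk: audio_weight * audio_loglik[spk]
                  + (1.0 - audio_weight) * visual_loglik[spk]
             for spk in audio_loglik}
    return max(fused, key=fused.get), fused

# Hypothetical per-speaker log-likelihoods for one movie segment.
audio = {"alice": -210.4, "bob": -198.7, "carol": -225.1}
visual = {"alice": -95.2, "bob": -102.8, "carol": -99.0}
print(fuse_and_identify(audio, visual))
```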
Speakers of Different Languages Process the Visual World Differently
Chabal, Sarah; Marian, Viorica
2015-01-01
Language and vision are highly interactive. Here we show that people activate language when they perceive the visual world, and that this language information impacts how speakers of different languages focus their attention. For example, when searching for an item (e.g., clock) in the same visual display, English and Spanish speakers look at different objects. Whereas English speakers searching for the clock also look at a cloud, Spanish speakers searching for the clock also look at a gift, because the Spanish names for gift (regalo) and clock (reloj) overlap phonologically. These different looking patterns emerge despite an absence of direct linguistic input, showing that language is automatically activated by visual scene processing. We conclude that the varying linguistic information available to speakers of different languages affects visual perception, leading to differences in how the visual world is processed. PMID:26030171
Do Listeners Store in Memory a Speaker's Habitual Utterance-Final Phonation Type?
Bőhm, Tamás; Shattuck-Hufnagel, Stefanie
2009-01-01
Earlier studies report systematic differences across speakers in the occurrence of utterance-final irregular phonation; the work reported here investigated whether human listeners remember this speaker-specific information and can access it when necessary (a prerequisite for using this cue in speaker recognition). Listeners personally familiar with the voices of the speakers were presented with pairs of speech samples: one with the original and the other with transformed final phonation type. Asked to select the member of the pair that was closer to the talker's voice, most listeners tended to choose the unmanipulated token (even though they judged them to sound essentially equally natural). This suggests that utterance-final pitch period irregularity is part of the mental representation of individual speaker voices, although this may depend on the individual speaker and listener to some extent. PMID:19776665
Automatic detection of articulation disorders in children with cleft lip and palate.
Maier, Andreas; Hönig, Florian; Bocklet, Tobias; Nöth, Elmar; Stelzle, Florian; Nkenke, Emeka; Schuster, Maria
2009-11-01
Speech of children with cleft lip and palate (CLP) is sometimes still disordered even after adequate surgical and nonsurgical therapies. Such speech shows complex articulation disorders, which are usually assessed perceptually, consuming time and manpower. Hence, there is a need for an easy to apply and reliable automatic method. To create a reference for an automatic system, speech data of 58 children with CLP were assessed perceptually by experienced speech therapists for characteristic phonetic disorders at the phoneme level. The first part of the article aims to detect such characteristics by a semiautomatic procedure and the second to evaluate a fully automatic, thus simple, procedure. The methods are based on a combination of speech processing algorithms. The semiautomatic method achieves moderate to good agreement (kappa approximately 0.6) for the detection of all phonetic disorders. On a speaker level, significant correlations between the perceptual evaluation and the automatic system of 0.89 are obtained. The fully automatic system yields a correlation on the speaker level of 0.81 to the perceptual evaluation. This correlation is in the range of the inter-rater correlation of the listeners. The automatic speech evaluation is able to detect phonetic disorders at an experts' level without any additional human postprocessing.
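The agreement figures reported above (kappa around 0.6, speaker-level correlations of 0.89 and 0.81) can be computed as in the sketch below; the ratings shown are hypothetical, not the study's data.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from scipy.stats import pearsonr

# Hypothetical phoneme-level presence/absence of a disorder: expert vs automatic system.
expert = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0])
automatic = np.array([1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0])
print("kappa:", cohen_kappa_score(expert, automatic))

# Hypothetical speaker-level severity scores: perceptual rating vs automatic score.
perceptual = np.array([2.0, 1.5, 3.0, 0.5, 2.5, 1.0])
auto_score = np.array([1.8, 1.6, 2.7, 0.7, 2.9, 1.2])
print("speaker-level correlation:", pearsonr(perceptual, auto_score)[0])
```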
L2 Word Recognition: Influence of L1 Orthography on Multi-syllabic Word Recognition.
Hamada, Megumi
2017-10-01
L2 reading research suggests that L1 orthographic experience influences L2 word recognition. Nevertheless, the findings on multi-syllabic words in English are still limited despite the fact that a vast majority of words are multi-syllabic. The study investigated whether L1 orthography influences the recognition of multi-syllabic words, focusing on the position of an embedded word. The participants were Arabic ESL learners, Chinese ESL learners, and native speakers of English. The task was a word search task, in which the participants identified a target word embedded in a pseudoword at the initial, middle, or final position. The search accuracy and speed indicated that all groups showed a strong preference for the initial position. The accuracy data further indicated group differences. The Arabic group showed higher accuracy in the final than middle, while the Chinese group showed the opposite and the native speakers showed no difference between the two positions. The findings suggest that L2 multi-syllabic word recognition involves unique processes.
Revis, J; Galant, C; Fredouille, C; Ghio, A; Giovanni, A
2012-01-01
Widely studied in terms of perception, acoustics or aerodynamics, dysphonia nevertheless remains a speech phenomenon, closely related to the phonetic composition of the message conveyed by the voice. In this paper, we present a series of three works with the aim of understanding the implications of the phonetic manifestation of dysphonia. Our first study proposes a new approach to the perceptual analysis of dysphonia (phonetic labeling), whose principle is to listen to and evaluate each phoneme in a sentence separately. This study confirms Laver's hypothesis that dysphonia is not a constant noise added to the speech signal, but a discontinuous phenomenon occurring on certain phonemes, depending on the phonetic context. However, the burden of executing the task has led us to look to the techniques of automatic speaker recognition (ASR) to automate the procedure. With the collaboration of the LIA, we have developed a system for automatic classification of dysphonia based on the techniques of ASR. This is the subject of our second study. The first results obtained with this system suggest that the unvoiced consonants show predominant performance in the task of automatic classification of dysphonia. This result is surprising since it is often assumed that dysphonia affects only laryngeal vibration. We have started looking for explanations of this phenomenon, and we present our assumptions and experiments in the third work.
Discriminative analysis of lip motion features for speaker identification and speech-reading.
Cetingül, H Ertan; Yemez, Yücel; Erzin, Engin; Tekalp, A Murat
2006-10-01
There have been several studies that jointly use audio, lip intensity, and lip geometry information for speaker identification and speech-reading applications. This paper proposes using explicit lip motion information, instead of or in addition to lip intensity and/or geometry information, for speaker identification and speech-reading within a unified feature selection and discrimination analysis framework, and addresses two important issues: 1) Is using explicit lip motion information useful, and, 2) if so, what are the best lip motion features for these two applications? The best lip motion features for speaker identification are considered to be those that result in the highest discrimination of individual speakers in a population, whereas for speech-reading, the best features are those providing the highest phoneme/word/phrase recognition rate. Several lip motion feature candidates have been considered, including dense motion features within a bounding box about the lip, lip contour motion features, and combinations of these with lip shape features. Furthermore, a novel two-stage, spatial, and temporal discrimination analysis is introduced to select the best lip motion features for speaker identification and speech-reading applications. Experimental results using a hidden-Markov-model-based recognition system indicate that using explicit lip motion information provides additional performance gains in both applications, and lip motion features prove more valuable in the case of the speech-reading application.
Support vector machine for automatic pain recognition
NASA Astrophysics Data System (ADS)
Monwar, Md Maruf; Rezaei, Siamak
2009-02-01
Facial expressions are a key index of emotion, and the interpretation of such expressions of emotion is critical to everyday social functioning. In this paper, we present an efficient video analysis technique for recognition of a specific expression, pain, from human faces. We employ an automatic face detector which detects faces in the stored video frames using a skin color modeling technique. For pain recognition, location and shape features of the detected faces are computed. These features are then used as inputs to a support vector machine (SVM) for classification. We compare the results with neural network based and eigenimage based automatic pain recognition systems. The experimental results indicate that using a support vector machine as the classifier can certainly improve the performance of an automatic pain recognition system.
Memory for vocal tempo and pitch.
Boltz, Marilyn G
2017-11-01
Two experiments examined the ability to remember the vocal tempo and pitch of different individuals, and the way this information is encoded into the cognitive system. In both studies, participants engaged in an initial familiarisation phase while attending was systematically directed towards different aspects of speakers' voices. Afterwards, they received a tempo or pitch recognition task. Experiment 1 showed that tempo and pitch are both incidentally encoded into memory at levels comparable to intentional learning, and no performance deficit occurs with divided attending. Experiment 2 examined the ability to recognise pitch or tempo when the two dimensions co-varied and found that the presence of one influenced the other: performance was best when both dimensions were positively correlated with one another. As a set, these findings indicate that pitch and tempo are automatically processed in a holistic, integral fashion [Garner, W. R. (1974). The processing of information and structure. Potomac, MD: Erlbaum.] which has a number of cognitive implications.
The Resolution of Visual Noise in Word Recognition
ERIC Educational Resources Information Center
Pae, Hye K.; Lee, Yong-Won
2015-01-01
This study examined lexical processing in English by native speakers of Korean and Chinese, compared to that of native speakers of English, using normal, alternated, and inverse fonts. Sixty-four adult students participated in a lexical decision task. The findings demonstrated similarities and differences in accuracy and latency among the three L1…
Speakers of different languages process the visual world differently.
Chabal, Sarah; Marian, Viorica
2015-06-01
Language and vision are highly interactive. Here we show that people activate language when they perceive the visual world, and that this language information impacts how speakers of different languages focus their attention. For example, when searching for an item (e.g., clock) in the same visual display, English and Spanish speakers look at different objects. Whereas English speakers searching for the clock also look at a cloud, Spanish speakers searching for the clock also look at a gift, because the Spanish names for gift (regalo) and clock (reloj) overlap phonologically. These different looking patterns emerge despite an absence of direct language input, showing that linguistic information is automatically activated by visual scene processing. We conclude that the varying linguistic information available to speakers of different languages affects visual perception, leading to differences in how the visual world is processed.
ERIC Educational Resources Information Center
Wigmore, Angela; Hunter, Gordon; Pflugel, Eckhard; Denholm-Price, James; Binelli, Vincent
2009-01-01
Speech technology--especially automatic speech recognition--has now advanced to a level where it can be of great benefit both to able-bodied people and those with various disabilities. In this paper we describe an application "TalkMaths" which, using the output from a commonly-used conventional automatic speech recognition system,…
Speaker Recognition by Combining MFCC and Phase Information in Noisy Conditions
NASA Astrophysics Data System (ADS)
Wang, Longbiao; Minami, Kazue; Yamamoto, Kazumasa; Nakagawa, Seiichi
In this paper, we investigate the effectiveness of phase information for speaker recognition in noisy conditions and combine the phase information with mel-frequency cepstral coefficients (MFCCs). To date, almost all speaker recognition methods have been based on MFCCs, even in noisy conditions. MFCCs, which dominantly capture vocal tract information, use only the magnitude of the Fourier transform of time-domain speech frames, and phase information has been ignored. The phase information is expected to complement MFCCs well because it includes rich voice source information. Furthermore, some studies have reported that phase-based features are robust to noise. In our previous study, a phase information extraction method was proposed that normalizes the variation in the phase caused by the clipping position of the input speech, and the performance of the combination of the phase information and MFCCs was remarkably better than that of MFCCs alone. In this paper, we evaluate the robustness of the proposed phase information for speaker identification in noisy conditions. Spectral subtraction, a method that skips frames with low energy/signal-to-noise ratio (SNR), and noisy-speech training models are used to analyze the effect of the phase information and MFCCs in noisy conditions. The NTT database and the JNAS (Japanese Newspaper Article Sentences) database, with stationary/non-stationary noise added, were used to evaluate our proposed method. MFCCs outperformed the phase information for clean speech. On the other hand, the degradation of the phase information was significantly smaller than that of MFCCs for noisy speech. The individual result of the phase information was even better than that of MFCCs in many cases when clean-speech training models were used. Deleting unreliable frames (frames having low energy/SNR) improved the speaker identification performance significantly. By integrating the phase information with MFCCs, the speaker identification error reduction rate was about 30%-60% compared with the standard MFCC-based method.
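The following sketch illustrates one way to obtain MFCCs and frame-wise phase from the same STFT and to drop low-energy frames, in the spirit of the frame-skipping idea above. It uses librosa on a synthetic signal and does not reproduce the authors' phase normalization relative to the clipping position; all parameter values are placeholders.

```python
# Illustrative only: MFCCs and raw STFT phase per frame, with low-energy
# frames removed before any downstream scoring.
import numpy as np
import librosa

sr = 16000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 220 * t).astype(np.float32)   # 1 s synthetic tone as a stand-in for speech

n_fft, hop = 512, 160
stft = librosa.stft(y, n_fft=n_fft, hop_length=hop)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
phase = np.angle(stft)                                # (1 + n_fft/2, frames)

# Keep only frames whose peak energy is within 30 dB of the loudest frame.
frame_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max).max(axis=0)
keep = frame_db > -30.0
mfcc_kept, phase_kept = mfcc[:, keep], phase[:, keep]
print(mfcc.shape, mfcc_kept.shape, phase_kept.shape)
```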
Calandruccio, Lauren; Bradlow, Ann R; Dhar, Sumitrajit
2014-04-01
Masking release for an English sentence-recognition task in the presence of foreign-accented English speech compared with native-accented English speech was reported in Calandruccio et al. (2010a). The masking release appeared to increase as the masker intelligibility decreased. However, it could not be ruled out that spectral differences between the speech maskers were influencing the significant differences observed. The purpose of the current experiment was to minimize spectral differences between speech maskers to determine how various amounts of linguistic information within competing speech affect masking release. A mixed-model design with within-subject (four two-talker speech maskers) and between-subject (listener group) factors was conducted. Speech maskers included native-accented English speech and high-intelligibility, moderate-intelligibility, and low-intelligibility Mandarin-accented English. Normalizing the long-term average speech spectra of the maskers to each other minimized spectral differences between the masker conditions. Three listener groups were tested, including monolingual English speakers with normal hearing, nonnative English speakers with normal hearing, and monolingual English speakers with hearing loss. The nonnative English speakers were from various native language backgrounds, not including Mandarin (or any other Chinese dialect). Listeners with hearing loss had symmetric mild sloping to moderate sensorineural hearing loss. Listeners were asked to repeat back sentences that were presented in the presence of four different two-talker speech maskers. Responses were scored based on the key words within the sentences (100 key words per masker condition). A mixed-model regression analysis was used to analyze the difference in performance scores between the masker conditions and listener groups. Monolingual English speakers with normal hearing benefited when the competing speech signal was foreign accented compared with native accented, allowing for improved speech recognition. Various levels of intelligibility across the foreign-accented speech maskers did not influence results. Neither the nonnative English-speaking listeners with normal hearing nor the monolingual English speakers with hearing loss benefited from masking release when the masker was changed from native-accented to foreign-accented English. Slight modifications between the target and the masker speech allowed monolingual English speakers with normal hearing to improve their recognition of native-accented English, even when the competing speech was highly intelligible. Further research is needed to determine which modifications within the competing speech signal caused the Mandarin-accented English to be less effective with respect to masking. Determining the influences within the competing speech that make it less effective as a masker or determining why monolingual normal-hearing listeners can take advantage of these differences could help improve speech recognition for those with hearing loss in the future. American Academy of Audiology.
Crossmodal and incremental perception of audiovisual cues to emotional speech.
Barkhuysen, Pashiera; Krahmer, Emiel; Swerts, Marc
2010-01-01
In this article we report on two experiments about the perception of audiovisual cues to emotional speech. The article addresses two questions: (1) how do visual cues to emotion from a speaker's face relate to auditory cues, and (2) what is the recognition speed for various facial cues to emotion? Both experiments reported below are based on tests with video clips of emotional utterances collected via a variant of the well-known Velten method. More specifically, we recorded speakers who displayed positive or negative emotions, which were congruent or incongruent with the (emotional) lexical content of the uttered sentence. In order to test this, we conducted two experiments. The first experiment is a perception experiment in which Czech participants, who do not speak Dutch, rate the perceived emotional state of Dutch speakers in a bimodal (audiovisual) or a unimodal (audio- or vision-only) condition. It was found that incongruent emotional speech leads to significantly more extreme perceived emotion scores than congruent emotional speech, where the difference between congruent and incongruent emotional speech is larger for the negative than for the positive conditions. Interestingly, the largest overall differences between congruent and incongruent emotions were found for the audio-only condition, which suggests that posing an incongruent emotion has a particularly strong effect on the spoken realization of emotions. The second experiment uses a gating paradigm to test the recognition speed for various emotional expressions from a speaker's face. In this experiment participants were presented with the same clips as Experiment 1, but this time presented vision-only. The clips were shown in successive segments (gates) of increasing duration. Results show that participants are surprisingly accurate in their recognition of the various emotions, as they already reach high recognition scores in the first gate (after only 160 ms). Interestingly, the recognition scores rise faster for positive than negative conditions. Finally, the gating results suggest that incongruent emotions are perceived as more intense than congruent emotions, as the former get more extreme recognition scores than the latter, already after a short period of exposure.
Second language experience modulates word retrieval effort in bilinguals: evidence from pupillometry
Schmidtke, Jens
2014-01-01
Bilingual speakers often have less language experience compared to monolinguals as a result of speaking two languages and/or a later age of acquisition of the second language. This may result in weaker and less precise phonological representations of words in memory, which may cause greater retrieval effort during spoken word recognition. To gauge retrieval effort, the present study compared the effects of word frequency, neighborhood density (ND), and level of English experience by testing monolingual English speakers and native Spanish speakers who differed in their age of acquisition of English (early/late). In the experimental paradigm, participants heard English words and matched them to one of four pictures while the pupil size, an indication of cognitive effort, was recorded. Overall, both frequency and ND effects could be observed in the pupil response, indicating that lower frequency and higher ND were associated with greater retrieval effort. Bilingual speakers showed an overall delayed pupil response and a larger ND effect compared to the monolingual speakers. The frequency effect was the same in early bilinguals and monolinguals but was larger in late bilinguals. Within the group of bilingual speakers, higher English proficiency was associated with an earlier pupil response in addition to a smaller frequency and ND effect. These results suggest that greater retrieval effort associated with bilingualism may be a consequence of reduced language experience rather than constitute a categorical bilingual disadvantage. Future avenues for the use of pupillometry in the field of spoken word recognition are discussed. PMID:24600428
Schall, Sonja; von Kriegstein, Katharina
2014-01-01
It has been proposed that internal simulation of the talking face of visually-known speakers facilitates auditory speech recognition. One prediction of this view is that brain areas involved in auditory-only speech comprehension interact with visual face-movement sensitive areas, even under auditory-only listening conditions. Here, we test this hypothesis using connectivity analyses of functional magnetic resonance imaging (fMRI) data. Participants (17 normal participants, 17 developmental prosopagnosics) first learned six speakers via brief voice-face or voice-occupation training (<2 min/speaker). This was followed by an auditory-only speech recognition task and a control task (voice recognition) involving the learned speakers’ voices in the MRI scanner. As hypothesized, we found that, during speech recognition, familiarity with the speaker’s face increased the functional connectivity between the face-movement sensitive posterior superior temporal sulcus (STS) and an anterior STS region that supports auditory speech intelligibility. There was no difference between normal participants and prosopagnosics. This was expected because previous findings have shown that both groups use the face-movement sensitive STS to optimize auditory-only speech comprehension. Overall, the present findings indicate that learned visual information is integrated into the analysis of auditory-only speech and that this integration results from the interaction of task-relevant face-movement and auditory speech-sensitive areas. PMID:24466026
Distant Speech Recognition Using a Microphone Array Network
NASA Astrophysics Data System (ADS)
Nakano, Alberto Yoshihiro; Nakagawa, Seiichi; Yamamoto, Kazumasa
In this work, spatial information consisting of the position and orientation angle of an acoustic source is estimated by an artificial neural network (ANN). The estimated position of a speaker in an enclosed space is used to refine the estimated time delays for a delay-and-sum beamformer, thus enhancing the output signal. The orientation angle, in turn, is used to restrict the lexicon used in the recognition phase, assuming that the speaker faces a particular direction while speaking. To compensate for the effect of the transmission channel within a short frame analysis window, a new cepstral mean normalization (CMN) method based on a Gaussian mixture model (GMM) is investigated and shows better performance than conventional CMN for short utterances. The performance of the proposed method is evaluated through Japanese digit/command recognition experiments.
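A minimal delay-and-sum sketch follows, assuming integer per-channel delays are already known (in the system above they are refined from the ANN-estimated speaker position); the microphone signals below are simulated, and the function and variable names are illustrative.

```python
# Delay-and-sum: shift each channel to align with a reference, then average.
import numpy as np

def delay_and_sum(signals, delays_samples):
    """signals: (n_channels, n_samples); delays_samples: integer delay per channel."""
    n_ch, n = signals.shape
    out = np.zeros(n)
    for ch in range(n_ch):
        d = delays_samples[ch]
        shifted = np.roll(signals[ch], -d)      # advance channel ch by d samples
        if d > 0:
            shifted[-d:] = 0.0                  # zero the wrapped-around tail
        elif d < 0:
            shifted[:-d] = 0.0                  # zero the wrapped-around head
        out += shifted
    return out / n_ch

rng = np.random.default_rng(1)
clean = rng.normal(size=4000)
delays = [0, 3, 7, 12]                          # simulated per-microphone delays
mics = np.stack([np.roll(clean, d) + 0.5 * rng.normal(size=4000) for d in delays])
enhanced = delay_and_sum(mics, delays)
print(np.corrcoef(clean, enhanced)[0, 1])       # higher than any single noisy channel
```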
ELF on a Mushroom: The Overnight Growth in English as a Lingua Franca
ERIC Educational Resources Information Center
Sowden, Colin
2012-01-01
In an effort to curtail native-speaker dominance of global English, and in recognition of the growing role of the language among non-native speakers from different first-language backgrounds, some academics have been urging the teaching of English as a Lingua Franca (ELF). Although at first this proposal seems to offer a plausible alternative to…
Yu, Chengzhu; Hansen, John H L
2017-03-01
Human physiology has evolved to accommodate environmental conditions, including temperature, pressure, and air chemistry unique to Earth. However, the environment in space varies significantly compared to that on Earth and, therefore, variability is expected in astronauts' speech production mechanism. In this study, the variations of astronaut voice characteristics during the NASA Apollo 11 mission are analyzed. Specifically, acoustical features such as fundamental frequency and phoneme formant structure that are closely related to the speech production system are studied. For a further understanding of astronauts' vocal tract spectrum variation in space, a maximum likelihood frequency warping based analysis is proposed to detect the vocal tract spectrum displacement during space conditions. The results from fundamental frequency, formant structure, as well as vocal spectrum displacement indicate that astronauts change their speech production mechanism when in space. Moreover, the experimental results for astronaut voice identification tasks indicate that current speaker recognition solutions are highly vulnerable to astronaut voice production variations in space conditions. Future recommendations from this study suggest that successful applications of speaker recognition during extended space missions require robust speaker modeling techniques that could effectively adapt to voice production variation caused by diverse space conditions.
Automatic Method of Pause Measurement for Normal and Dysarthric Speech
ERIC Educational Resources Information Center
Rosen, Kristin; Murdoch, Bruce; Folker, Joanne; Vogel, Adam; Cahill, Louise; Delatycki, Martin; Corben, Louise
2010-01-01
This study proposes an automatic method for the detection of pauses and identification of pause types in conversational speech for the purpose of measuring the effects of Friedreich's Ataxia (FRDA) on speech. Speech samples of [approximately] 3 minutes were recorded from 13 speakers with FRDA and 18 healthy controls. Pauses were measured from the…
NASA Astrophysics Data System (ADS)
Scharenborg, Odette; ten Bosch, Louis; Boves, Lou; Norris, Dennis
2003-12-01
This letter evaluates potential benefits of combining human speech recognition (HSR) and automatic speech recognition by building a joint model of an automatic phone recognizer (APR) and a computational model of HSR, viz., Shortlist [Norris, Cognition 52, 189-234 (1994)]. Experiments based on "real-life" speech highlight critical limitations posed by some of the simplifying assumptions made in models of human speech recognition. These limitations could be overcome by avoiding hard phone decisions at the output side of the APR, and by using a match between the input and the internal lexicon that flexibly copes with deviations from canonical phonemic representations.
Application of image recognition-based automatic hyphae detection in fungal keratitis.
Wu, Xuelian; Tao, Yuan; Qiu, Qingchen; Wu, Xinyi
2018-03-01
The purpose of this study is to evaluate the accuracy of two methods for diagnosing fungal keratitis: automatic hyphae detection based on image recognition and corneal smear examination. We evaluate the sensitivity and specificity of automatic hyphae detection, analyze the consistency between clinical symptoms and hyphal density, and quantify hyphal density using the same image-recognition method. The study included 56 cases of fungal keratitis (one eye each) and 23 cases of bacterial keratitis. All cases underwent routine slit-lamp biomicroscopy, corneal smear examination, microorganism culture, and assessment of in vivo confocal microscopy images before medical treatment was started. The confocal microscopy images were then analyzed with automatic hyphae detection based on image recognition to evaluate its sensitivity and specificity and to compare it with corneal smear examination. Hyphal density was subsequently used to grade the severity of infection, and its correlation and consistency with the patients' clinical symptoms were evaluated. The accuracy of this technology was superior to corneal smear examination (p < 0.05). The sensitivity of automatic hyphae detection based on image recognition was 89.29% and the specificity was 95.65%; the area under the ROC curve was 0.946. The correlation coefficient between the severity grading of fungal keratitis obtained by automatic hyphae detection and the clinical grading was 0.87. Automatic hyphae detection based on image recognition identified fungal keratitis with high sensitivity and specificity, outperforming corneal smear examination. Compared with conventional manual interpretation of confocal microscopy corneal images, the technology is accurate and stable and does not rely on human expertise, making it most useful for clinicians who are not familiar with fungal keratitis. It can also quantify and grade hyphal density and, being noninvasive, can provide a timely, accurate, objective, and quantitative evaluation criterion for fungal keratitis.
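The sensitivity, specificity, and ROC figures quoted above can be computed from per-case decisions as in the short sketch below; the case labels and detector scores are synthetic placeholders, not the study's data.

```python
# Evaluation sketch only (not the detection algorithm): sensitivity,
# specificity, and ROC area from per-case labels and detector scores.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1] * 56 + [0] * 23)                    # 56 fungal, 23 bacterial cases
rng = np.random.default_rng(2)
scores = np.concatenate([rng.uniform(0.4, 1.0, 56), rng.uniform(0.0, 0.6, 23)])
y_pred = (scores >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("sensitivity", tp / (tp + fn))
print("specificity", tn / (tn + fp))
print("ROC AUC    ", roc_auc_score(y_true, scores))
```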
Calandruccio, Lauren; Bradlow, Ann R.; Dhar, Sumitrajit
2013-01-01
Background Masking release for an English sentence-recognition task in the presence of foreign-accented English speech compared to native-accented English speech was reported in Calandruccio, Dhar and Bradlow (2010). The masking release appeared to increase as the masker intelligibility decreased. However, it could not be ruled out that spectral differences between the speech maskers were influencing the significant differences observed. Purpose The purpose of the current experiment was to minimize spectral differences between speech maskers to determine how various amounts of linguistic information within competing speech affect masking release. Research Design A mixed model design with within- (four two-talker speech maskers) and between-subject (listener group) factors was conducted. Speech maskers included native-accented English speech, and high-intelligibility, moderate-intelligibility and low-intelligibility Mandarin-accented English. Normalizing the long-term average speech spectra of the maskers to each other minimized spectral differences between the masker conditions. Study Sample Three listener groups were tested including monolingual English speakers with normal hearing, non-native speakers of English with normal hearing, and monolingual speakers of English with hearing loss. The non-native speakers of English were from various native-language backgrounds, not including Mandarin (or any other Chinese dialect). Listeners with hearing loss had symmetrical, mild sloping to moderate sensorineural hearing loss. Data Collection and Analysis Listeners were asked to repeat back sentences that were presented in the presence of four different two-talker speech maskers. Responses were scored based on the keywords within the sentences (100 keywords/masker condition). A mixed-model regression analysis was used to analyze the difference in performance scores between the masker conditions and the listener groups. Results Monolingual speakers of English with normal hearing benefited when the competing speech signal was foreign-accented compared to native-accented allowing for improved speech recognition. Various levels of intelligibility across the foreign-accented speech maskers did not influence results. Neither the non-native English listeners with normal hearing, nor the monolingual English speakers with hearing loss benefited from masking release when the masker was changed from native-accented to foreign-accented English. Conclusions Slight modifications between the target and the masker speech allowed monolingual speakers of English with normal hearing to improve their recognition of native-accented English even when the competing speech was highly intelligible. Further research is needed to determine which modifications within the competing speech signal caused the Mandarin-accented English to be less effective with respect to masking. Determining the influences within the competing speech that make it less effective as a masker, or determining why monolingual normal-hearing listeners can take advantage of these differences could help improve speech recognition for those with hearing loss in the future. PMID:25126683
Word recognition materials for native speakers of Taiwan Mandarin.
Nissen, Shawn L; Harris, Richard W; Dukes, Alycia
2008-06-01
To select, digitally record, evaluate, and psychometrically equate word recognition materials that can be used to measure the speech perception abilities of native speakers of Taiwan Mandarin in quiet. Frequently used bisyllabic words produced by male and female talkers of Taiwan Mandarin were digitally recorded and subsequently evaluated using 20 native listeners with normal hearing at 10 intensity levels (-5 to 40 dB HL) in increments of 5 dB. Using logistic regression, 200 words with the steepest psychometric slopes were divided into 4 lists and 8 half-lists that were relatively equivalent in psychometric function slope. To increase auditory homogeneity of the lists, the intensity of words in each list was digitally adjusted so that the threshold of each list was equal to the midpoint between the mean thresholds of the male and female half-lists. Digital recordings of the word recognition lists and the associated clinical instructions are available on CD upon request.
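A hypothetical sketch of the list-equating step described above: per-word percent-correct data across presentation levels are fit with a logistic psychometric function and words are ranked by fitted slope. The data, word count, and parameter values are invented; only the fitting and ranking logic is illustrative.

```python
# Fit a logistic psychometric function per word, then rank by slope.
import numpy as np
from scipy.optimize import curve_fit

levels = np.arange(-5, 45, 5, dtype=float)                 # dB HL presentation levels

def logistic(x, midpoint, slope):
    return 1.0 / (1.0 + np.exp(-slope * (x - midpoint)))

rng = np.random.default_rng(3)
slopes = {}
for word_id in range(10):                                  # pretend 10 candidate words
    true_mid, true_slope = rng.uniform(5, 20), rng.uniform(0.1, 0.5)
    p_correct = logistic(levels, true_mid, true_slope) + rng.normal(0, 0.03, levels.size)
    (mid_hat, slope_hat), _ = curve_fit(logistic, levels, np.clip(p_correct, 0, 1),
                                        p0=[10.0, 0.2], maxfev=5000)
    slopes[word_id] = slope_hat

steepest = sorted(slopes, key=slopes.get, reverse=True)
print("words ranked by psychometric slope:", steepest)
```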
Analysis and synthesis of intonation using the Tilt model.
Taylor, P
2000-03-01
This paper introduces the Tilt intonational model and describes how this model can be used to automatically analyze and synthesize intonation. In the model, intonation is represented as a linear sequence of events, which can be pitch accents or boundary tones. Each event is characterized by continuous parameters representing amplitude, duration, and tilt (a measure of the shape of the event). The paper describes an event detector, in effect an intonational recognition system, which produces a transcription of an utterance's intonation. The features and parameters of the event detector are discussed and performance figures are shown on a variety of read and spontaneous speaker independent conversational speech databases. Given the event locations, algorithms are described which produce an automatic analysis of each event in terms of the Tilt parameters. Synthesis algorithms are also presented which generate F0 contours from Tilt representations. The accuracy of these is shown by comparing synthetic F0 contours to real F0 contours. The paper concludes with an extensive discussion on linguistic representations of intonation and gives evidence that the Tilt model goes a long way to satisfying the desired goals of such a representation in that it has the right number of degrees of freedom to be able to describe and synthesize intonation accurately.
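The sketch below computes the Tilt parameters of a single event from its rise and fall amplitudes and durations, following the usual normalized-difference definitions of the Tilt model; the input values are hypothetical, and the detection of events in the F0 contour is not shown.

```python
# Tilt parameterization of one intonational event (illustrative values).
def tilt_parameters(a_rise, a_fall, d_rise, d_fall):
    amp = abs(a_rise) + abs(a_fall)                  # overall event amplitude (Hz)
    dur = d_rise + d_fall                            # overall event duration (s)
    tilt_amp = (abs(a_rise) - abs(a_fall)) / amp if amp else 0.0
    tilt_dur = (d_rise - d_fall) / dur if dur else 0.0
    tilt = 0.5 * (tilt_amp + tilt_dur)               # -1 = pure fall, +1 = pure rise
    return amp, dur, tilt

print(tilt_parameters(a_rise=30.0, a_fall=-45.0, d_rise=0.12, d_fall=0.20))
```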
Tuning time-frequency methods for the detection of metered HF speech
NASA Astrophysics Data System (ADS)
Nelson, Douglas J.; Smith, Lawrence H.
2002-12-01
Speech is metered if the stresses occur at a nearly regular rate. Metered speech is common in poetry, and it can occur naturally in speech, if the speaker is spelling a word or reciting words or numbers from a list. In radio communications, the CQ request, call sign and other codes are frequently metered. In tactical communications and air traffic control, location, heading and identification codes may be metered. Moreover metering may be expected to survive even in HF communications, which are corrupted by noise, interference and mistuning. For this environment, speech recognition and conventional machine-based methods are not effective. We describe Time-Frequency methods which have been adapted successfully to the problem of mitigation of HF signal conditions and detection of metered speech. These methods are based on modeled time and frequency correlation properties of nearly harmonic functions. We derive these properties and demonstrate a performance gain over conventional correlation and spectral methods. Finally, in addressing the problem of HF single sideband (SSB) communications, the problems of carrier mistuning, interfering signals, such as manual Morse, and fast automatic gain control (AGC) must be addressed. We demonstrate simple methods which may be used to blindly mitigate mistuning and narrowband interference, and effectively invert the fast automatic gain function.
A Hybrid Acoustic and Pronunciation Model Adaptation Approach for Non-native Speech Recognition
NASA Astrophysics Data System (ADS)
Oh, Yoo Rhee; Kim, Hong Kook
In this paper, we propose a hybrid model adaptation approach in which pronunciation and acoustic models are adapted by incorporating the pronunciation and acoustic variabilities of non-native speech in order to improve the performance of non-native automatic speech recognition (ASR). Specifically, the proposed hybrid model adaptation can be performed at either the state-tying or triphone-modeling level, depending on the level at which acoustic model adaptation is performed. In both methods, we first analyze the pronunciation variant rules of non-native speakers and then classify each rule as either a pronunciation variant or an acoustic variant. The state-tying level hybrid method then adapts pronunciation models and acoustic models by accommodating the pronunciation variants in the pronunciation dictionary and by clustering the states of triphone acoustic models using the acoustic variants, respectively. On the other hand, the triphone-modeling level hybrid method initially adapts pronunciation models in the same way as in the state-tying level hybrid method; however, for the acoustic model adaptation, the triphone acoustic models are then re-estimated based on the adapted pronunciation models, and the states of the re-estimated triphone acoustic models are clustered using the acoustic variants. From experiments on Korean-spoken English speech recognition, it is shown that ASR systems employing the state-tying and triphone-modeling level adaptation methods can reduce the average word error rates (WERs) for non-native speech by a relative 17.1% and 22.1%, respectively, when compared to a baseline ASR system.
Audiovisual cues benefit recognition of accented speech in noise but not perceptual adaptation.
Banks, Briony; Gowen, Emma; Munro, Kevin J; Adank, Patti
2015-01-01
Perceptual adaptation allows humans to recognize different varieties of accented speech. We investigated whether perceptual adaptation to accented speech is facilitated if listeners can see a speaker's facial and mouth movements. In Study 1, participants listened to sentences in a novel accent and underwent a period of training with audiovisual or audio-only speech cues, presented in quiet or in background noise. A control group also underwent training with visual-only (speech-reading) cues. We observed no significant difference in perceptual adaptation between any of the groups. To address a number of remaining questions, we carried out a second study using a different accent, speaker and experimental design, in which participants listened to sentences in a non-native (Japanese) accent with audiovisual or audio-only cues, without separate training. Participants' eye gaze was recorded to verify that they looked at the speaker's face during audiovisual trials. Recognition accuracy was significantly better for audiovisual than for audio-only stimuli; however, no statistical difference in perceptual adaptation was observed between the two modalities. Furthermore, Bayesian analysis suggested that the data supported the null hypothesis. Our results suggest that although the availability of visual speech cues may be immediately beneficial for recognition of unfamiliar accented speech in noise, it does not improve perceptual adaptation.
Dickinson, Ann-Marie; Baker, Richard; Siciliano, Catherine; Munro, Kevin J
2014-10-01
To identify which training approach, if any, is most effective for improving perception of frequency-compressed speech. A between-subject design using repeated measures. Forty young adults with normal hearing were randomly allocated to one of four groups: a training group (sentence or consonant) or a control group (passive exposure or test-only). Test and training material differed in terms of material and speaker. On average, sentence training and passive exposure led to significantly improved sentence recognition (11.0% and 11.7%, respectively) compared with the consonant training group (2.5%) and test-only group (0.4%), whilst, consonant training led to significantly improved consonant recognition (8.8%) compared with the sentence training group (1.9%), passive exposure group (2.8%), and test-only group (0.8%). Sentence training led to improved sentence recognition, whilst consonant training led to improved consonant recognition. This suggests learning transferred between speakers and material but not stimuli. Passive exposure to sentence material led to an improvement in sentence recognition that was equivalent to gains from active training. This suggests that it may be possible to adapt passively to frequency-compressed speech.
Do Bilinguals Automatically Activate Their Native Language When They Are Not Using It?
ERIC Educational Resources Information Center
Costa, Albert; Pannunzi, Mario; Deco, Gustavo; Pickering, Martin J.
2017-01-01
Most models of lexical access assume that bilingual speakers activate their two languages even when they are in a context in which only one language is used. A critical piece of evidence used to support this notion is the observation that a given word automatically activates its translation equivalent in the other language. Here, we argue that…
Open-set speaker identification with diverse-duration speech data
NASA Astrophysics Data System (ADS)
Karadaghi, Rawande; Hertlein, Heinz; Ariyaeeinia, Aladdin
2015-05-01
This paper is concerned with an important category of applications of open-set speaker identification in criminal investigation, which involve operating with speech of short and varied duration. The study presents investigations into the adverse effects of such an operating condition on the accuracy of open-set speaker identification, based on both GMM-UBM and i-vector approaches. The experiments are conducted using a protocol developed for the identification task, based on the NIST speaker recognition evaluation corpus of 2008. In order to closely cover the real-world operating conditions in the considered application area, the study includes experiments with various combinations of training and testing data duration. The paper details the characteristics of the experimental investigations conducted and provides a thorough analysis of the results obtained.
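As a rough sketch of a GMM-UBM style open-set decision, the code below trains a universal background model and per-speaker GMMs on placeholder features and rejects test segments whose best UBM-normalized score falls below a threshold. Note that a real GMM-UBM system MAP-adapts the speaker models from the UBM; here they are trained independently for brevity, and all feature values and the threshold are invented.

```python
# Open-set identification sketch: accept the best-scoring speaker only if the
# UBM-normalized log-likelihood ratio exceeds a threshold, else reject.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
ubm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(rng.normal(size=(2000, 20)))

speakers = {}
for name, shift in [("spk_a", 1.0), ("spk_b", -1.0)]:
    feats = rng.normal(loc=shift, size=(500, 20))
    speakers[name] = GaussianMixture(n_components=8, covariance_type="diag",
                                     random_state=0).fit(feats)

def identify(test_feats, threshold=0.2):
    ubm_ll = ubm.score(test_feats)                           # average log-likelihood
    llrs = {n: m.score(test_feats) - ubm_ll for n, m in speakers.items()}
    best = max(llrs, key=llrs.get)
    return best if llrs[best] > threshold else "unknown"

print(identify(rng.normal(loc=1.0, size=(50, 20))))          # short "spk_a"-like segment
print(identify(rng.normal(loc=0.0, size=(50, 20))))          # speaker unlike either model
```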
Application of automatic threshold in dynamic target recognition with low contrast
NASA Astrophysics Data System (ADS)
Miao, Hua; Guo, Xiaoming; Chen, Yu
2014-11-01
A hybrid photoelectric joint transform correlator can realize automatic real-time recognition with high precision by combining optical and electronic devices. When recognizing low-contrast targets with a photoelectric joint transform correlator, differences in attitude, brightness, and grayscale between target and template mean that only four to five frames of a dynamic target can be recognized without any processing. A CCD camera capturing 25 frames per second is used to record the dynamic target images. Automatic thresholding has many advantages, such as fast processing, effective suppression of noise interference, enhancement of the diffraction energy of useful information, and better preservation of the outlines of target and template, so it plays a very important role in target recognition by optical correlation. However, the threshold obtained automatically by a program cannot achieve the best recognition results for dynamic targets, because outline information is broken to some extent; in most cases the optimal threshold is obtained by manual intervention. To address the characteristics of dynamic targets, an improved automatic threshold procedure is implemented by multiplying the OTSU threshold of target and template by a scale coefficient of the processed image and combining it with mathematical morphology. The optimal threshold for dynamic low-contrast target images can then be obtained automatically. The recognition rate of dynamic targets is improved through a reduced background-noise effect and increased correlation information. A series of dynamic tank images moving at about 70 km/h was adopted as target images. Without any processing, the 1st frame of this series correlates only up to the 3rd frame. With the OTSU threshold, up to the 80th frame can be recognized, and with the improved automatic threshold processing of the joint images this number increases to 89 frames. Experimental results show that the improved automatic threshold processing has particular application value for the recognition of dynamic targets with low contrast.
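A minimal sketch of the scaled-Otsu idea described above, assuming OpenCV is available: the Otsu threshold is computed, multiplied by a scale coefficient, and the resulting binary image is cleaned by morphological opening. The test image and the coefficient value are placeholders, not the authors' data or tuning.

```python
# Otsu threshold scaled by a coefficient, followed by morphological opening.
import numpy as np
import cv2

rng = np.random.default_rng(5)
img = np.clip(rng.normal(100, 10, size=(128, 128)), 0, 255).astype(np.uint8)
cv2.rectangle(img, (40, 40), (90, 90), 125, thickness=-1)       # faint low-contrast "target"

otsu_t, _ = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
scale = 1.05                                                    # hypothetical coefficient
_, binary = cv2.threshold(img, scale * otsu_t, 255, cv2.THRESH_BINARY)

kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
cleaned = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
print("Otsu threshold:", otsu_t, "scaled:", scale * otsu_t)
```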
Processing of Acoustic Cues in Lexical-Tone Identification by Pediatric Cochlear-Implant Recipients
ERIC Educational Resources Information Center
Peng, Shu-Chen; Lu, Hui-Ping; Lu, Nelson; Lin, Yung-Song; Deroche, Mickael L. D.; Chatterjee, Monita
2017-01-01
Purpose: The objective was to investigate acoustic cue processing in lexical-tone recognition by pediatric cochlear-implant (CI) recipients who are native Mandarin speakers. Method: Lexical-tone recognition was assessed in pediatric CI recipients and listeners with normal hearing (NH) in 2 tasks. In Task 1, participants identified naturally…
L2 Gender Facilitation and Inhibition in Spoken Word Recognition
ERIC Educational Resources Information Center
Behney, Jennifer N.
2011-01-01
This dissertation investigates the role of grammatical gender facilitation and inhibition in second language (L2) learners' spoken word recognition. Native speakers of languages that have grammatical gender are sensitive to gender marking when hearing and recognizing a word. Gender facilitation refers to when a given noun that is preceded by an…
ERP Evidence of Hemispheric Independence in Visual Word Recognition
ERIC Educational Resources Information Center
Nemrodov, Dan; Harpaz, Yuval; Javitt, Daniel C.; Lavidor, Michal
2011-01-01
This study examined the capability of the left hemisphere (LH) and the right hemisphere (RH) to perform a visual recognition task independently as formulated by the Direct Access Model (Fernandino, Iacoboni, & Zaidel, 2007). Healthy native Hebrew speakers were asked to categorize nouns and non-words (created from nouns by transposing two middle…
Speaker diarization system on the 2007 NIST rich transcription meeting recognition evaluation
NASA Astrophysics Data System (ADS)
Sun, Hanwu; Nwe, Tin Lay; Koh, Eugene Chin Wei; Bin, Ma; Li, Haizhou
2007-09-01
This paper presents a speaker diarization system developed at the Institute for Infocomm Research (I2R) for the NIST Rich Transcription 2007 (RT-07) evaluation task. We describe in detail our primary approaches to speaker diarization under the Multiple Distant Microphones (MDM) conditions in the conference-room scenario. Our proposed system consists of six modules: (1) a normalized least-mean-square (NLMS) adaptive filter for speaker direction estimation via Time Difference of Arrival (TDOA); (2) an initial speaker clustering via a two-stage TDOA histogram distribution quantization approach; (3) multiple-microphone speaker data alignment via GCC-PHAT Time Delay Estimation (TDE) among all the distant microphone channel signals; (4) a speaker clustering algorithm based on a GMM modeling approach; (5) non-speech removal via a speech/non-speech verification mechanism; and (6) silence removal via a "Double-Layer Windowing" (DLW) method. We achieve an error rate of 31.02% on the 2006 Spring (RT-06s) MDM evaluation task and a competitive overall error rate of 15.32% on the NIST Rich Transcription 2007 (RT-07) MDM evaluation task.
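Module (3) above relies on GCC-PHAT time-delay estimation; a compact sketch on synthetic two-channel data follows (the function name and signal values are illustrative, not taken from the system described).

```python
# GCC-PHAT: whiten the cross-spectrum, inverse-FFT, take the peak lag.
import numpy as np

def gcc_phat(x, y, fs, max_tau=None):
    n = len(x) + len(y)
    X, Y = np.fft.rfft(x, n=n), np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12                  # PHAT weighting (phase transform)
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs                               # delay of x relative to y, in seconds

fs = 16000
rng = np.random.default_rng(6)
src = rng.normal(size=fs)                           # 1 s of "speech-like" noise
delay_samples = 40                                  # ~2.5 ms path difference
mic1 = src + 0.1 * rng.normal(size=fs)
mic2 = np.roll(src, delay_samples) + 0.1 * rng.normal(size=fs)
print(gcc_phat(mic2, mic1, fs))                     # approx delay_samples / fs = 0.0025 s
```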
Cai, Zhenguang G; Gilbert, Rebecca A; Davis, Matthew H; Gaskell, M Gareth; Farrar, Lauren; Adler, Sarah; Rodd, Jennifer M
2017-11-01
Speech carries accent information relevant to determining the speaker's linguistic and social background. A series of web-based experiments demonstrate that accent cues can modulate access to word meaning. In Experiments 1-3, British participants were more likely to retrieve the American dominant meaning (e.g., hat meaning of "bonnet") in a word association task if they heard the words in an American than a British accent. In addition, results from a speeded semantic decision task (Experiment 4) and sentence comprehension task (Experiment 5) confirm that accent modulates on-line meaning retrieval such that comprehension of ambiguous words is easier when the relevant word meaning is dominant in the speaker's dialect. Critically, neutral-accent speech items, created by morphing British- and American-accented recordings, were interpreted in a similar way to accented words when embedded in a context of accented words (Experiment 2). This finding indicates that listeners do not use accent to guide meaning retrieval on a word-by-word basis; instead they use accent information to determine the dialectic identity of a speaker and then use their experience of that dialect to guide meaning access for all words spoken by that person. These results motivate a speaker-model account of spoken word recognition in which comprehenders determine key characteristics of their interlocutor and use this knowledge to guide word meaning access. Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.
Gay- and Lesbian-Sounding Auditory Cues Elicit Stereotyping and Discrimination.
Fasoli, Fabio; Maass, Anne; Paladino, Maria Paola; Sulpizio, Simone
2017-07-01
The growing body of literature on the recognition of sexual orientation from voice ("auditory gaydar") is silent on the cognitive and social consequences of having a gay-/lesbian- versus heterosexual-sounding voice. We investigated this issue in four studies (overall N = 276), conducted in Italian language, in which heterosexual listeners were exposed to single-sentence voice samples of gay/lesbian and heterosexual speakers. In all four studies, listeners were found to make gender-typical inferences about traits and preferences of heterosexual speakers, but gender-atypical inferences about those of gay or lesbian speakers. Behavioral intention measures showed that listeners considered lesbian and gay speakers as less suitable for a leadership position, and male (but not female) listeners took distance from gay speakers. Together, this research demonstrates that having a gay/lesbian rather than heterosexual-sounding voice has tangible consequences for stereotyping and discrimination.
Severity-Based Adaptation with Limited Data for ASR to Aid Dysarthric Speakers
Mustafa, Mumtaz Begum; Salim, Siti Salwah; Mohamed, Noraini; Al-Qatab, Bassam; Siong, Chng Eng
2014-01-01
Automatic speech recognition (ASR) is currently used in many assistive technologies, such as helping individuals with speech impairment in their communication ability. One challenge in ASR for speech-impaired individuals is the difficulty in obtaining a good speech database of impaired speakers for building an effective speech acoustic model. Because there are very few existing databases of impaired speech, which are also limited in size, the obvious solution to build a speech acoustic model of impaired speech is by employing adaptation techniques. However, issues that have not been addressed in existing studies in the area of adaptation for speech impairment are as follows: (1) identifying the most effective adaptation technique for impaired speech; and (2) the use of suitable source models to build an effective impaired-speech acoustic model. This research investigates the above-mentioned two issues on dysarthria, a type of speech impairment affecting millions of people. We applied both unimpaired and impaired speech as the source model with well-known adaptation techniques like the maximum likelihood linear regression (MLLR) and the constrained-MLLR(C-MLLR). The recognition accuracy of each impaired speech acoustic model is measured in terms of word error rate (WER), with further assessments, including phoneme insertion, substitution and deletion rates. Unimpaired speech when combined with limited high-quality speech-impaired data improves performance of ASR systems in recognising severely impaired dysarthric speech. The C-MLLR adaptation technique was also found to be better than MLLR in recognising mildly and moderately impaired speech based on the statistical analysis of the WER. It was found that phoneme substitution was the biggest contributing factor in WER in dysarthric speech for all levels of severity. The results show that the speech acoustic models derived from suitable adaptation techniques improve the performance of ASR systems in recognising impaired speech with limited adaptation data. PMID:24466004
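The study above reports recognition accuracy as word error rate (WER). A minimal WER computation via word-level edit distance looks roughly as follows; the example sentences are placeholders.

```python
# WER = (substitutions + deletions + insertions) / number of reference words,
# computed with a standard Levenshtein dynamic program over words.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the quick brown fox", "the quik brown fox jumps"))   # 2 errors / 4 words = 0.5
```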
Data Selection for Within-Class Covariance Estimation
2016-09-08
recognition performance. While developers have typically exploited the vast archive of speaker-labeled data available from earlier NIST evaluations … utterances from a large population of speakers. Fortunately, participants in NIST evaluations have access to a vast repository of legacy data from earlier … previous NIST evaluations. Training data for the UBM and T-matrix was obtained from the NIST Switchboard 2 phases 2-5 [12] and SRE04/05/06 utterances
Phonologically-based biomarkers for major depressive disorder
NASA Astrophysics Data System (ADS)
Trevino, Andrea Carolina; Quatieri, Thomas Francis; Malyska, Nicolas
2011-12-01
Of increasing importance in the civilian and military population is the recognition of major depressive disorder at its earliest stages and intervention before the onset of severe symptoms. Toward the goal of more effective monitoring of depression severity, we introduce vocal biomarkers that are derived automatically from phonologically-based measures of speech rate. To assess our measures, we use a 35-speaker free-response speech database of subjects treated for depression over a 6-week duration. We find that dissecting average measures of speech rate into phone-specific characteristics and, in particular, combined phone-duration measures uncovers stronger relationships between speech rate and depression severity than global measures previously reported for a speech-rate biomarker. Results of this study are supported by correlation of our measures with depression severity and classification of depression state with these vocal measures. Our approach provides a general framework for analyzing individual symptom categories through phonological units, and supports the premise that speaking rate can be an indicator of psychomotor retardation severity.
Liu, Hanjun; Wang, Emily Q.; Chen, Zhaocong; Liu, Peng; Larson, Charles R.; Huang, Dongfeng
2010-01-01
The purpose of this cross-language study was to examine whether the online control of voice fundamental frequency (F0) during vowel phonation is influenced by language experience. Native speakers of Cantonese and Mandarin, both tonal languages spoken in China, participated in the experiments. Subjects were asked to vocalize a vowel sound ∕u∕ at their comfortable habitual F0, during which their voice pitch was unexpectedly shifted (±50, ±100, ±200, or ±500 cents, 200 ms duration) and fed back instantaneously to them over headphones. The results showed that Cantonese speakers produced significantly smaller responses than Mandarin speakers when the stimulus magnitude varied from 200 to 500 cents. Further, response magnitudes decreased along with the increase in stimulus magnitude in Cantonese speakers, which was not observed in Mandarin speakers. These findings suggest that online control of voice F0 during vocalization is sensitive to language experience. Further, systematic modulations of vocal responses across stimulus magnitude were observed in Cantonese speakers but not in Mandarin speakers, which indicates that this highly automatic feedback mechanism is sensitive to the specific tonal system of each language. PMID:21218905
Leveraging Automatic Speech Recognition Errors to Detect Challenging Speech Segments in TED Talks
ERIC Educational Resources Information Center
Mirzaei, Maryam Sadat; Meshgi, Kourosh; Kawahara, Tatsuya
2016-01-01
This study investigates the use of Automatic Speech Recognition (ASR) systems to epitomize second language (L2) listeners' problems in perception of TED talks. ASR-generated transcripts of videos often involve recognition errors, which may indicate difficult segments for L2 listeners. This paper aims to discover the root-causes of the ASR errors…
2013-01-01
Background Most of the institutional and research information in the biomedical domain is available in the form of English text. Even in countries where English is an official language, such as the United States, language can be a barrier for accessing biomedical information for non-native speakers. Recent progress in machine translation suggests that this technique could help make English texts accessible to speakers of other languages. However, the lack of adequate specialized corpora needed to train statistical models currently limits the quality of automatic translations in the biomedical domain. Results We show how a large-sized parallel corpus can automatically be obtained for the biomedical domain, using the MEDLINE database. The corpus generated in this work comprises article titles obtained from MEDLINE and abstract text automatically retrieved from journal websites, which substantially extends the corpora used in previous work. After assessing the quality of the corpus for two language pairs (English/French and English/Spanish) we use the Moses package to train a statistical machine translation model that outperforms previous models for automatic translation of biomedical text. Conclusions We have built translation data sets in the biomedical domain that can easily be extended to other languages available in MEDLINE. These sets can successfully be applied to train statistical machine translation models. While further progress should be made by incorporating out-of-domain corpora and domain-specific lexicons, we believe that this work improves the automatic translation of biomedical texts. PMID:23631733
Suominen, Hanna; Johnson, Maree; Zhou, Liyuan; Sanchez, Paula; Sirel, Raul; Basilakis, Jim; Hanlen, Leif; Estival, Dominique; Dawson, Linda; Kelly, Barbara
2015-01-01
Objective We study the use of speech recognition and information extraction to generate drafts of Australian nursing-handover documents. Methods Speech recognition correctness and clinicians’ preferences were evaluated using 15 recorder–microphone combinations, six documents, three speakers, Dragon Medical 11, and five survey/interview participants. Information extraction correctness evaluation used 260 documents, six-class classification for each word, two annotators, and the CRF++ conditional random field toolkit. Results A noise-cancelling lapel-microphone with a digital voice recorder gave the best correctness (79%). This microphone was also the option preferred by all but one participant. Although the participants liked the small size of this recorder, their preference was for tablets that can also be used for document proofing and sign-off, among other tasks. Accented speech was harder to recognize than native speech, and a male speaker was recognized better than a female speaker. Information extraction was excellent in filtering out irrelevant text (85% F1) and identifying text relevant to two classes (87% and 70% F1). Similarly to the annotators’ disagreements, there was confusion between the remaining three classes, which explains the modest 62% macro-averaged F1. Discussion We present evidence for the feasibility of speech recognition and information extraction to support clinicians in entering text and to unlock its content for computerized decision-making and surveillance in healthcare. Conclusions The benefits of this automation include storing all information; making the drafts available and accessible almost instantly to everyone with authorized access; and avoiding information loss, delays, and misinterpretations inherent to using a ward clerk or transcription services. PMID:25336589
Factors Impacting Recognition of False Collocations by Speakers of English as L1 and L2
ERIC Educational Resources Information Center
Makinina, Olga
2017-01-01
Currently there is a general uncertainty about what makes collocations (i.e., fixed word combinations with specific, not easily interpreted relations between their components) hard for ESL learners to master, and about how to improve collocation recognition and learning process. This study explored and designed a comparative classification of…
ERIC Educational Resources Information Center
Khateb, Asaid; Khateb-Abdelgani, Manal; Taha, Haitham Y.; Ibrahim, Raphiq
2014-01-01
This study aimed at assessing the effects of letters' connectivity in Arabic on visual word recognition. For this purpose, reaction times (RTs) and accuracy scores were collected from ninety-third, sixth and ninth grade native Arabic speakers during a lexical decision task, using fully connected (Cw), partially connected (PCw) and…
Detailed Phonetic Labeling of Multi-language Database for Spoken Language Processing Applications
2015-03-01
which contains about 60 interfering speakers as well as background music in a bar. The top panel is again the clean-training/noisy-testing setting … a recognition system for Mandarin was developed and tested. Character recognition rates as high as 88% were obtained, using approximately 40 training …
Semantic Ambiguity Effects in L2 Word Recognition
ERIC Educational Resources Information Center
Ishida, Tomomi
2018-01-01
The present study examined the ambiguity effects in second language (L2) word recognition. Previous studies on first language (L1) lexical processing have observed that ambiguous words are recognized faster and more accurately than unambiguous words on lexical decision tasks. In this research, L1 and L2 speakers of English were asked whether a…
ERIC Educational Resources Information Center
Calandruccio, Lauren; Zhou, Haibo
2014-01-01
Purpose: To examine whether improved speech recognition during linguistically mismatched target-masker experiments is due to linguistic unfamiliarity of the masker speech or linguistic dissimilarity between the target and masker speech. Method: Monolingual English speakers (n = 20) and English-Greek simultaneous bilinguals (n = 20) listened to…
Speech Processing and Recognition (SPaRe)
2011-01-01
results in the areas of automatic speech recognition (ASR), speech processing, machine translation (MT), natural language processing (NLP), and … information retrieval (IR) … the IOC was only expected to provide document submission and search; automatic speech recognition (ASR) for English, Spanish, Arabic, and …
GMM-based speaker age and gender classification in Czech and Slovak
NASA Astrophysics Data System (ADS)
Přibil, Jiří; Přibilová, Anna; Matoušek, Jindřich
2017-01-01
The paper describes an experiment using Gaussian mixture models (GMMs) for automatic classification of speaker age and gender. It analyses and compares the influence of different numbers of mixtures and different types of speech features used for GMM gender/age classification. The dependence of the computational complexity on the number of mixtures used is also analysed. Finally, the GMM classification accuracy is compared with the output of conventional listening tests. The results of these objective and subjective evaluations are in agreement.
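A small sketch of the classification scheme described above: one GMM per age/gender class is trained on frame-level features, and a test utterance is assigned to the class with the highest average log-likelihood. The features, class labels, and mixture counts below are placeholders, not the paper's configuration.

```python
# One GaussianMixture per class; classify by maximum average log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
classes = {"female_adult": 0.8, "male_adult": -0.8, "child": 2.0}   # hypothetical feature offsets

models = {}
for label, offset in classes.items():
    train_feats = rng.normal(loc=offset, size=(1500, 16))           # stand-in for MFCC-like frames
    models[label] = GaussianMixture(n_components=16, covariance_type="diag",
                                    random_state=0).fit(train_feats)

def classify(utterance_feats):
    scores = {label: m.score(utterance_feats) for label, m in models.items()}
    return max(scores, key=scores.get), scores

test = rng.normal(loc=-0.8, size=(200, 16))        # frames from a "male_adult"-like voice
label, scores = classify(test)
print(label, {k: round(v, 2) for k, v in scores.items()})
```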
Speech endpoint detection with non-language speech sounds for generic speech processing applications
NASA Astrophysics Data System (ADS)
McClain, Matthew; Romanowski, Brian
2009-05-01
Non-language speech sounds (NLSS) are sounds produced by humans that do not carry linguistic information. Examples of these sounds are coughs, clicks, breaths, and filled pauses such as "uh" and "um" in English. NLSS are prominent in conversational speech, but can be a significant source of errors in speech processing applications. Traditionally, these sounds are ignored by speech endpoint detection algorithms, where speech regions are identified in the audio signal prior to processing. The ability to filter NLSS as a pre-processing step can significantly enhance the performance of many speech processing applications, such as speaker identification, language identification, and automatic speech recognition. In order to be used in all such applications, NLSS detection must be performed without the use of language models that provide knowledge of the phonology and lexical structure of speech. This is especially relevant to situations where the languages used in the audio are not known a priori. We present the results of preliminary experiments using data from American and British English speakers, in which segments of audio are classified as language speech sounds (LSS) or NLSS using a set of acoustic features designed for language-agnostic NLSS detection and a hidden Markov model (HMM) to model speech generation. The results of these experiments indicate that the features and model used are capable of detecting certain types of NLSS, such as breaths and clicks, while detecting other types of NLSS, such as filled pauses, will require further research.
Four-Channel Biosignal Analysis and Feature Extraction for Automatic Emotion Recognition
NASA Astrophysics Data System (ADS)
Kim, Jonghwa; André, Elisabeth
This paper investigates the potential of physiological signals as a reliable channel for automatic recognition of a user's emotional state. For emotion recognition, little attention has been paid so far to physiological signals compared with audio-visual channels such as facial expression or speech. All essential stages of an automatic recognition system using biosignals are discussed, from recording the physiological dataset up to feature-based multiclass classification. Four-channel biosensors are used to measure electromyogram, electrocardiogram, skin conductivity, and respiration changes. A wide range of physiological features from various analysis domains, including time/frequency, entropy, geometric analysis, subband spectra, multiscale entropy, etc., is proposed in order to find the best emotion-relevant features and to correlate them with emotional states. The best extracted features are specified in detail and their effectiveness is demonstrated by emotion recognition results.
Bent, Tessa; Holt, Rachael Frush
2018-02-01
Children's ability to understand speakers with a wide range of dialects and accents is essential for efficient language development and communication in a global society. Here, the impact of regional dialect and foreign-accent variability on children's speech understanding was evaluated in both quiet and noisy conditions. Five- to seven-year-old children ( n = 90) and adults ( n = 96) repeated sentences produced by three speakers with different accents-American English, British English, and Japanese-accented English-in quiet or noisy conditions. Adults had no difficulty understanding any speaker in quiet conditions. Their performance declined for the nonnative speaker with a moderate amount of noise; their performance only substantially declined for the British English speaker (i.e., below 93% correct) when their understanding of the American English speaker was also impeded. In contrast, although children showed accurate word recognition for the American and British English speakers in quiet conditions, they had difficulty understanding the nonnative speaker even under ideal listening conditions. With a moderate amount of noise, their perception of British English speech declined substantially and their ability to understand the nonnative speaker was particularly poor. These results suggest that although school-aged children can understand unfamiliar native dialects under ideal listening conditions, their ability to recognize words in these dialects may be highly susceptible to the influence of environmental degradation. Fully adult-like word identification for speakers with unfamiliar accents and dialects may exhibit a protracted developmental trajectory.
Target recognition based on convolutional neural network
NASA Astrophysics Data System (ADS)
Wang, Liqiang; Wang, Xin; Xi, Fubiao; Dong, Jian
2017-11-01
An important part of object target recognition is feature extraction, which can be divided into manual and automatic feature extraction. The traditional neural network is one of the automatic feature extraction methods, but its global (fully connected) structure makes over-fitting highly likely. The deep learning algorithm used in this paper is a hierarchical automatic feature extraction method, trained layer by layer as a convolutional neural network (CNN), which extracts features from lower layers to higher layers. The resulting features are more discriminative, which benefits object target recognition.
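A minimal sketch of a layer-by-layer CNN feature extractor of the kind the abstract describes, assuming PyTorch; the layer sizes, input resolution, and class count are placeholders, not those of the cited paper.

```python
# Illustrative small CNN for target images: lower layers learn local features,
# higher layers more discriminative ones, followed by a linear classifier.
import torch
import torch.nn as nn

class TargetCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, n_classes)  # assumes 32x32 inputs

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

logits = TargetCNN()(torch.randn(4, 1, 32, 32))  # batch of 4 grayscale images
```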
Automatic recognition of postural allocations.
Sazonov, Edward; Krishnamurthy, Vidya; Makeyev, Oleksandr; Browning, Ray; Schutz, Yves; Hill, James
2007-01-01
A significant part of daily energy expenditure may be attributed to non-exercise activity thermogenesis and exercise activity thermogenesis. Automatic recognition of postural allocations such as standing or sitting can be used in behavioral modification programs aimed at minimizing static postures. In this paper we propose a shoe-based device and related pattern recognition methodology for recognition of postural allocations. Inexpensive technology allows implementation of this methodology as a part of footwear. The experimental results suggest high efficiency and reliability of the proposed approach.
ERIC Educational Resources Information Center
Ryba, Ken; McIvor, Tom; Shakir, Maha; Paez, Di
2006-01-01
This study examined continuous automated speech recognition in the university lecture theatre. The participants were both native speakers of English (L1) and English as a second language students (L2) enrolled in an information systems course (Total N=160). After an initial training period, an L2 lecturer in information systems delivered three…
Influences of High and Low Variability on Infant Word Recognition
ERIC Educational Resources Information Center
Singh, Leher
2008-01-01
Although infants begin to encode and track novel words in fluent speech by 7.5 months, their ability to recognize words is somewhat limited at this stage. In particular, when the surface form of a word is altered, by changing the gender or affective prosody of the speaker, infants begin to falter at spoken word recognition. Given that natural…
Digital signal processing algorithms for automatic voice recognition
NASA Technical Reports Server (NTRS)
Botros, Nazeih M.
1987-01-01
The current digital signal analysis algorithms implemented in automatic voice recognition are investigated. Automatic voice recognition means the capability of a computer to recognize and act on verbal commands. The focus is on digital signal analysis of the speech signal rather than linguistic analysis. Several digital signal processing algorithms are available for voice recognition, among them Linear Predictive Coding (LPC), short-time Fourier analysis, and cepstrum analysis. Among these, LPC is the most widely used: it has a short execution time and does not require large memory storage. However, it has several limitations due to the assumptions used to develop it. The other two algorithms are frequency-domain algorithms with fewer assumptions, but they are not widely implemented or investigated. With recent advances in digital technology, namely signal processors, these two frequency-domain algorithms may be investigated with a view to implementing them in voice recognition. This research is concerned with real-time, microprocessor-based recognition algorithms.
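To make the LPC front end concrete, here is a rough sketch of frame-wise LPC analysis, assuming librosa; the frame sizes, LPC order, and the crude utterance distance are illustrative choices, not the NTRS system described above.

```python
# Hypothetical frame-wise LPC analysis and a naive utterance comparison.
import numpy as np
import librosa

def lpc_frames(wav_path, order=12, frame_len=400, hop=160):
    y, sr = librosa.load(wav_path, sr=16000)
    coeffs = []
    for start in range(0, len(y) - frame_len, hop):
        frame = y[start:start + frame_len] * np.hamming(frame_len)
        coeffs.append(librosa.lpc(frame, order=order))
    return np.array(coeffs)  # one LPC coefficient vector per frame

def utterance_distance(a, b):
    # Mean frame-wise Euclidean distance, truncated to the shorter utterance.
    n = min(len(a), len(b))
    return float(np.mean(np.linalg.norm(a[:n] - b[:n], axis=1)))
```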
Identification and tracking of particular speaker in noisy environment
NASA Astrophysics Data System (ADS)
Sawada, Hideyuki; Ohkado, Minoru
2004-10-01
Humans are able to exchange information smoothly by voice in different situations, such as in a noisy crowd or in the presence of several speakers. We are able to detect the position of a sound source in 3D space, extract a particular sound from mixed sounds, and recognize who is talking. By realizing this mechanism with a computer, new applications become possible, such as recording sound with high quality by reducing noise, presenting a clarified sound, and realizing microphone-free speech recognition by extracting a particular sound. The paper introduces real-time detection and identification of a particular speaker in a noisy environment using a microphone array, based on the location of the speaker and individual voice characteristics. The study will be applied to develop an adaptive auditory system for a mobile robot that collaborates with a factory worker.
2012-03-01
with each SVM discriminating between a pair of the N total speakers in the data set. The N(N − 1)/2 classifiers then vote on the final...classification of a test sample. The Random Forest classifier is an ensemble classifier that votes amongst decision trees generated with each node using...Forest vote, and the effects of overtraining will be mitigated by the fact that each decision tree is overtrained differently (due to the random
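The fragment above describes pairwise (one-vs-one) SVM voting and a Random Forest ensemble. The sketch below, assuming scikit-learn and placeholder feature matrix X and speaker labels y, shows how both classifiers can be set up; it is an illustration of the general technique, not the report's system.

```python
# Hypothetical pairwise-SVM voting and Random Forest speaker classifiers.
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def train_speaker_classifiers(X, y):
    # SVC with 'ovo' trains one SVM per pair of the N speakers (N*(N-1)/2
    # classifiers) and lets them vote on the predicted speaker of a test sample.
    pairwise_svm = SVC(kernel='rbf', decision_function_shape='ovo').fit(X, y)
    # The Random Forest votes among decision trees, each grown on a bootstrap
    # sample with a random feature subset considered at every node.
    forest = RandomForestClassifier(n_estimators=200).fit(X, y)
    return pairwise_svm, forest
```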
Blinded by taboo words in L1 but not L2.
Colbeck, Katie L; Bowers, Jeffrey S
2012-04-01
The present study compares the emotionality of English taboo words in native English speakers and native Chinese speakers who learned English as a second language. Neutral and taboo/sexual words were included in a Rapid Serial Visual Presentation (RSVP) task as to-be-ignored distracters in a short- and long-lag condition. Compared with neutral distracters, taboo/sexual distracters impaired the performance in the short-lag condition only. Of critical note, however, is that the performance of Chinese speakers was less impaired by taboo/sexual distracters. This supports the view that a first language is more emotional than a second language, even when words are processed quickly and automatically. (PsycINFO Database Record (c) 2012 APA, all rights reserved).
AdjScales: Visualizing Differences between Adjectives for Language Learners
NASA Astrophysics Data System (ADS)
Sheinman, Vera; Tokunaga, Takenobu
In this study we introduce AdjScales, a method for scaling similar adjectives by their strength. It combines existing Web-based computational linguistic techniques in order to automatically differentiate between similar adjectives that describe the same property by strength. Though this kind of information is rarely present in most of the lexical resources and dictionaries, it may be useful for language learners that try to distinguish between similar words. Additionally, learners might gain from a simple visualization of these differences using unidimensional scales. The method is evaluated by comparison with annotation on a subset of adjectives from WordNet by four native English speakers. It is also compared against two non-native speakers of English. The collected annotation is an interesting resource in its own right. This work is a first step toward automatic differentiation of meaning between similar words for language learners. AdjScales can be useful for lexical resource enhancement.
Voice input/output capabilities at Perception Technology Corporation
NASA Technical Reports Server (NTRS)
Ferber, Leon A.
1977-01-01
Condensed resumes of key company personnel at the Perception Technology Corporation are presented. The staff possesses expertise in speech recognition, speech synthesis, speaker authentication, and language identification. The capabilities of the hardware and software engineers are included.
Shape and texture fused recognition of flying targets
NASA Astrophysics Data System (ADS)
Kovács, Levente; Utasi, Ákos; Kovács, Andrea; Szirányi, Tamás
2011-06-01
This paper presents visual detection and recognition of flying targets (e.g. planes, missiles) based on automatically extracted shape and object texture information, for application areas like alerting, recognition and tracking. Targets are extracted based on robust background modeling and a novel contour extraction approach, and object recognition is done by comparisons to shape and texture based query results on a previously gathered real life object dataset. Application areas involve passive defense scenarios, including automatic object detection and tracking with cheap commodity hardware components (CPU, camera and GPS).
INTERPOL survey of the use of speaker identification by law enforcement agencies.
Morrison, Geoffrey Stewart; Sahito, Farhan Hyder; Jardine, Gaëlle; Djokic, Djordje; Clavet, Sophie; Berghs, Sabine; Goemans Dorny, Caroline
2016-06-01
A survey was conducted of the use of speaker identification by law enforcement agencies around the world. A questionnaire was circulated to law enforcement agencies in the 190 member countries of INTERPOL. 91 responses were received from 69 countries. 44 respondents reported that they had speaker identification capabilities in house or via external laboratories. Half of these came from Europe. 28 respondents reported that they had databases of audio recordings of speakers. The clearest pattern in the responses was that of diversity. A variety of different approaches to speaker identification were used: The human-supervised-automatic approach was the most popular in North America, the auditory-acoustic-phonetic approach was the most popular in Europe, and the spectrographic/auditory-spectrographic approach was the most popular in Africa, Asia, the Middle East, and South and Central America. Globally, and in Europe, the most popular framework for reporting conclusions was identification/exclusion/inconclusive. In Europe, the second most popular framework was the use of verbal likelihood ratio scales. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Channel Compensation for Speaker Recognition using MAP Adapted PLDA and Denoising DNNs
2016-06-21
improvement has been the availability of large quantities of speaker-labeled data from telephone recordings. For new data applications, such as audio from...microphone channels to the telephone channel. Audio files were rejected if the alignment process failed. At the end of the process a total of 873...Microphones: 01 AT3035 (Audio Technica studio mic), 02 MX418S (Shure gooseneck mic), 03 Crown PZM Soundgrabber II, 04 AT Pro45 (Audio Technica hanging mic)
ERIC Educational Resources Information Center
Nober, E. Harris; Seymour, Harry N.
In order to investigate the possible consequences of dialectical differences in the classroom setting relative to the low income black and white first grade child and the prospective white middle-class teacher, 25 black and 25 white university listeners yielded speech recognition scores for 48 black and 48 white five-year-old urban school-children…
A Prerequisite to L1 Homophone Effects in L2 Spoken-Word Recognition
ERIC Educational Resources Information Center
Nakai, Satsuki; Lindsay, Shane; Ota, Mitsuhiko
2015-01-01
When both members of a phonemic contrast in L2 (second language) are perceptually mapped to a single phoneme in one's L1 (first language), L2 words containing a member of that contrast can spuriously activate L2 words in spoken-word recognition. For example, upon hearing cattle, Dutch speakers of English are reported to experience activation…
Concept Recognition in an Automatic Text-Processing System for the Life Sciences.
ERIC Educational Resources Information Center
Vleduts-Stokolov, Natasha
1987-01-01
Describes a system developed for the automatic recognition of biological concepts in titles of scientific articles; reports results of several pilot experiments which tested the system's performance; analyzes typical ambiguity problems encountered by the system; describes a disambiguation technique that was developed; and discusses future plans…
Sensing of Particular Speakers for the Construction of Voice Interface Utilized in Noisy Environment
NASA Astrophysics Data System (ADS)
Sawada, Hideyuki; Ohkado, Minoru
Humans are able to exchange information smoothly by voice in different situations, such as in a noisy crowd or in the presence of several speakers. We are able to detect the position of a sound source in 3D space, extract a particular sound from mixed sounds, and recognize who is talking. By realizing this mechanism with a computer, new applications become possible, such as recording sound with high quality by reducing noise, presenting a clarified sound, and realizing microphone-free speech recognition by extracting a particular sound. The paper introduces real-time detection and identification of a particular speaker in a noisy environment using a microphone array, based on the location of the speaker and individual voice characteristics. The study will be applied to develop an adaptive auditory system for a mobile robot that collaborates with a factory worker.
Lexical constraints in second language learning: Evidence on grammatical gender in German*
BOBB, SUSAN C.; KROLL, JUDITH F.; JACKSON, CARRIE N.
2015-01-01
The present study asked whether or not the apparent insensitivity of second language (L2) learners to grammatical gender violations reflects an inability to use grammatical information during L2 lexical processing. Native German speakers and English speakers with intermediate to advanced L2 proficiency in German performed a translation-recognition task. On critical trials, an incorrect translation was presented that either matched or mismatched the grammatical gender of the correct translation. Results show interference for native German speakers in conditions in which the incorrect translation matched the gender of the correct translation. Native English speakers, regardless of German proficiency, were insensitive to the gender mismatch. In contrast, these same participants were correctly able to assign gender to critical items. These findings suggest a dissociation between explicit knowledge and the ability to use that information under speeded processing conditions and demonstrate the difficulty of L2 gender processing at the lexical level. PMID:26346327
Speaker Recognition Through NLP and CWT Modeling
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brown-VanHoozer, S.A.; Kercel, S.W.; Tucker, R.W.
The objective of this research is to develop a system capable of identifying speakers on wiretaps from a large database (>500 speakers) with a short search time duration (<30 seconds), and with better than 90% accuracy. Much previous research in speaker recognition has led to algorithms that produced encouraging preliminary results, but were overwhelmed when applied to populations of more than a dozen or so different speakers. The authors are investigating a solution to the "large population" problem by seeking two completely different kinds of characterizing features. These features are extracted using the techniques of Neuro-Linguistic Programming (NLP) and the continuous wavelet transform (CWT). NLP extracts precise neurological, verbal and non-verbal information, and assimilates the information into useful patterns. These patterns are based on specific cues demonstrated by each individual, and provide ways of determining congruency between verbal and non-verbal cues. The primary NLP modalities are characterized through word spotting (or verbal predicate cues, e.g., see, sound, feel, etc.) while the secondary modalities would be characterized through the speech transcription used by the individual. This has the practical effect of reducing the size of the search space, and greatly speeding up the process of identifying an unknown speaker. The wavelet-based line of investigation concentrates on using vowel phonemes and non-verbal cues, such as tempo. The rationale for concentrating on vowels is that there are a limited number of vowel phonemes, and at least one of them usually appears in even the shortest of speech segments. Using the fast CWT algorithm, the details of both the formant frequency and the glottal excitation characteristics can be easily extracted from voice waveforms. The differences in the glottal excitation waveforms as well as the formant frequency are evident in the CWT output. More significantly, the CWT reveals significant detail of the glottal excitation waveform.
Speaker recognition through NLP and CWT modeling.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brown-VanHoozer, A.; Kercel, S. W.; Tucker, R. W.
The objective of this research is to develop a system capable of identifying speakers on wiretaps from a large database (>500 speakers) with a short search time duration (<30 seconds), and with better than 90% accuracy. Much previous research in speaker recognition has led to algorithms that produced encouraging preliminary results, but were overwhelmed when applied to populations of more than a dozen or so different speakers. The authors are investigating a solution to the "huge population" problem by seeking two completely different kinds of characterizing features. These features are extracted using the techniques of Neuro-Linguistic Programming (NLP) and the continuous wavelet transform (CWT). NLP extracts precise neurological, verbal and non-verbal information, and assimilates the information into useful patterns. These patterns are based on specific cues demonstrated by each individual, and provide ways of determining congruency between verbal and non-verbal cues. The primary NLP modalities are characterized through word spotting (or verbal predicate cues, e.g., see, sound, feel, etc.) while the secondary modalities would be characterized through the speech transcription used by the individual. This has the practical effect of reducing the size of the search space, and greatly speeding up the process of identifying an unknown speaker. The wavelet-based line of investigation concentrates on using vowel phonemes and non-verbal cues, such as tempo. The rationale for concentrating on vowels is that there are a limited number of vowel phonemes, and at least one of them usually appears in even the shortest of speech segments. Using the fast CWT algorithm, the details of both the formant frequency and the glottal excitation characteristics can be easily extracted from voice waveforms. The differences in the glottal excitation waveforms as well as the formant frequency are evident in the CWT output. More significantly, the CWT reveals significant detail of the glottal excitation waveform.
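A minimal sketch, assuming PyWavelets (pywt) and librosa, of computing a continuous wavelet transform of a voiced segment so that formant-like and glottal-excitation detail can be inspected across scales; the wavelet, scale range, and sample rate are illustrative, and this is not the authors' NLP/CWT system.

```python
# Hypothetical CWT analysis of a short voice recording.
import numpy as np
import pywt
import librosa

def voice_cwt(wav_path, scales=np.arange(1, 128), wavelet='morl'):
    y, sr = librosa.load(wav_path, sr=8000)
    # Coefficients have shape (len(scales), len(y)): rows index scale (inverse
    # frequency), columns index time, so ridges trace formant-like structure.
    coefs, freqs = pywt.cwt(y, scales, wavelet, sampling_period=1.0 / sr)
    return coefs, freqs
```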
Speaker-sensitive emotion recognition via ranking: Studies on acted and spontaneous speech☆
Cao, Houwei; Verma, Ragini; Nenkova, Ani
2014-01-01
We introduce a ranking approach for emotion recognition which naturally incorporates information about the general expressivity of speakers. We demonstrate that our approach leads to substantial gains in accuracy compared to conventional approaches. We train ranking SVMs for individual emotions, treating the data from each speaker as a separate query, and combine the predictions from all rankers to perform multi-class prediction. The ranking method provides two natural benefits. It captures speaker specific information even in speaker-independent training/testing conditions. It also incorporates the intuition that each utterance can express a mix of possible emotion and that considering the degree to which each emotion is expressed can be productively exploited to identify the dominant emotion. We compare the performance of the rankers and their combination to standard SVM classification approaches on two publicly available datasets of acted emotional speech, Berlin and LDC, as well as on spontaneous emotional data from the FAU Aibo dataset. On acted data, ranking approaches exhibit significantly better performance compared to SVM classification both in distinguishing a specific emotion from all others and in multi-class prediction. On the spontaneous data, which contains mostly neutral utterances with a relatively small portion of less intense emotional utterances, ranking-based classifiers again achieve much higher precision in identifying emotional utterances than conventional SVM classifiers. In addition, we discuss the complementarity of conventional SVM and ranking-based classifiers. On all three datasets we find dramatically higher accuracy for the test items on whose prediction the two methods agree compared to the accuracy of individual methods. Furthermore on the spontaneous data the ranking and standard classification are complementary and we obtain marked improvement when we combine the two classifiers by late-stage fusion. PMID:25422534
Speaker-sensitive emotion recognition via ranking: Studies on acted and spontaneous speech☆
Cao, Houwei; Verma, Ragini; Nenkova, Ani
2015-01-01
We introduce a ranking approach for emotion recognition which naturally incorporates information about the general expressivity of speakers. We demonstrate that our approach leads to substantial gains in accuracy compared to conventional approaches. We train ranking SVMs for individual emotions, treating the data from each speaker as a separate query, and combine the predictions from all rankers to perform multi-class prediction. The ranking method provides two natural benefits. It captures speaker specific information even in speaker-independent training/testing conditions. It also incorporates the intuition that each utterance can express a mix of possible emotion and that considering the degree to which each emotion is expressed can be productively exploited to identify the dominant emotion. We compare the performance of the rankers and their combination to standard SVM classification approaches on two publicly available datasets of acted emotional speech, Berlin and LDC, as well as on spontaneous emotional data from the FAU Aibo dataset. On acted data, ranking approaches exhibit significantly better performance compared to SVM classification both in distinguishing a specific emotion from all others and in multi-class prediction. On the spontaneous data, which contains mostly neutral utterances with a relatively small portion of less intense emotional utterances, ranking-based classifiers again achieve much higher precision in identifying emotional utterances than conventional SVM classifiers. In addition, we discuss the complementarity of conventional SVM and ranking-based classifiers. On all three datasets we find dramatically higher accuracy for the test items on whose prediction the two methods agree compared to the accuracy of individual methods. Furthermore on the spontaneous data the ranking and standard classification are complementary and we obtain marked improvement when we combine the two classifiers by late-stage fusion.
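As a hedged sketch of the ranking idea described above: for each emotion, a linear ranker can be trained on within-speaker pairwise feature differences so that utterances of that emotion rank above the same speaker's other utterances, and the final label is the emotion whose ranker scores the utterance highest. The code assumes scikit-learn, a feature matrix X, and label lists; it is not the authors' exact ranking-SVM implementation.

```python
# Hypothetical per-emotion rankers trained on within-speaker pairwise differences.
import numpy as np
from sklearn.svm import LinearSVC

def train_emotion_rankers(X, emotions, speakers, emotion_set):
    rankers = {}
    for emo in emotion_set:
        pairs, labels = [], []
        for spk in set(speakers):
            idx = np.where(np.array(speakers) == spk)[0]
            pos = [i for i in idx if emotions[i] == emo]
            neg = [i for i in idx if emotions[i] != emo]
            for i in pos:
                for j in neg:
                    pairs.append(X[i] - X[j]); labels.append(1)
                    pairs.append(X[j] - X[i]); labels.append(0)
        rankers[emo] = LinearSVC().fit(np.array(pairs), np.array(labels))
    return rankers

def predict_emotion(x, rankers):
    # Combine rankers by taking the emotion whose ranking score is largest.
    return max(rankers, key=lambda e: float(rankers[e].decision_function([x])[0]))
```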
The search for common ground: Part I. Lexical performance by linguistically diverse learners.
Windsor, Jennifer; Kohnert, Kathryn
2004-08-01
This study examines lexical performance by 3 groups of linguistically diverse school-age learners: English-only speakers with primary language impairment (LI), typical English-only speakers (EO), and typical bilingual Spanish-English speakers (BI). The accuracy and response time (RT) of 100 8- to 13-year-old children in word recognition and picture-naming tasks were analyzed. Within each task, stimulus difficulty was manipulated to include very easy stimuli (words that were high frequency/had an early age of acquisition in English) and more difficult stimuli (words of low frequency/late age of acquisition [AOA]). There was no difference among groups in real-word recognition accuracy or RT; all 3 groups showed lower accuracy with low-frequency words. In picture naming, all 3 groups showed a longer RT for words with a late AOA, although AOA had a disproportionate negative impact on BI performance. The EO group was faster and more accurate than both LI and BI groups in conditions with later acquired stimuli. Results are discussed in terms of quantitative differences separating EO children from the other 2 groups and qualitative similarities linking monolingual children with and without LI.
Improving Speaker Recognition by Biometric Voice Deconstruction
Mazaira-Fernandez, Luis Miguel; Álvarez-Marquina, Agustín; Gómez-Vilda, Pedro
2015-01-01
Person identification, especially in critical environments, has always been a subject of great interest. However, it has gained a new dimension in a world threatened by a new kind of terrorism that uses social networks (e.g., YouTube) to broadcast its message. In this new scenario, classical identification methods (such as fingerprints or face recognition) have had to be replaced by alternative biometric characteristics such as voice, as sometimes this is the only feature available. The present study benefits from the advances achieved in recent years in understanding and modeling voice production. The paper hypothesizes that a gender-dependent characterization of speakers, combined with the use of a set of features derived from the components resulting from the deconstruction of the voice into its glottal source and vocal tract estimates, will enhance recognition rates when compared to classical approaches. A general description of the main hypothesis and the methodology followed to extract the gender-dependent extended biometric parameters is given. Experimental validation is carried out both on a database recorded under highly controlled acoustic conditions and on one recorded over a mobile phone network under non-controlled acoustic conditions. PMID:26442245
Improving Speaker Recognition by Biometric Voice Deconstruction.
Mazaira-Fernandez, Luis Miguel; Álvarez-Marquina, Agustín; Gómez-Vilda, Pedro
2015-01-01
Person identification, especially in critical environments, has always been a subject of great interest. However, it has gained a new dimension in a world threatened by a new kind of terrorism that uses social networks (e.g., YouTube) to broadcast its message. In this new scenario, classical identification methods (such as fingerprints or face recognition) have had to be replaced by alternative biometric characteristics such as voice, as sometimes this is the only feature available. The present study benefits from the advances achieved in recent years in understanding and modeling voice production. The paper hypothesizes that a gender-dependent characterization of speakers, combined with the use of a set of features derived from the components resulting from the deconstruction of the voice into its glottal source and vocal tract estimates, will enhance recognition rates when compared to classical approaches. A general description of the main hypothesis and the methodology followed to extract the gender-dependent extended biometric parameters is given. Experimental validation is carried out both on a database recorded under highly controlled acoustic conditions and on one recorded over a mobile phone network under non-controlled acoustic conditions.
Optimization of multilayer neural network parameters for speaker recognition
NASA Astrophysics Data System (ADS)
Tovarek, Jaromir; Partila, Pavol; Rozhon, Jan; Voznak, Miroslav; Skapa, Jan; Uhrin, Dominik; Chmelikova, Zdenka
2016-05-01
This article discusses the impact of multilayer neural network parameters on speaker identification. The main task of speaker identification is to find a specific person within a known set of speakers; that is, the voice of an unknown speaker (the wanted person) is matched against a group of reference speakers from the voice database. One of the requirements was to develop a text-independent system, which means classifying the wanted person regardless of content and language. A multilayer neural network has been used for speaker identification in this research. An artificial neural network (ANN) requires setting parameters such as the activation function of the neurons, the steepness of the activation functions, the learning rate, the maximum number of iterations, and the number of neurons in the hidden and output layers. ANN accuracy and validation time are directly influenced by these parameter settings, and different tasks require different settings. Identification accuracy and ANN validation time were evaluated with the same input data but different parameter settings. The goal was to find the parameters for the neural network with the highest precision and shortest validation time. The input data of the neural network are Mel-frequency cepstral coefficients (MFCCs), which describe the properties of the vocal tract. Audio samples were recorded for all speakers in a laboratory environment and split into training, testing, and validation sets of 70%, 15%, and 15%. The result of the research described in this article is the parameter setting of the multilayer neural network for four speakers.
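The kind of parameter sweep described above can be sketched as follows, assuming scikit-learn: a multilayer perceptron speaker classifier over MFCC-based utterance vectors is trained for several activation functions, learning rates, and hidden-layer sizes, and validation accuracy picks the winner. Feature extraction and the 70/15/15 split are assumed to be done elsewhere; the grid values are illustrative.

```python
# Hypothetical MLP parameter sweep for speaker identification.
from itertools import product
from sklearn.neural_network import MLPClassifier

def sweep_mlp(X_train, y_train, X_val, y_val):
    best = (None, -1.0)
    for activation, lr, hidden in product(('logistic', 'tanh', 'relu'),
                                          (1e-3, 1e-2),
                                          ((32,), (64, 32))):
        clf = MLPClassifier(hidden_layer_sizes=hidden, activation=activation,
                            learning_rate_init=lr, max_iter=500)
        clf.fit(X_train, y_train)
        acc = clf.score(X_val, y_val)  # validation accuracy for this setting
        if acc > best[1]:
            best = (clf, acc)
    return best  # classifier with the highest validation accuracy
```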
Fuzzy Logic-Based Audio Pattern Recognition
NASA Astrophysics Data System (ADS)
Malcangi, M.
2008-11-01
Audio and audio-pattern recognition is becoming one of the most important technologies for automatically controlling embedded systems. Fuzzy logic may be the most important enabling methodology due to its ability to model such applications rapidly and economically. An audio and audio-pattern recognition engine based on fuzzy logic has been developed for use in very low-cost and deeply embedded systems to automate human-to-machine and machine-to-machine interaction. This engine consists of simple digital signal-processing algorithms for feature extraction and normalization, and a set of pattern-recognition rules manually tuned or automatically tuned by a self-learning process.
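A toy illustration of the fuzzy-logic pattern-recognition idea: triangular membership functions over two normalized audio features and a small hand-tuned rule set. The features, membership ranges, and rules are invented for illustration and are not those of the cited engine.

```python
# Hypothetical fuzzy rules over normalized frame energy and zero-crossing rate.
def tri(x, a, b, c):
    # Triangular membership function peaking at b, zero outside [a, c].
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def classify_frame(energy, zcr):
    # Rule 1: high energy AND low ZCR  -> "voiced speech"
    # Rule 2: low energy AND high ZCR  -> "noise-like"
    rules = {
        'voiced speech': min(tri(energy, 0.4, 1.0, 1.6), tri(zcr, -0.2, 0.0, 0.4)),
        'noise-like':    min(tri(energy, -0.2, 0.0, 0.4), tri(zcr, 0.4, 1.0, 1.6)),
    }
    return max(rules, key=rules.get), rules

label, strengths = classify_frame(energy=0.8, zcr=0.1)  # -> 'voiced speech'
```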
Automatic violence detection in digital movies
NASA Astrophysics Data System (ADS)
Fischer, Stephan
1996-11-01
Research on computer-based recognition of violence is scant. We are working on the automatic recognition of violence in digital movies, a first step towards the goal of a computer-assisted system capable of protecting children against TV programs containing a great deal of violence. In the video domain, collision detection and model-mapping to locate human figures are run, while the creation and comparison of fingerprints to find certain events are run in the audio domain. This article centers on the recognition of fistfights in the video domain and on the recognition of shots, explosions and cries in the audio domain.
ERIC Educational Resources Information Center
Chen, Howard Hao-Jan
2011-01-01
Oral communication ability has become increasingly important to many EFL students. Several commercial software programs based on automatic speech recognition (ASR) technologies are available but their prices are not affordable for many students. This paper will demonstrate how the Microsoft Speech Application Software Development Kit (SASDK), a…
Automatic speech recognition in air traffic control
NASA Technical Reports Server (NTRS)
Karlsson, Joakim
1990-01-01
Automatic Speech Recognition (ASR) technology and its application to the Air Traffic Control system are described. The advantages of applying ASR to Air Traffic Control, as well as criteria for choosing a suitable ASR system are presented. Results from previous research and directions for future work at the Flight Transportation Laboratory are outlined.
Automatic Speech Recognition: Reliability and Pedagogical Implications for Teaching Pronunciation
ERIC Educational Resources Information Center
Kim, In-Seok
2006-01-01
This study examines the reliability of automatic speech recognition (ASR) software used to teach English pronunciation, focusing on one particular piece of software, "FluSpeak, as a typical example." Thirty-six Korean English as a Foreign Language (EFL) college students participated in an experiment in which they listened to 15 sentences…
Automatic Speech Recognition Technology as an Effective Means for Teaching Pronunciation
ERIC Educational Resources Information Center
Elimat, Amal Khalil; AbuSeileek, Ali Farhan
2014-01-01
This study aimed to explore the effect of using automatic speech recognition technology (ASR) on the third grade EFL students' performance in pronunciation, whether teaching pronunciation through ASR is better than regular instruction, and the most effective teaching technique (individual work, pair work, or group work) in teaching pronunciation…
Automatization and Orthographic Development in Second Language Visual Word Recognition
ERIC Educational Resources Information Center
Kida, Shusaku
2016-01-01
The present study investigated second language (L2) learners' acquisition of automatic word recognition and the development of L2 orthographic representation in the mental lexicon. Participants in the study were Japanese university students enrolled in a compulsory course involving a weekly 30-minute sustained silent reading (SSR) activity with…
Evaluating Automatic Speech Recognition-Based Language Learning Systems: A Case Study
ERIC Educational Resources Information Center
van Doremalen, Joost; Boves, Lou; Colpaert, Jozef; Cucchiarini, Catia; Strik, Helmer
2016-01-01
The purpose of this research was to evaluate a prototype of an automatic speech recognition (ASR)-based language learning system that provides feedback on different aspects of speaking performance (pronunciation, morphology and syntax) to students of Dutch as a second language. We carried out usability reviews, expert reviews and user tests to…
Factor analysis of auto-associative neural networks with application in speaker verification.
Garimella, Sri; Hermansky, Hynek
2013-04-01
An auto-associative neural network (AANN) is a fully connected feed-forward neural network, trained to reconstruct its input at its output through a hidden compression layer, which has fewer nodes than the dimensionality of the input. AANNs are used to model speakers in speaker verification, where a speaker-specific AANN model is obtained by adapting (or retraining) the universal background model (UBM) AANN, an AANN trained on multiple held-out speakers, using the corresponding speaker data. When the amount of speaker data is limited, this adaptation procedure may lead to overfitting as all the parameters of the UBM-AANN are adapted. In this paper, we introduce and develop the factor analysis theory of AANNs to alleviate this problem. We hypothesize that only the weight matrix connecting the last nonlinear hidden layer and the output layer is speaker-specific, and further restrict it to a common low-dimensional subspace during adaptation. The subspace is learned using large amounts of development data, and is held fixed during adaptation. Thus, only the coordinates in the subspace, also known as the i-vector, need to be estimated using speaker-specific data. The update equations are derived for learning both the common low-dimensional subspace and the i-vectors corresponding to speakers in the subspace. The resultant i-vector representation is used as a feature for the probabilistic linear discriminant analysis model. The proposed system shows promising results on the NIST-08 speaker recognition evaluation (SRE), and yields a 23% relative improvement in equal error rate over the previously proposed weighted least squares-based subspace AANNs system. The experiments on NIST-10 SRE confirm that these improvements are consistent and generalize across datasets.
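A minimal sketch, assuming PyTorch, of an auto-associative network with a hidden compression layer trained to reconstruct its input; the layer sizes are placeholders, and the factor-analysis/i-vector adaptation described above is not reproduced here.

```python
# Hypothetical UBM-style AANN trained on pooled acoustic feature frames.
import torch
import torch.nn as nn

class AANN(nn.Module):
    def __init__(self, dim=60, bottleneck=20):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 40), nn.Tanh(),
                                     nn.Linear(40, bottleneck), nn.Tanh())
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 40), nn.Tanh(),
                                     nn.Linear(40, dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_ubm_aann(frames, epochs=10, lr=1e-3):
    # frames: a (num_frames x dim) float tensor of development-data features.
    model = AANN(dim=frames.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(frames), frames)  # reconstruction error
        loss.backward()
        opt.step()
    return model
```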
ERIC Educational Resources Information Center
Ashrapova, Alsu; Alendeeva, Svetlana
2014-01-01
This article is the result of a study of the influence of English and German on the Russian language during the English learning based on lexical borrowings in the field of economics. This paper discusses the use and recognition of borrowings from the English and German languages by Russian native speakers. The use of lexical borrowings from…
Speech recognition: Acoustic-phonetic knowledge acquisition and representation
NASA Astrophysics Data System (ADS)
Zue, Victor W.
1988-09-01
The long-term research goal is to develop and implement speaker-independent continuous speech recognition systems. It is believed that the proper utilization of speech-specific knowledge is essential for such advanced systems. This research is thus directed toward the acquisition, quantification, and representation of acoustic-phonetic and lexical knowledge, and the application of this knowledge to speech recognition algorithms. In addition, we are exploring new speech recognition alternatives based on artificial intelligence and connectionist techniques. We developed a statistical model for predicting the acoustic realization of stop consonants in various positions in the syllable template. A unification-based grammatical formalism was developed for incorporating this model into the lexical access algorithm. We provided an information-theoretic justification for the hierarchical structure of the syllable template. We analyzed segmental durations for vowels and fricatives in continuous speech. Based on contextual information, we developed durational models for vowels and fricatives that account for over 70 percent of the variance, using data from multiple, unknown speakers. We rigorously evaluated the ability of human spectrogram readers to identify stop consonants spoken by many talkers and in a variety of phonetic contexts. Incorporating the declarative knowledge used by the readers, we developed a knowledge-based system for stop identification. We achieved system performance comparable to that of the readers.
Robust recognition of loud and Lombard speech in the fighter cockpit environment
NASA Astrophysics Data System (ADS)
Stanton, Bill J., Jr.
1988-08-01
There are a number of challenges associated with incorporating speech recognition technology into the fighter cockpit. One of the major problems is the wide range of variability in the pilot's voice, which can result from changing levels of stress and workload. Increasing the training set to include abnormal speech is not an attractive option because of the innumerable conditions that would have to be represented and the inordinate amount of time needed to collect such a training set. A more promising approach is to study subsets of abnormal speech that have been produced under controlled cockpit conditions with the purpose of characterizing reliable shifts that occur relative to normal speech. Such was the initiative of this research. Analyses were conducted for 18 features on 17671 phoneme tokens across eight speakers for normal, loud, and Lombard speech. It was discovered that there was a consistent migration of energy in the sonorants. This discovery of reliable energy shifts led to the development of a method to reduce or eliminate these shifts in the Euclidean distances between LPC log magnitude spectra. This combination significantly improved recognition performance for loud and Lombard speech. Discrepancies in recognition error rates between normal and abnormal speech were reduced by approximately 50 percent for all eight speakers combined.
Perception of relative location of F0 within the speaking range
NASA Astrophysics Data System (ADS)
Honorof, Douglas N.; Whalen, D. H.
2003-10-01
It has been argued that intrinsic fundamental frequency (IF0) is an automatic consequence of vowel production [Whalen et al., J. Phon. 27, 125-142 (1999)], yet speakers do not adjust F0 so as to overcome IF0. It may be that so adjusting F0 would distort information about F0 range-information important to the interpretation of F0. Therefore, a speech production/perception experiment was designed to determine whether listeners can perceive position within a speaker-specific F0 range on the basis of isolated tokens. Ten male and ten female adult native speakers of US English were recorded speaking (not singing) the vowel /a/ on eight different pitches spaced throughout speaker-specific ranges. Recordings were randomized across speakers. Naive listeners made pitch-magnitude estimates of the location of F0 relative to each speaker's range. Preliminary results show correlations between estimated and actual location within the range. Adjusting F0 to compensate for IF0 differences between vowels would seem to obscure voice quality in such a way as to make it difficult for the listener to recover relative F0, requiring a greater perceptual adjustment than simply normalizing for IF0. [Work supported by NIH Grant No. DC02717.]
Evolving Spiking Neural Networks for Recognition of Aged Voices.
Silva, Marco; Vellasco, Marley M B R; Cataldo, Edson
2017-01-01
The aging of the voice, known as presbyphonia, is a natural process that can cause great change in vocal quality of the individual. This is a relevant problem to those people who use their voices professionally, and its early identification can help determine a suitable treatment to avoid its progress or even to eliminate the problem. This work focuses on the development of a new model for the identification of aging voices (independently of their chronological age), using as input attributes parameters extracted from the voice and glottal signals. The proposed model, named Quantum binary-real evolving Spiking Neural Network (QbrSNN), is based on spiking neural networks (SNNs), with an unsupervised training algorithm, and a Quantum-Inspired Evolutionary Algorithm that automatically determines the most relevant attributes and the optimal parameters that configure the SNN. The QbrSNN model was evaluated in a database composed of 120 records, containing samples from three groups of speakers. The results obtained indicate that the proposed model provides better accuracy than other approaches, with fewer input attributes. Copyright © 2017 The Voice Foundation. Published by Elsevier Inc. All rights reserved.
ERIC Educational Resources Information Center
Sidgi, Lina Fathi Sidig; Shaari, Ahmad Jelani
2017-01-01
The use of technology, such as computer-assisted language learning (CALL), is used in teaching and learning in the foreign language classrooms where it is most needed. One promising emerging technology that supports language learning is automatic speech recognition (ASR). Integrating such technology, especially in the instruction of pronunciation…
ERIC Educational Resources Information Center
Mayer, Andreas; Motsch, Hans-Joachim
2015-01-01
This study analysed the effects of a classroom intervention focusing on phonological awareness and/or automatized word recognition in children with a deficit in the domains of phonological awareness and rapid automatized naming ("double deficit"). According to the double-deficit hypothesis (Wolf & Bowers, 1999), these children belong…
Effects of a metronome on the filled pauses of fluent speakers.
Christenfeld, N
1996-12-01
Filled pauses (the "ums" and "uhs" that litter spontaneous speech) seem to be a product of the speaker paying deliberate attention to the normally automatic act of talking. This is the same sort of explanation that has been offered for stuttering. In this paper we explore whether a manipulation that has long been known to decrease stuttering, synchronizing speech to the beats of a metronome, will then also decrease filled pauses. Two experiments indicate that a metronome has a dramatic effect on the production of filled pauses. This effect is not due to any simplification or slowing of the speech and supports the view that a metronome causes speakers to attend more to how they are talking and less to what they are saying. It also lends support to the connection between stutters and filled pauses.
Face Recognition From One Example View.
1995-09-01
Rapid Extraction of Lexical Tone Phonology in Chinese Characters: A Visual Mismatch Negativity Study
Wang, Xiao-Dong; Liu, A-Ping; Wu, Yin-Yuan; Wang, Peng
2013-01-01
Background In alphabetic languages, emerging evidence from behavioral and neuroimaging studies shows the rapid and automatic activation of phonological information in visual word recognition. In the mapping from orthography to phonology, unlike most alphabetic languages in which there is a natural correspondence between the visual and phonological forms, in logographic Chinese, the mapping between visual and phonological forms is rather arbitrary and depends on learning and experience. The issue of whether the phonological information is rapidly and automatically extracted in Chinese characters by the brain has not yet been thoroughly addressed. Methodology/Principal Findings We continuously presented Chinese characters differing in orthography and meaning to adult native Mandarin Chinese speakers to construct a constant varying visual stream. In the stream, most stimuli were homophones of Chinese characters: The phonological features embedded in these visual characters were the same, including consonants, vowels and the lexical tone. Occasionally, the rule of phonology was randomly violated by characters whose phonological features differed in the lexical tone. Conclusions/Significance We showed that the violation of the lexical tone phonology evoked an early, robust visual response, as revealed by whole-head electrical recordings of the visual mismatch negativity (vMMN), indicating the rapid extraction of phonological information embedded in Chinese characters. Source analysis revealed that the vMMN was involved in neural activations of the visual cortex, suggesting that the visual sensory memory is sensitive to phonological information embedded in visual words at an early processing stage. PMID:23437235
Automatic Target Recognition Based on Cross-Plot
Wong, Kelvin Kian Loong; Abbott, Derek
2011-01-01
Automatic target recognition that relies on rapid feature extraction of real-time target from photo-realistic imaging will enable efficient identification of target patterns. To achieve this objective, Cross-plots of binary patterns are explored as potential signatures for the observed target by high-speed capture of the crucial spatial features using minimal computational resources. Target recognition was implemented based on the proposed pattern recognition concept and tested rigorously for its precision and recall performance. We conclude that Cross-plotting is able to produce a digital fingerprint of a target that correlates efficiently and effectively to signatures of patterns having its identity in a target repository. PMID:21980508
Presentation video retrieval using automatically recovered slide and spoken text
NASA Astrophysics Data System (ADS)
Cooper, Matthew
2013-03-01
Video is becoming a prevalent medium for e-learning. Lecture videos contain text information in both the presentation slides and lecturer's speech. This paper examines the relative utility of automatically recovered text from these sources for lecture video retrieval. To extract the visual information, we automatically detect slides within the videos and apply optical character recognition to obtain their text. Automatic speech recognition is used similarly to extract spoken text from the recorded audio. We perform controlled experiments with manually created ground truth for both the slide and spoken text from more than 60 hours of lecture video. We compare the automatically extracted slide and spoken text in terms of accuracy relative to ground truth, overlap with one another, and utility for video retrieval. Results reveal that automatically recovered slide text and spoken text contain different content with varying error profiles. Experiments demonstrate that automatically extracted slide text enables higher precision video retrieval than automatically recovered spoken text.
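A sketch of combining OCR'd slide text with ASR transcripts for lecture video retrieval, assuming pytesseract for OCR and scikit-learn's TF-IDF for ranking; slide detection and the ASR step themselves are assumed to have already produced the inputs (one representative slide frame image and one transcript string per video). This illustrates the general idea, not the cited system.

```python
# Hypothetical index of lecture videos from slide OCR text plus spoken text.
import pytesseract
from PIL import Image
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def index_videos(slide_frames, transcripts):
    # One document per video: recovered slide text concatenated with spoken text.
    docs = [pytesseract.image_to_string(Image.open(f)) + ' ' + t
            for f, t in zip(slide_frames, transcripts)]
    vectorizer = TfidfVectorizer(stop_words='english')
    return vectorizer, vectorizer.fit_transform(docs)

def search(query, vectorizer, doc_matrix):
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    return scores.argsort()[::-1]  # video indices ranked by relevance
```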
Image-based automatic recognition of larvae
NASA Astrophysics Data System (ADS)
Sang, Ru; Yu, Guiying; Fan, Weijun; Guo, Tiantai
2010-08-01
To date, research on quarantine pest recognition has mainly focused on imagoes (adult insects). However, pests in their larval stage are latent, and larvae spread easily with the circulation of agricultural and forest products. In this paper, larvae are taken as the new research objects and recognized by means of machine vision, image processing, and pattern recognition. More visual information is preserved and the recognition rate is improved when color image segmentation is applied to larva images. The scale-invariant feature transform (SIFT), which offers affine, perspective, and brightness invariance, is adopted for feature extraction. A neural network algorithm is used for pattern recognition, and automatic identification of larva images is achieved with satisfactory results.
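As an illustration of the SIFT step mentioned above, the following assumes OpenCV with SIFT available (cv2.SIFT_create); the color segmentation and the neural-network matcher from the abstract are not shown.

```python
# Hypothetical SIFT descriptor extraction from a larva image.
import cv2

def sift_descriptors(image_path):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    return keypoints, descriptors  # descriptors: N x 128 array (or None)
```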
Offline Arabic handwriting recognition: a survey.
Lorigo, Liana M; Govindaraju, Venu
2006-05-01
The automatic recognition of text on scanned images has enabled many applications such as searching for words in large volumes of documents, automatic sorting of postal mail, and convenient editing of previously printed documents. The domain of handwriting in the Arabic script presents unique technical challenges and has been addressed more recently than other domains. Many different methods have been proposed and applied to various types of images. This paper provides a comprehensive review of these methods. It is the first survey to focus on Arabic handwriting recognition and the first Arabic character recognition survey to provide recognition rates and descriptions of test data for the approaches discussed. It includes background on the field, discussion of the methods, and future research directions.
Automatic face recognition in HDR imaging
NASA Astrophysics Data System (ADS)
Pereira, Manuela; Moreno, Juan-Carlos; Proença, Hugo; Pinheiro, António M. G.
2014-05-01
The growing popularity of new High Dynamic Range (HDR) imaging systems is raising new privacy issues caused by the methods used for visualization. HDR images require tone mapping for appropriate visualization on conventional, inexpensive LDR displays, and different tone mapping methods can produce completely different visualizations, raising several privacy concerns. In fact, some visualization methods allow perceptual recognition of the individuals, while others reveal no identity at all. Although perceptual recognition may be possible, a natural question is how computer-based recognition performs on tone-mapped images. In this paper, automatic face recognition using sparse representation is tested with images produced by common tone mapping operators applied to HDR images, and its ability to recognize face identity is described. Furthermore, typical LDR images are used for face recognition training.
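A rough sketch of sparse-representation classification (SRC) for face identity, assuming scikit-learn: a probe image is coded as a sparse linear combination of gallery images and assigned to the class with the smallest reconstruction residual. Tone mapping of the HDR input is assumed to happen upstream; the solver and sparsity level are illustrative choices, not the paper's exact method.

```python
# Hypothetical SRC identification over vectorized face images.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def src_identify(gallery, labels, probe, n_nonzero=10):
    # gallery: (n_pixels x n_images) matrix whose columns are gallery faces;
    # labels: identity label per gallery column; probe: one vectorized image.
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero,
                                    fit_intercept=False).fit(gallery, probe)
    code = omp.coef_
    residuals = {}
    for label in set(labels):
        mask = np.array(labels) == label
        part = np.zeros_like(code)
        part[mask] = code[mask]  # keep only this identity's coefficients
        residuals[label] = np.linalg.norm(probe - gallery @ part)
    return min(residuals, key=residuals.get)
```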
Container-code recognition system based on computer vision and deep neural networks
NASA Astrophysics Data System (ADS)
Liu, Yi; Li, Tianjian; Jiang, Li; Liang, Xiaoyao
2018-04-01
Automatic container-code recognition system becomes a crucial requirement for ship transportation industry in recent years. In this paper, an automatic container-code recognition system based on computer vision and deep neural networks is proposed. The system consists of two modules, detection module and recognition module. The detection module applies both algorithms based on computer vision and neural networks, and generates a better detection result through combination to avoid the drawbacks of the two methods. The combined detection results are also collected for online training of the neural networks. The recognition module exploits both character segmentation and end-to-end recognition, and outputs the recognition result which passes the verification. When the recognition module generates false recognition, the result will be corrected and collected for online training of the end-to-end recognition sub-module. By combining several algorithms, the system is able to deal with more situations, and the online training mechanism can improve the performance of the neural networks at runtime. The proposed system is able to achieve 93% of overall recognition accuracy.
Automatic forensic face recognition from digital images.
Peacock, C; Goode, A; Brett, A
2004-01-01
Digital image evidence is now widely available from criminal investigations and surveillance operations, often captured by security and surveillance CCTV. This has resulted in a growing demand from law enforcement agencies for automatic person-recognition based on image data. In forensic science, a fundamental requirement for such automatic face recognition is to evaluate the weight that can justifiably be attached to this recognition evidence in a scientific framework. This paper describes a pilot study carried out by the Forensic Science Service (UK) which explores the use of digital facial images in forensic investigation. For the purpose of the experiment a specific software package was chosen (Image Metrics Optasia). The paper does not describe the techniques used by the software to reach its decision of probabilistic matches to facial images, but accepts the output of the software as though it were a 'black box'. In this way, the paper lays a foundation for how face recognition systems can be compared in a forensic framework. The aim of the paper is to explore how reliably and under what conditions digital facial images can be presented in evidence.
Automatic event recognition and anomaly detection with attribute grammar by learning scene semantics
NASA Astrophysics Data System (ADS)
Qi, Lin; Yao, Zhenyu; Li, Li; Dong, Junyu
2007-11-01
In this paper we present a novel framework for automatic event recognition and abnormal behavior detection with attribute grammar by learning scene semantics. This framework combines learning scene semantics by trajectory analysis and constructing an attribute grammar-based event representation. The scene and event information is learned automatically. Abnormal behaviors that disobey scene semantics or event grammar rules are detected. By this method, an approach to understanding video scenes is achieved. Furthermore, with this prior knowledge, the accuracy of abnormal event detection is increased.
Automatic concept extraction from spoken medical reports.
Happe, André; Pouliquen, Bruno; Burgun, Anita; Cuggia, Marc; Le Beux, Pierre
2003-07-01
The objective of this project is to investigate methods whereby a combination of speech recognition and automated indexing substitutes for current transcription and indexing practices. We based our study on existing speech recognition software programs and on NOMINDEX, a tool that extracts MeSH concepts from medical text in natural language and that is mainly based on a French medical lexicon and on the UMLS. For each document, the process consists of three steps: (1) dictation and digital audio recording, (2) speech recognition, (3) automatic indexing. The evaluation consisted of a comparison between the set of concepts extracted by NOMINDEX after the speech recognition phase and the set of keywords manually extracted from the initial document. The method was evaluated on a set of 28 patient discharge summaries extracted from the MENELAS corpus in French, corresponding to in-patients admitted for coronarography. The overall precision was 73% and the overall recall was 90%. Indexing errors were mainly due to word sense ambiguity and abbreviations. A specific issue was the fact that the standard French translation of MeSH terms lacks diacritics. A preliminary evaluation of speech recognition tools showed that the rate of accurate recognition was higher than 98%. Only 3% of the indexing errors were generated by inadequate speech recognition. We discuss several areas to focus on to improve this prototype. However, the very low rate of indexing errors due to speech recognition errors highlights the potential benefits of combining speech recognition techniques and automatic indexing.
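For reference, the precision (73%) and recall (90%) reported above are set-based comparisons between the automatically extracted concepts and the manually assigned keywords. A minimal sketch of that computation is given below; the exact concept-matching rules used in the study are not stated, so exact matching is assumed here.

```python
def precision_recall(extracted, reference):
    """Set-based precision/recall between automatically extracted concepts
    and manually assigned keywords (exact matching assumed)."""
    extracted, reference = set(extracted), set(reference)
    tp = len(extracted & reference)                      # concepts found in both sets
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(reference) if reference else 0.0
    return precision, recall

# Example with made-up concept sets; per-document scores are then averaged over the corpus.
p, r = precision_recall({"coronarography", "myocardial infarction", "diabetes"},
                        {"coronarography", "myocardial infarction", "hypertension"})
```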
ERIC Educational Resources Information Center
Sauval, Karinne; Perre, Laetitia; Casalis, Séverine
2017-01-01
The present study aimed to investigate the development of automatic phonological processes involved in visual word recognition during reading acquisition in French. A visual masked priming lexical decision experiment was carried out with third, fifth graders and adult skilled readers. Three different types of partial overlap between the prime and…
ERIC Educational Resources Information Center
Fontan, Lionel; Ferrané, Isabelle; Farinas, Jérôme; Pinquier, Julien; Tardieu, Julien; Magnen, Cynthia; Gaillard, Pascal; Aumont, Xavier; Füllgrabe, Christian
2017-01-01
Purpose: The purpose of this article is to assess speech processing for listeners with simulated age-related hearing loss (ARHL) and to investigate whether the observed performance can be replicated using an automatic speech recognition (ASR) system. The long-term goal of this research is to develop a system that will assist…
ERIC Educational Resources Information Center
Saadatzi, Mohammad Nasser; Pennington, Robert C.; Welch, Karla C.; Graham, James H.; Scott, Renee E.
2017-01-01
In the current study, we examined the effects of an instructional package comprised of an autonomous pedagogical agent, automatic speech recognition, and constant time delay during the instruction of reading sight words aloud to young adults with autism spectrum disorder. We used a concurrent multiple baseline across participants design to…
ERIC Educational Resources Information Center
Wald, Mike
2006-01-01
The potential use of Automatic Speech Recognition to assist receptive communication is explored. The opportunities and challenges that this technology presents students and staff to provide captioning of speech online or in classrooms for deaf or hard of hearing students and assist blind, visually impaired or dyslexic learners to read and search…
Vocal Identity Recognition in Autism Spectrum Disorder
Lin, I-Fan; Yamada, Takashi; Komine, Yoko; Kato, Nobumasa; Kato, Masaharu; Kashino, Makio
2015-01-01
Voices can convey information about a speaker. When forming an abstract representation of a speaker, it is important to extract relevant features from acoustic signals that are invariant to the modulation of these signals. This study investigated the way in which individuals with autism spectrum disorder (ASD) recognize and memorize vocal identity. The ASD group and control group performed similarly in a task when asked to choose the name of the newly-learned speaker based on his or her voice, and the ASD group outperformed the control group in a subsequent familiarity test when asked to discriminate the previously trained voices and untrained voices. These findings suggest that individuals with ASD recognized and memorized voices as well as the neurotypical individuals did, but they categorized voices in a different way: individuals with ASD categorized voices quantitatively based on the exact acoustic features, while neurotypical individuals categorized voices qualitatively based on the acoustic patterns correlated to the speakers' physical and mental properties. PMID:26070199
Multiple Metaphors: Teaching Tense and Aspect to English-Speakers.
ERIC Educational Resources Information Center
Cody, Karen
2000-01-01
This paper proposes a synthesis of instructional methods from both traditional/explicit grammar and learner-centered/constructivist camps that also incorporates many types of metaphors (abstract, visual, and kinesthetic) in order to lead learners from declarative to proceduralized to automatized knowledge. This integrative, synthetic approach…
Paats, A; Alumäe, T; Meister, E; Fridolin, I
2018-04-30
The aim of this study was to analyze retrospectively the influence of different acoustic and language models in order to determine the most important effects on the clinical performance of an Estonian-language, non-commercial, radiology-oriented automatic speech recognition (ASR) system. An ASR system was developed for the Estonian language in the radiology domain by utilizing open-source software components (Kaldi toolkit, Thrax). The ASR system was trained with real radiology text reports and dictations collected during the development phases. The final version of the ASR system was tested by 11 radiologists who dictated 219 reports in total, in a spontaneous manner in a real clinical environment. The audio files collected in the final phase were used to measure the performance of different versions of the ASR system retrospectively. ASR system versions were evaluated by word error rate (WER) for each speaker and modality, and by the WER difference between the first and the last version of the ASR system. The total average WER across all material improved from 18.4% for the first version (v1) to 5.8% for the last version (v8), which corresponds to a relative improvement of 68.5%. The WER improvement was strongly related to modality and radiologist. In summary, the performance of the final ASR system version was close to optimal, delivering similar results across all modalities and being independent of the user, the complexity of the radiology reports, user experience, and speech characteristics.
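The evaluation metric above is word error rate (WER). A minimal sketch of the standard edit-distance computation is shown below; the Kaldi toolkit used in the study has its own scoring tools, so this is only illustrative. The reported 68.5% relative improvement follows the usual convention: (18.4 - 5.8) / 18.4 is approximately 0.685.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed with a standard edit-distance dynamic programme."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("lungs are clear", "lung is clear"))  # 2 errors / 3 words
```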
NASA Astrophysics Data System (ADS)
Costache, G. N.; Gavat, I.
2004-09-01
With the rapid growth of available digital data (text, audio samples, digital photos and digital movies, all joined in the multimedia domain), the need for classification, recognition and retrieval of this kind of data has become very important. This paper presents a system structure for handling multimedia data from a recognition perspective. The main processing steps applied to the multimedia objects of interest are: first, parameterization by analysis, to obtain a feature-based description forming the parameter vector; second, classification, generally with a hierarchical structure, to make the necessary decisions. For audio signals, both speech and music, the derived perceptual features are the mel-cepstral (MFCC) and perceptual linear predictive (PLP) coefficients. For images, the derived features are the geometric parameters of the speaker's mouth. The hierarchical classifier generally consists of a clustering stage based on Kohonen Self-Organizing Maps (SOM) and a final stage based on a powerful classification algorithm, the Support Vector Machine (SVM). The system, in specific variants, is applied with good results in two tasks: the first is bimodal speech recognition, which fuses features obtained from the speech signal with features obtained from the speaker's image; the second is music retrieval from a large music database.
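The audio branch described above derives MFCC (and PLP) features and classifies them hierarchically, with SOM clustering followed by an SVM. A minimal sketch of the MFCC-plus-SVM portion is given below using librosa and scikit-learn; the library choices, the crude utterance-level averaging, and the omission of the SOM stage are my simplifications, not the paper's pipeline.

```python
import librosa
import numpy as np
from sklearn.svm import SVC

def mfcc_features(path: str, n_mfcc: int = 13) -> np.ndarray:
    """Mean MFCC vector for one audio file (a crude utterance-level summary)."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# Hypothetical training data: lists of file paths and their class labels.
# X = np.stack([mfcc_features(p) for p in train_paths])
# clf = SVC(kernel="rbf").fit(X, train_labels)
# prediction = clf.predict(mfcc_features("query.wav").reshape(1, -1))
```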
Unvoiced Speech Recognition Using Tissue-Conductive Acoustic Sensor
NASA Astrophysics Data System (ADS)
Heracleous, Panikos; Kaino, Tomomi; Saruwatari, Hiroshi; Shikano, Kiyohiro
2006-12-01
We present the use of stethoscope and silicon NAM (nonaudible murmur) microphones in automatic speech recognition. NAM microphones are special acoustic sensors, which are attached behind the talker's ear and can capture not only normal (audible) speech, but also very quietly uttered speech (nonaudible murmur). As a result, NAM microphones can be applied in automatic speech recognition systems when privacy is desired in human-machine communication. Moreover, NAM microphones are robust against noise and might be used in special systems (speech recognition, speech transformation, etc.) for sound-impaired people. Using adaptation techniques and a small amount of training data, we achieved, for a 20 k dictation task, a word accuracy (the exact figure is given as an inline equation in the original) for nonaudible murmur recognition in a clean environment. In this paper, we also investigate nonaudible murmur recognition in noisy environments and the effect of the Lombard reflex on nonaudible murmur recognition. We also propose three methods to integrate audible speech and nonaudible murmur recognition using a stethoscope NAM microphone, with very promising results.
Effect of Vowel Context on the Recognition of Initial Consonants in Kannada.
Kalaiah, Mohan Kumar; Bhat, Jayashree S
2017-09-01
The present study was carried out to investigate the effect of vowel context on the recognition of Kannada consonants in quiet for young adults. A total of 17 young adults with normal hearing in both ears participated in the study. The stimuli included consonant-vowel syllables, spoken by 12 native speakers of Kannada. Consonant recognition task was carried out as a closed-set (fourteen-alternative forced-choice). The present study showed an effect of vowel context on the perception of consonants. Maximum consonant recognition score was obtained in the /o/ vowel context, followed by the /a/ and /u/ vowel contexts, and then the /e/ context. Poorest consonant recognition score was obtained in the vowel context /i/. Vowel context has an effect on the recognition of Kannada consonants, and the vowel effect was unique for Kannada consonants.
Benefits for Voice Learning Caused by Concurrent Faces Develop over Time.
Zäske, Romi; Mühl, Constanze; Schweinberger, Stefan R
2015-01-01
Recognition of personally familiar voices benefits from the concurrent presentation of the corresponding speakers' faces. This effect of audiovisual integration is most pronounced for voices combined with dynamic articulating faces. However, it is unclear if learning unfamiliar voices also benefits from audiovisual face-voice integration or, alternatively, is hampered by attentional capture of faces, i.e., "face-overshadowing". In six study-test cycles we compared the recognition of newly-learned voices following unimodal voice learning vs. bimodal face-voice learning with either static (Exp. 1) or dynamic articulating faces (Exp. 2). Voice recognition accuracies significantly increased for bimodal learning across study-test cycles while remaining stable for unimodal learning, as reflected in numerical costs of bimodal relative to unimodal voice learning in the first two study-test cycles and benefits in the last two cycles. This was independent of whether faces were static images (Exp. 1) or dynamic videos (Exp. 2). In both experiments, slower reaction times to voices previously studied with faces compared to voices only may result from visual search for faces during memory retrieval. A general decrease of reaction times across study-test cycles suggests facilitated recognition with more speaker repetitions. Overall, our data suggest two simultaneous and opposing mechanisms during bimodal face-voice learning: while attentional capture of faces may initially impede voice learning, audiovisual integration may facilitate it thereafter.
PechaKucha Presentations: Teaching Storytelling, Visual Design, and Conciseness
ERIC Educational Resources Information Center
Lucas, Kristen; Rawlins, Jacob D.
2015-01-01
When speakers rely too heavily on presentation software templates, they often end up stultifying audiences with a triple-whammy of bullet points. In this article, Lucas and Rawlins present an alternative method--PechaKucha (the Japanese word for "chit chat")--a presentation style driven by a carefully planned, automatically timed…
Acoustic Analysis of Voice in Dysarthria following Stroke
ERIC Educational Resources Information Center
Wang, Yu-Tsai; Kent, Ray D.; Kent, Jane Finley; Duffy, Joseph R.; Thomas, Jack E.
2009-01-01
Although perceptual studies indicate the likelihood of voice disorders in persons with stroke, there have been few objective instrumental studies of voice dysfunction in dysarthria following stroke. This study reports automatic analysis of sustained vowel phonation for 61 speakers with stroke. The results show: (1) men with stroke and healthy…
Second Language Writing Classification System Based on Word-Alignment Distribution
ERIC Educational Resources Information Center
Kotani, Katsunori; Yoshimi, Takehiko
2010-01-01
The present paper introduces an automatic classification system for assisting second language (L2) writing evaluation. This system, which classifies sentences written by L2 learners as either native speaker-like or learner-like sentences, is constructed by machine learning algorithms using word-alignment distributions as classification features…
Computational Modeling of Emotions and Affect in Social-Cultural Interaction
2013-10-02
acoustic and textual information sources. Second, a cross-lingual study was performed that shed light on how human perception and automatic recognition ... speech is produced, a speaker's pitch and intonational pattern, and word usage. Better feature representation and advanced approaches were used to ... recognition performance, and improved our understanding of language/cultural impact on human perception of emotion and automatic classification.
How Captain Amerika uses neural networks to fight crime
NASA Technical Reports Server (NTRS)
Rogers, Steven K.; Kabrisky, Matthew; Ruck, Dennis W.; Oxley, Mark E.
1994-01-01
Artificial neural network models can make amazing computations. These models are explained along with their application in problems associated with fighting crime. Specific problems addressed are identification of people using face recognition, speaker identification, and fingerprint and handwriting analysis (biometric authentication).
On the Time Course of Vocal Emotion Recognition
Pell, Marc D.; Kotz, Sonja A.
2011-01-01
How quickly do listeners recognize emotions from a speaker's voice, and does the time course for recognition vary by emotion type? To address these questions, we adapted the auditory gating paradigm to estimate how much vocal information is needed for listeners to categorize five basic emotions (anger, disgust, fear, sadness, happiness) and neutral utterances produced by male and female speakers of English. Semantically-anomalous pseudo-utterances (e.g., The rivix jolled the silling) conveying each emotion were divided into seven gate intervals according to the number of syllables that listeners heard from sentence onset. Participants (n = 48) judged the emotional meaning of stimuli presented at each gate duration interval, in a successive, blocked presentation format. Analyses looked at how recognition of each emotion evolves as an utterance unfolds and estimated the “identification point” for each emotion. Results showed that anger, sadness, fear, and neutral expressions are recognized more accurately at short gate intervals than happiness, and particularly disgust; however, as speech unfolds, recognition of happiness improves significantly towards the end of the utterance (and fear is recognized more accurately than other emotions). When the gate associated with the emotion identification point of each stimulus was calculated, data indicated that fear (M = 517 ms), sadness (M = 576 ms), and neutral (M = 510 ms) expressions were identified from shorter acoustic events than the other emotions. These data reveal differences in the underlying time course for conscious recognition of basic emotions from vocal expressions, which should be accounted for in studies of emotional speech processing. PMID:22087275
Lu, Lingxi; Bao, Xiaohan; Chen, Jing; Qu, Tianshu; Wu, Xihong; Li, Liang
2018-05-01
Under a noisy "cocktail-party" listening condition with multiple people talking, listeners can use various perceptual/cognitive unmasking cues to improve recognition of the target speech against informational speech-on-speech masking. One potential unmasking cue is the emotion expressed in a speech voice, by means of certain acoustical features. However, it was unclear whether emotionally conditioning a target-speech voice that has none of the typical acoustical features of emotions (i.e., an emotionally neutral voice) can be used by listeners for enhancing target-speech recognition under speech-on-speech masking conditions. In this study we examined the recognition of target speech against a two-talker speech masker both before and after the emotionally neutral target voice was paired with a loud female screaming sound that has a marked negative emotional valence. The results showed that recognition of the target speech (especially the first keyword in a target sentence) was significantly improved by emotionally conditioning the target speaker's voice. Moreover, the emotional unmasking effect was independent of the unmasking effect of the perceived spatial separation between the target speech and the masker. Also, (skin conductance) electrodermal responses became stronger after emotional learning when the target speech and masker were perceptually co-located, suggesting an increase of listening efforts when the target speech was informationally masked. These results indicate that emotionally conditioning the target speaker's voice does not change the acoustical parameters of the target-speech stimuli, but the emotionally conditioned vocal features can be used as cues for unmasking target speech.
Computer Recognition of Facial Profiles
1974-08-01
A system for the recognition of human faces from ... (report sections include Classification Algorithms; Facial Recognition and Automatic Training; Facial Profile Recognition) ... provide a fair test of the classification system. The work of Goldstein, Harmon, and Lesk [8] indicates, however, that for facial recognition, a ten class ...
Automatic Mexican sign language and digits recognition using normalized central moments
NASA Astrophysics Data System (ADS)
Solís, Francisco; Martínez, David; Espinosa, Oscar; Toxqui, Carina
2016-09-01
This work presents a framework for automatic Mexican sign language and digit recognition based on a computer vision system using normalized central moments and artificial neural networks. Images are captured with a digital IP camera, four LED reflectors, and a green background in order to reduce computational cost and avoid the use of special gloves. 42 normalized central moments are computed per frame and used in a Multi-Layer Perceptron to recognize each database. Four versions per sign and digit were used in the training phase. Recognition rates of 93% and 95% were achieved for Mexican sign language and digits, respectively.
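The features above are normalized central moments of each frame, classified with a multi-layer perceptron. A minimal sketch using OpenCV's moment computation is shown below; OpenCV exposes seven normalized central moments per image, so the paper's 42-feature set presumably combines moments over several regions or orders, which is an assumption on my part.

```python
import cv2
import numpy as np

def normalized_central_moments(gray_frame: np.ndarray) -> np.ndarray:
    """Binarise an 8-bit grayscale frame and return OpenCV's normalised
    central moments (nu_pq) as a feature vector."""
    _, binary = cv2.threshold(gray_frame, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    m = cv2.moments(binary, binaryImage=True)
    keys = ["nu20", "nu11", "nu02", "nu30", "nu21", "nu12", "nu03"]
    return np.array([m[k] for k in keys])

# The resulting vectors can then be fed to a multi-layer perceptron,
# e.g. sklearn.neural_network.MLPClassifier, as in the paper's setup.
```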
Cross spectral, active and passive approach to face recognition for improved performance
NASA Astrophysics Data System (ADS)
Grudzien, A.; Kowalski, M.; Szustakowski, M.
2017-08-01
Biometrics is a technique for the automatic recognition of a person based on physiological or behavioral characteristics. Since the characteristics used are unique, biometrics can create a direct link between a person and an identity. The human face is one of the most important biometric modalities for automatic authentication. The most popular face recognition methods, which rely on processing visual-range information alone, remain imperfect. Thermal infrared imagery may be a promising alternative or complement to visible-range imaging for several reasons. This paper presents an approach that combines both methods.
Improving language models for radiology speech recognition.
Paulett, John M; Langlotz, Curtis P
2009-02-01
Speech recognition systems have become increasingly popular as a means to produce radiology reports, for reasons both of efficiency and of cost. However, the suboptimal recognition accuracy of these systems can affect the productivity of the radiologists creating the text reports. We analyzed a database of over two million de-identified radiology reports to determine the strongest determinants of word frequency. Our results showed that body site and imaging modality had a similar influence on the frequency of words and of three-word phrases as did the identity of the speaker. These findings suggest that the accuracy of speech recognition systems could be significantly enhanced by further tailoring their language models to body site and imaging modality, which are readily available at the time of report creation.
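The conclusion above suggests tailoring the recognizer's language model to body site and imaging modality. The toy sketch below illustrates the idea with per-context unigram counts interpolated with a global model; real radiology speech recognizers use much richer n-gram or neural language models, so this is purely illustrative and all names are mine.

```python
from collections import Counter, defaultdict

class ConditionalUnigramLM:
    """Toy language model with one unigram distribution per
    (body_site, modality) context and a global back-off."""

    def __init__(self):
        self.by_context = defaultdict(Counter)
        self.global_counts = Counter()

    def train(self, report_words, body_site, modality):
        # Accumulate counts for the specific context and for the global model.
        self.by_context[(body_site, modality)].update(report_words)
        self.global_counts.update(report_words)

    def prob(self, word, body_site, modality, lam=0.7):
        ctx = self.by_context[(body_site, modality)]
        p_ctx = ctx[word] / sum(ctx.values()) if ctx else 0.0
        p_glob = self.global_counts[word] / max(sum(self.global_counts.values()), 1)
        return lam * p_ctx + (1 - lam) * p_glob   # simple linear interpolation
```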
Speech to Text Translation for Malay Language
NASA Astrophysics Data System (ADS)
Al-khulaidi, Rami Ali; Akmeliawati, Rini
2017-11-01
A speech recognition system is a front-end and back-end process that receives an audio signal uttered by a speaker and converts it into a text transcription. Such systems can be used in several fields, including therapeutic technology, education, social robotics and computer entertainment. Control tasks, which are the purpose of the proposed system, place particular demands on speed of performance and response, since the system should integrate with other control platforms such as voice-controlled robots. This creates a need for flexible platforms that can easily be adapted to the functionality of their surroundings, unlike software such as MATLAB and Phoenix that requires recorded audio and repeated training for every entry. In this paper, a speech recognition system for the Malay language is implemented using Microsoft Visual Studio C#. Ninety (90) Malay phrases were tested by ten (10) speakers of both genders in different contexts. The results show that the overall accuracy (calculated from the confusion matrix) is a satisfactory 92.69%.
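The 92.69% figure above is the overall accuracy computed from a confusion matrix, i.e., correctly recognized phrases over all trials. A minimal sketch of that computation is shown below, with a made-up 3x3 example rather than the study's 90-phrase matrix.

```python
import numpy as np

def overall_accuracy(confusion: np.ndarray) -> float:
    """Overall accuracy from a square confusion matrix:
    correctly recognised items (the diagonal) over all trials."""
    return float(np.trace(confusion) / confusion.sum())

# Illustrative example: rows = spoken phrase, columns = recognised phrase.
cm = np.array([[9, 1, 0],
               [0, 10, 0],
               [1, 0, 9]])
print(overall_accuracy(cm))  # 28 / 30 = 0.9333...
```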
Mark My Words: Tone of Voice Changes Affective Word Representations in Memory
Schirmer, Annett
2010-01-01
The present study explored the effect of speaker prosody on the representation of words in memory. To this end, participants were presented with a series of words and asked to remember the words for a subsequent recognition test. During study, words were presented auditorily with an emotional or neutral prosody, whereas during test, words were presented visually. Recognition performance was comparable for words studied with emotional and neutral prosody. However, subsequent valence ratings indicated that study prosody changed the affective representation of words in memory. Compared to words with neutral prosody, words with sad prosody were later rated as more negative and words with happy prosody were later rated as more positive. Interestingly, the participants' ability to remember study prosody failed to predict this effect, suggesting that changes in word valence were implicit and associated with initial word processing rather than word retrieval. Taken together these results identify a mechanism by which speakers can have sustained effects on listener attitudes towards word referents. PMID:20169154
Automatic Speech Recognition from Neural Signals: A Focused Review.
Herff, Christian; Schultz, Tanja
2016-01-01
Speech interfaces have become widely accepted and are nowadays integrated in various real-life applications and devices. They have become a part of our daily life. However, speech interfaces presume the ability to produce intelligible speech, which might be impossible due to loud environments, the need not to disturb bystanders, or an inability to produce speech (i.e., patients suffering from locked-in syndrome). For these reasons it would be highly desirable not to speak but simply to envision oneself saying words or sentences. Interfaces based on imagined speech would enable fast and natural communication without the need for audible speech and would give a voice to otherwise mute people. This focused review analyzes the potential of different brain imaging techniques to recognize speech from neural signals by applying Automatic Speech Recognition technology. We argue that modalities based on metabolic processes, such as functional Near Infrared Spectroscopy and functional Magnetic Resonance Imaging, are less suited for Automatic Speech Recognition from neural signals due to their low temporal resolution, but are very useful for investigating the underlying neural mechanisms involved in speech processes. In contrast, electrophysiologic activity is fast enough to capture speech processes and is therefore better suited for ASR. Our experimental results indicate the potential of these signals for speech recognition from neural data, with a focus on invasively measured brain activity (electrocorticography). As a first example of Automatic Speech Recognition techniques used with neural signals, we discuss the Brain-to-text system.
Quest Hierarchy for Hyperspectral Face Recognition
2011-03-01
numerous face recognition algorithms available, several very good literature surveys are available that include Abate [29], Samal [110], Kong [18], Zou ... Perception, Japan (January 1994). [110] Samal, Ashok, and P. Iyengar, "Automatic Recognition and Analysis of Human Faces and Facial Expressions: A Survey."
Constraints on the Transfer of Perceptual Learning in Accented Speech
Eisner, Frank; Melinger, Alissa; Weber, Andrea
2013-01-01
The perception of speech sounds can be re-tuned through a mechanism of lexically driven perceptual learning after exposure to instances of atypical speech production. This study asked whether this re-tuning is sensitive to the position of the atypical sound within the word. We investigated perceptual learning using English voiced stop consonants, which are commonly devoiced in word-final position by Dutch learners of English. After exposure to a Dutch learner’s productions of devoiced stops in word-final position (but not in any other positions), British English (BE) listeners showed evidence of perceptual learning in a subsequent cross-modal priming task, where auditory primes with devoiced final stops (e.g., “seed”, pronounced [si:th]), facilitated recognition of visual targets with voiced final stops (e.g., SEED). In Experiment 1, this learning effect generalized to test pairs where the critical contrast was in word-initial position, e.g., auditory primes such as “town” facilitated recognition of visual targets like DOWN. Control listeners, who had not heard any stops by the speaker during exposure, showed no learning effects. The generalization to word-initial position did not occur when participants had also heard correctly voiced, word-initial stops during exposure (Experiment 2), and when the speaker was a native BE speaker who mimicked the word-final devoicing (Experiment 3). The readiness of the perceptual system to generalize a previously learned adjustment to other positions within the word thus appears to be modulated by distributional properties of the speech input, as well as by the perceived sociophonetic characteristics of the speaker. The results suggest that the transfer of pre-lexical perceptual adjustments that occur through lexically driven learning can be affected by a combination of acoustic, phonological, and sociophonetic factors. PMID:23554598
The Suitability of Cloud-Based Speech Recognition Engines for Language Learning
ERIC Educational Resources Information Center
Daniels, Paul; Iwago, Koji
2017-01-01
As online automatic speech recognition (ASR) engines become more accurate and more widely implemented with call software, it becomes important to evaluate the effectiveness and the accuracy of these recognition engines using authentic speech samples. This study investigates two of the most prominent cloud-based speech recognition engines--Apple's…
The Roles of Suprasegmental Features in Predicting English Oral Proficiency with an Automated System
ERIC Educational Resources Information Center
Kang, Okim; Johnson, David
2018-01-01
Suprasegmental features have received growing attention in the field of oral assessment. In this article we describe a set of computer algorithms that automatically scores the oral proficiency of non-native speakers using unconstrained English speech. The algorithms employ machine learning and 11 suprasegmental measures divided into four groups…
ERIC Educational Resources Information Center
LaCelle-Peterson, Mark W.; Rivera, Charlene
1994-01-01
Educational reforms will not automatically have the same effects for native English speakers and English language learners (ELLs). Equitable assessment for ELLs must consider equity issues of assessment technologies, provide information on ELLs' developing language abilities and content-area achievement, and be comprehensive, flexible, progress…
The influence of tone inventory on ERP without focal attention: a cross-language study.
Zheng, Hong-Ying; Peng, Gang; Chen, Jian-Yong; Zhang, Caicai; Minett, James W; Wang, William S-Y
2014-01-01
This study investigates the effect of tone inventories on brain activities underlying pitch without focal attention. We find that the electrophysiological responses to across-category stimuli are larger than those to within-category stimuli when the pitch contours are superimposed on nonspeech stimuli; however, there is no electrophysiological response difference associated with category status in speech stimuli. Moreover, this category effect in nonspeech stimuli is stronger for Cantonese speakers. Results of previous and present studies lead us to conclude that brain activities to the same native lexical tone contrasts are modulated by speakers' language experiences not only in active phonological processing but also in automatic feature detection without focal attention. In contrast to the condition with focal attention, where phonological processing is stronger for speech stimuli, the feature detection (pitch contours in this study) without focal attention as shaped by language background is superior in relatively regular stimuli, that is, the nonspeech stimuli. The results suggest that Cantonese listeners outperform Mandarin listeners in automatic detection of pitch features because of the denser Cantonese tone system.
Can a CNN recognize Catalan diet?
NASA Astrophysics Data System (ADS)
Herruzo, P.; Bolaños, M.; Radeva, P.
2016-10-01
Nowadays, we can find several diseases related to the unhealthy diet habits of the population, such as diabetes, obesity, anemia, bulimia and anorexia. In many cases, these diseases are related to people's food consumption. The Mediterranean diet is scientifically known as a healthy diet that helps to prevent many metabolic diseases. In particular, our work focuses on the recognition of Mediterranean food and dishes. The development of this methodology would make it possible to analyse the daily habits of users with wearable cameras, within the topic of lifelogging. By using automatic mechanisms we could build an objective tool for the analysis of the patient's behavior, allowing specialists to discover unhealthy food patterns and understand the user's lifestyle. With the aim of automatically recognizing a complete diet, we introduce a challenging multi-labeled dataset related to the Mediterranean diet called FoodCAT. The first type of label provided consists of 115 food classes with an average of 400 images per dish, and the second one consists of 12 food categories with an average of 3800 pictures per class. This dataset will serve as a basis for the development of automatic diet recognition. In this context, deep learning and, more specifically, Convolutional Neural Networks (CNNs) are currently the state-of-the-art methods for automatic food recognition. In our work, we compare several architectures for image classification, with the purpose of diet recognition. Applying the best model for recognising food categories, we achieve a top-1 accuracy of 72.29% and a top-5 accuracy of 97.07%. For complete recognition of dishes from the Mediterranean diet, enlarged with the Food-101 dataset of international dishes, we achieve a top-1 accuracy of 68.07% and a top-5 accuracy of 89.53%, for a total of 115+101 food classes.
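The top-1 and top-5 figures above are standard image-classification metrics: a prediction counts as correct if the true class is the highest-scoring class (top-1) or among the five highest-scoring classes (top-5). A minimal sketch of the computation is shown below; it is illustrative and not tied to the FoodCAT code.

```python
import numpy as np

def top_k_accuracy(scores: np.ndarray, labels: np.ndarray, k: int) -> float:
    """scores: (n_samples, n_classes) class scores; labels: true class ids.
    A sample is correct if its true class is among the k highest scores."""
    topk = np.argsort(scores, axis=1)[:, -k:]          # indices of the k best classes
    hits = [labels[i] in topk[i] for i in range(len(labels))]
    return float(np.mean(hits))

# Example: top_k_accuracy(model_scores, true_labels, k=5) gives the top-5 accuracy.
```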
Estes, Zachary; Adelman, James S
2008-08-01
An automatic vigilance hypothesis states that humans preferentially attend to negative stimuli, and this attention to negative valence disrupts the processing of other stimulus properties. Thus, negative words typically elicit slower color naming, word naming, and lexical decisions than neutral or positive words. Larsen, Mercer, and Balota analyzed the stimuli from 32 published studies, and they found that word valence was confounded with several lexical factors known to affect word recognition. Indeed, with these lexical factors covaried out, Larsen et al. found no evidence of automatic vigilance. The authors report a more sensitive analysis of 1011 words. Results revealed a small but reliable valence effect, such that negative words (e.g., "shark") elicit slower lexical decisions and naming than positive words (e.g., "beach"). Moreover, the relation between valence and recognition was categorical rather than linear; the extremity of a word's valence did not affect its recognition. This valence effect was not attributable to word length, frequency, orthographic neighborhood size, contextual diversity, first phoneme, or arousal. Thus, the present analysis provides the most powerful demonstration of automatic vigilance to date.
Automatic Recognition of Phonemes Using a Syntactic Processor for Error Correction.
1980-12-01
AUTOMATIC RECOGNITION OF PHONEMES USING A SYNTACTIC PROCESSOR FOR ERROR CORRECTION. Thesis, AFIT/GE/EE/80D-45, Robert B. Taylor, 2Lt, USAF. Approved for public release; distribution unlimited. Table-of-contents excerpt: Testing; Bayes Decision Rule for Minimum Error; Bayes Decision Rule for Minimum Risk; Minimax Test.
Action Unit Models of Facial Expression of Emotion in the Presence of Speech
Shah, Miraj; Cooper, David G.; Cao, Houwei; Gur, Ruben C.; Nenkova, Ani; Verma, Ragini
2014-01-01
Automatic recognition of emotion using facial expressions in the presence of speech poses a unique challenge because talking reveals clues for the affective state of the speaker but distorts the canonical expression of emotion on the face. We introduce a corpus of acted emotion expression where speech is either present (talking) or absent (silent). The corpus is uniquely suited for analysis of the interplay between the two conditions. We use a multimodal decision level fusion classifier to combine models of emotion from talking and silent faces as well as from audio to recognize five basic emotions: anger, disgust, fear, happy and sad. Our results strongly indicate that emotion prediction in the presence of speech from action unit facial features is less accurate when the person is talking. Modeling talking and silent expressions separately and fusing the two models greatly improves accuracy of prediction in the talking setting. The advantages are most pronounced when silent and talking face models are fused with predictions from audio features. In this multi-modal prediction both the combination of modalities and the separate models of talking and silent facial expression of emotion contribute to the improvement. PMID:25525561
Free Field Word recognition test in the presence of noise in normal hearing adults.
Almeida, Gleide Viviani Maciel; Ribas, Angela; Calleros, Jorge
In ideal listening situations, subjects with normal hearing can easily understand speech, as can many subjects who have a hearing loss. The aim was to present the validation of the Word Recognition Test in a Free Field in the Presence of Noise in normal-hearing adults. The sample consisted of 100 healthy adults over 18 years of age with normal hearing. After pure-tone audiometry, a speech recognition test was applied in a free-field condition with monosyllables and disyllables, using standardized material in three listening situations: an optimal listening condition (no noise), a signal-to-noise ratio of 0 dB, and a signal-to-noise ratio of -10 dB. For these tests, a calibrated free-field environment was arranged in which speech was presented to the subject from two loudspeakers located at 45°, and noise from a third loudspeaker located at 180°. All participants had free-field speech audiometry results between 88% and 100% in the three listening situations. The Word Recognition Test in Free Field in the Presence of Noise proved easy to organize and apply. The results of the test validation suggest that individuals with normal hearing should get between 88% and 100% of the stimuli correct. The test can be an important tool for measuring noise interference with speech perception abilities.
NASA Astrophysics Data System (ADS)
Iqbal, Asim; Farooq, Umar; Mahmood, Hassan; Asad, Muhammad Usman; Khan, Akrama; Atiq, Hafiz Muhammad
2010-02-01
A self-teaching system based on image processing and voice recognition is developed to educate visually impaired children, chiefly in their primary education. The system comprises a computer, a vision camera, an ear speaker and a microphone. The camera, attached to the computer, is mounted on the ceiling at the required angle opposite the desk on which the book is placed. Sample images and voices, in the form of instructions and commands for English and Urdu alphabets, numeric digits, operators and shapes, are stored in a database. A blind child first reads the embossed character (object) with the fingers and then speaks the answer (the name of the character, shape, etc.) into the microphone. When the child's voice command is received by the microphone, an image is taken by the camera and processed by a MATLAB® program developed with the Image Acquisition and Image Processing toolboxes, which generates a response or the required set of instructions to the child via the ear speaker, enabling self-education of a visually impaired child. A speech recognition program is also developed in MATLAB® with the Data Acquisition and Signal Processing toolboxes, which records and processes the child's commands.
NASA Technical Reports Server (NTRS)
Tescher, Andrew G. (Editor)
1989-01-01
Various papers on image compression and automatic target recognition are presented. Individual topics addressed include: target cluster detection in cluttered SAR imagery, model-based target recognition using laser radar imagery, Smart Sensor front-end processor for feature extraction of images, object attitude estimation and tracking from a single video sensor, symmetry detection in human vision, analysis of high resolution aerial images for object detection, obscured object recognition for an ATR application, neural networks for adaptive shape tracking, statistical mechanics and pattern recognition, detection of cylinders in aerial range images, moving object tracking using local windows, new transform method for image data compression, quad-tree product vector quantization of images, predictive trellis encoding of imagery, reduced generalized chain code for contour description, compact architecture for a real-time vision system, use of human visibility functions in segmentation coding, color texture analysis and synthesis using Gibbs random fields.
NASA Astrophysics Data System (ADS)
Zhang, Shijun; Jing, Zhongliang; Li, Jianxun
2005-01-01
Rotation-invariant features of the target are obtained using the multi-direction feature extraction property of the steerable filter. By combining the morphological top-hat transform with a self-organizing feature map neural network, an adaptive topological region is selected, and shrinkage of the topological region is achieved using the erosion operation. The steerable-filter-based morphological self-organizing feature map neural network is applied to automatic target recognition of binary standard patterns and real-world infrared image sequences. Compared with the Hamming network and with morphological shared-weight networks, the proposed method achieves a higher correct recognition rate, robust adaptability, quick training, and better generalization.
Speaker-independent phoneme recognition with a binaural auditory image model
NASA Astrophysics Data System (ADS)
Francis, Keith Ivan
1997-09-01
This dissertation presents phoneme recognition techniques based on a binaural fusion of outputs of the auditory image model and subsequent azimuth-selective phoneme recognition in a noisy environment. Background information concerning speech variations, phoneme recognition, current binaural fusion techniques and auditory modeling issues is explained. The research is constrained to sources in the frontal azimuthal plane of a simulated listener. A new method based on coincidence detection of neural activity patterns from the auditory image model of Patterson is used for azimuth-selective phoneme recognition. The method is tested in various levels of noise and the results are reported in contrast to binaural fusion methods based on various forms of correlation to demonstrate the potential of coincidence- based binaural phoneme recognition. This method overcomes smearing of fine speech detail typical of correlation based methods. Nevertheless, coincidence is able to measure similarity of left and right inputs and fuse them into useful feature vectors for phoneme recognition in noise.
Speech recognition for embedded automatic positioner for laparoscope
NASA Astrophysics Data System (ADS)
Chen, Xiaodong; Yin, Qingyun; Wang, Yi; Yu, Daoyin
2014-07-01
In this paper a novel speech recognition methodology based on the Hidden Markov Model (HMM) is proposed for an embedded Automatic Positioner for Laparoscope (APL), which has a fixed-point ARM processor as its core. The APL system is designed to assist the doctor in laparoscopic surgery by giving the doctor vocal control of the laparoscope. Real-time response to voice commands calls for a more efficient speech recognition algorithm for the APL. In order to reduce computation cost without significant loss in recognition accuracy, both arithmetic and algorithmic optimizations are applied in the presented method. First, relying mostly on arithmetic optimizations, a fixed-point front end for speech feature analysis is built to suit the ARM processor's characteristics. Then a fast likelihood computation algorithm is used to reduce the computational complexity of the HMM-based recognition algorithm. The experimental results show that the method keeps the recognition time within 0.5 s while maintaining an accuracy higher than 99%, demonstrating its ability to provide real-time vocal control of the APL.
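The abstract mentions a fast likelihood computation for the HMM-based recognizer but does not spell it out. One common embedded-ASR speed-up, shown below as an assumption rather than the paper's actual method, is to precompute the constant terms of diagonal-covariance Gaussian log-likelihoods so that scoring a frame reduces to a few multiply-accumulate operations per state.

```python
import numpy as np

class DiagGaussianScorer:
    """Log-likelihoods of diagonal-covariance Gaussians with the constant
    terms precomputed once (a common speed-up in embedded HMM decoders)."""

    def __init__(self, means, variances):
        self.means = np.asarray(means)               # (n_states, dim)
        self.inv_var = 1.0 / np.asarray(variances)   # (n_states, dim)
        # -0.5 * sum_d log(2 * pi * sigma_d^2), computed once per state.
        self.const = -0.5 * np.log(2 * np.pi / self.inv_var).sum(axis=1)

    def log_likelihoods(self, frame):
        """Per-state log-likelihoods of one feature frame, shape (n_states,)."""
        diff = frame - self.means
        return self.const - 0.5 * np.sum(diff * diff * self.inv_var, axis=1)
```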
Automatic recognition of ship types from infrared images using superstructure moment invariants
NASA Astrophysics Data System (ADS)
Li, Heng; Wang, Xinyu
2007-11-01
Automatic object recognition is an active area of interest for military and commercial applications. In this paper, a system for the autonomous recognition of ship types in infrared images is proposed. First, a segmentation approach based on detecting salient features of the target, with subsequent shadow removal, is proposed as the basis for the subsequent object recognition. Since the differences between the shapes of various ships lie mainly in their superstructures, we then use superstructure moment functions that are invariant to translation, rotation and scale differences in the input patterns, and develop a robust algorithm for extracting the ship superstructure. A back-propagation neural network is then used as the classifier in the recognition stage, with projection images of simulated three-dimensional ship models as the training sets. Our recognition model was implemented and experimentally validated using both simulated three-dimensional ship model images and real images derived from video of an AN/AAS-44V Forward Looking Infrared (FLIR) sensor.
Translation Ambiguity but Not Word Class Predicts Translation Performance
ERIC Educational Resources Information Center
Prior, Anat; Kroll, Judith F.; Macwhinney, Brian
2013-01-01
We investigated the influence of word class and translation ambiguity on cross-linguistic representation and processing. Bilingual speakers of English and Spanish performed translation production and translation recognition tasks on nouns and verbs in both languages. Words either had a single translation or more than one translation. Translation…
Scenario-Based Spoken Interaction with Virtual Agents
ERIC Educational Resources Information Center
Morton, Hazel; Jack, Mervyn A.
2005-01-01
This paper describes a CALL approach which integrates software for speaker independent continuous speech recognition with embodied virtual agents and virtual worlds to create an immersive environment in which learners can converse in the target language in contextualised scenarios. The result is a self-access learning package: SPELL (Spoken…
Inferring Speaker Affect in Spoken Natural Language Communication
ERIC Educational Resources Information Center
Pon-Barry, Heather Roberta
2013-01-01
The field of spoken language processing is concerned with creating computer programs that can understand human speech and produce human-like speech. Regarding the problem of understanding human speech, there is currently growing interest in moving beyond speech recognition (the task of transcribing the words in an audio stream) and towards…
Understanding Cognitive Development: Automaticity and the Early Years Child
ERIC Educational Resources Information Center
Gray, Colette
2004-01-01
In recent years a growing body of evidence has implicated deficits in the automaticity of fundamental facts such as word and number recognition in a range of disorders: including attention deficit hyperactivity disorder, dyslexia, apraxia and autism. Variously described as habits, fluency, chunking and over learning, automatic processes are best…
NASA Astrophysics Data System (ADS)
O'Sullivan, James; Chen, Zhuo; Herrero, Jose; McKhann, Guy M.; Sheth, Sameer A.; Mehta, Ashesh D.; Mesgarani, Nima
2017-10-01
Objective. People who suffer from hearing impairments can find it difficult to follow a conversation in a multi-speaker environment. Current hearing aids can suppress background noise; however, there is little that can be done to help a user attend to a single conversation amongst many without knowing which speaker the user is attending to. Cognitively controlled hearing aids that use auditory attention decoding (AAD) methods are the next step in offering help. Translating the successes in AAD research to real-world applications poses a number of challenges, including the lack of access to the clean sound sources in the environment with which to compare with the neural signals. We propose a novel framework that combines single-channel speech separation algorithms with AAD. Approach. We present an end-to-end system that (1) receives a single audio channel containing a mixture of speakers that is heard by a listener along with the listener’s neural signals, (2) automatically separates the individual speakers in the mixture, (3) determines the attended speaker, and (4) amplifies the attended speaker’s voice to assist the listener. Main results. Using invasive electrophysiology recordings, we identified the regions of the auditory cortex that contribute to AAD. Given appropriate electrode locations, our system is able to decode the attention of subjects and amplify the attended speaker using only the mixed audio. Our quality assessment of the modified audio demonstrates a significant improvement in both subjective and objective speech quality measures. Significance. Our novel framework for AAD bridges the gap between the most recent advancements in speech processing technologies and speech prosthesis research and moves us closer to the development of cognitively controlled hearable devices for the hearing impaired.
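Step (3) above, determining the attended speaker, is commonly done by reconstructing an estimate of the attended speech envelope from the neural recordings and correlating it with the envelope of each automatically separated speaker. The selection step is sketched below; the envelope-reconstruction model itself is assumed to exist and is not shown, and the paper's exact decoder may differ.

```python
import numpy as np

def attended_speaker(reconstructed_env: np.ndarray,
                     speaker_envelopes: list) -> int:
    """Pick the separated speaker whose envelope correlates best with the
    envelope reconstructed from the listener's neural signals."""
    corrs = [np.corrcoef(reconstructed_env, env)[0, 1]
             for env in speaker_envelopes]
    return int(np.argmax(corrs))

# The winning speaker's separated waveform would then be amplified in the mix.
```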
Evaluating deep learning architectures for Speech Emotion Recognition.
Fayek, Haytham M; Lech, Margaret; Cavedon, Lawrence
2017-08-01
Speech Emotion Recognition (SER) can be regarded as a static or dynamic classification problem, which makes SER an excellent test bed for investigating and comparing various deep learning architectures. We describe a frame-based formulation to SER that relies on minimal speech processing and end-to-end deep learning to model intra-utterance dynamics. We use the proposed SER system to empirically explore feed-forward and recurrent neural network architectures and their variants. Experiments conducted illuminate the advantages and limitations of these architectures in paralinguistic speech recognition and emotion recognition in particular. As a result of our exploration, we report state-of-the-art results on the IEMOCAP database for speaker-independent SER and present quantitative and qualitative assessments of the models' performances.
Definition and automatic anatomy recognition of lymph node zones in the pelvis on CT images
NASA Astrophysics Data System (ADS)
Liu, Yu; Udupa, Jayaram K.; Odhner, Dewey; Tong, Yubing; Guo, Shuxu; Attor, Rosemary; Reinicke, Danica; Torigian, Drew A.
2016-03-01
Currently, unlike the IASLC-defined thoracic lymph node zones, no explicitly provided definitions of lymph node zones in other body regions are available. Yet, such definitions are critical for standardizing the recognition, delineation, quantification, and reporting of lymphadenopathy in other body regions. Continuing from our previous work in the thorax, this paper proposes a standardized definition of the grouping of pelvic lymph nodes into 10 zones. We subsequently employ our earlier Automatic Anatomy Recognition (AAR) framework, designed for body-wide organ modeling, recognition, and delineation, to implement these zonal definitions, where the zones are treated as anatomic objects. First, all 10 zones and the key anatomic organs used as anchors are manually delineated under expert supervision to construct fuzzy anatomy models of the assembly of organs together with the zones. Then, an optimal hierarchical arrangement of these objects is constructed for the purpose of achieving the best zonal recognition. For actual localization of the objects, two strategies are used: optimal thresholded search for the organs, and a one-shot method for the zones, in which the known relationship of the zones to key organs is exploited. Based on 50 computed tomography (CT) image data sets of the pelvic body region, divided equally into training and test subsets, automatic zonal localization within 1-3 voxels is achieved.
Automatic recognition and analysis of synapses. [in brain tissue
NASA Technical Reports Server (NTRS)
Ungerleider, J. A.; Ledley, R. S.; Bloom, F. E.
1976-01-01
An automatic system for recognizing synaptic junctions would allow analysis of large samples of tissue for the possible classification of specific well-defined sets of synapses based upon structural morphometric indices. In this paper the three steps of our system are described: (1) cytochemical tissue preparation to allow easy recognition of the synaptic junctions; (2) transmitting the tissue information to a computer; and (3) analyzing each field to recognize the synapses and make measurements on them.
Health smart home for elders - a tool for automatic recognition of activities of daily living.
Le, Xuan Hoa Binh; Di Mascolo, Maria; Gouin, Alexia; Noury, Norbert
2008-01-01
Elders prefer to live in their own homes, but with aging comes a loss of autonomy and the associated risks. In order to help them live longer in safe conditions, we need a tool that automatically detects loss of autonomy by assessing how well activities of daily living are performed. This article presents an approach for recognizing the activities of an elder living alone in a home equipped with noninvasive sensors.
Unification of automatic target tracking and automatic target recognition
NASA Astrophysics Data System (ADS)
Schachter, Bruce J.
2014-06-01
The subject being addressed is how an automatic target tracker (ATT) and an automatic target recognizer (ATR) can be fused together so tightly and so well that their distinctiveness becomes lost in the merger. This has historically not been the case outside of biology and a few academic papers. The biological model of ATT∪ATR arises from dynamic patterns of activity distributed across many neural circuits and structures (including retina). The information that the brain receives from the eyes is "old news" at the time that it receives it. The eyes and brain forecast a tracked object's future position, rather than relying on received retinal position. Anticipation of the next moment - building up a consistent perception - is accomplished under difficult conditions: motion (eyes, head, body, scene background, target) and processing limitations (neural noise, delays, eye jitter, distractions). Not only does the human vision system surmount these problems, but it has innate mechanisms to exploit motion in support of target detection and classification. Biological vision doesn't normally operate on snapshots. Feature extraction, detection and recognition are spatiotemporal. When vision is viewed as a spatiotemporal process, target detection, recognition, tracking, event detection and activity recognition, do not seem as distinct as they are in current ATT and ATR designs. They appear as similar mechanism taking place at varying time scales. A framework is provided for unifying ATT and ATR.
Face averages enhance user recognition for smartphone security.
Robertson, David J; Kramer, Robin S S; Burton, A Mike
2015-01-01
Our recognition of familiar faces is excellent, and generalises across viewing conditions. However, unfamiliar face recognition is much poorer. For this reason, automatic face recognition systems might benefit from incorporating the advantages of familiarity. Here we put this to the test using the face verification system available on a popular smartphone (the Samsung Galaxy). In two experiments we tested the recognition performance of the smartphone when it was encoded with an individual's 'face-average'--a representation derived from theories of human face perception. This technique significantly improved performance for both unconstrained celebrity images (Experiment 1) and for real faces (Experiment 2): users could unlock their phones more reliably when the device stored an average of the user's face than when they stored a single image. This advantage was consistent across a wide variety of everyday viewing conditions. Furthermore, the benefit did not reduce the rejection of imposter faces. This benefit is brought about solely by consideration of suitable representations for automatic face recognition, and we argue that this is just as important as development of matching algorithms themselves. We propose that this representation could significantly improve recognition rates in everyday settings.
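The 'face-average' representation above is, at its core, a pixel-wise mean over multiple aligned photographs of the same person, which is then enrolled in place of a single image. A minimal sketch of building such an average is shown below; face detection, alignment, and cropping are assumed to have been done upstream, and this is not the phone vendor's implementation.

```python
import numpy as np

def face_average(aligned_faces: list) -> np.ndarray:
    """Pixel-wise mean of several aligned, same-sized face images of one person.
    Enrolling this average rather than a single photo is the idea tested above."""
    stack = np.stack([np.asarray(img, dtype=np.float64) for img in aligned_faces])
    return stack.mean(axis=0)
```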
Event identification by acoustic signature recognition
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dress, W.B.; Kercel, S.W.
1995-07-01
Many events of interest to the security community produce acoustic emissions that are, in principle, identifiable as to cause. Some obvious examples are gunshots, breaking glass, takeoffs and landings of small aircraft, vehicular engine noises, footsteps (high frequencies when on gravel, very low frequencies when on soil), and voices (whispers to shouts). We are investigating wavelet-based methods to extract unique features of such events for classification and identification. We also discuss methods of classification and pattern recognition specifically tailored for acoustic signatures obtained by wavelet analysis. The paper is divided into three parts: completed work, work in progress, and future applications. The completed phase has led to the successful recognition of aircraft types on landing and takeoff. Both small aircraft (twin-engine turboprop) and large (commercial airliners) were included in the study. The project considered the design of a small, field-deployable, inexpensive device. The techniques developed during the aircraft identification phase were then adapted to a multispectral electromagnetic interference monitoring device now deployed in a nuclear power plant. This is a general-purpose wavelet analysis engine, spanning 14 octaves, and can be adapted for other specific tasks. Work in progress is focused on applying the methods previously developed to speaker identification. Some of the problems to be overcome include recognition of sounds as voice patterns and as distinct from possible background noises (e.g., music), as well as identification of the speaker from a short-duration voice sample. A generalization of the completed work and the work in progress is a device capable of classifying any number of acoustic events - particularly quasi-stationary events such as engine noises and voices, and singular events such as gunshots and breaking glass. We will show examples of both kinds of events and discuss their recognition likelihood.
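The wavelet-based feature extraction described above can be approximated with off-the-shelf tools. Below is a minimal sketch, assuming PyWavelets and scikit-learn are available and that labelled event recordings (`signals`, `labels`) have already been segmented; the 14-octave analysis engine and the fielded device are not reproduced, and the feature and classifier choices here are illustrative assumptions rather than the report's design.

```python
import numpy as np
import pywt
from sklearn.ensemble import RandomForestClassifier

def wavelet_features(signal, wavelet="db4", levels=8):
    """Log-energy per wavelet decomposition band: a crude acoustic signature."""
    coeffs = pywt.wavedec(signal, wavelet, level=levels)
    return np.array([np.log(np.sum(c ** 2) + 1e-12) for c in coeffs])

def train_event_classifier(signals, labels):
    """signals: list of 1-D arrays (segmented events); labels: event classes."""
    X = np.vstack([wavelet_features(s) for s in signals])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X, labels)
    return clf
```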
Advancements in robust algorithm formulation for speaker identification of whispered speech
NASA Astrophysics Data System (ADS)
Fan, Xing
Whispered speech is an alternative speech production mode to neutral speech, used intentionally by talkers in natural conversational scenarios to protect privacy and to avoid certain content being overheard or made public. Due to the profound differences between whispered and neutral speech in production mechanism, and the absence of whispered adaptation data, the performance of speaker identification systems trained with neutral speech degrades significantly. This dissertation therefore focuses on developing a robust closed-set speaker recognition system for whispered speech by using no or limited whispered adaptation data from non-target speakers. This dissertation proposes the concept of "High"/"Low" performance whispered data for the purpose of speaker identification. A variety of acoustic properties are identified that contribute to the quality of whispered data. An acoustic analysis is also conducted to compare the phoneme/speaker dependency of the differences between whispered and neutral data in the feature domain. The observations from these acoustic analyses are new in this area and also serve as guidance for developing robust speaker identification systems for whispered speech. This dissertation further proposes two systems for speaker identification of whispered speech. One system focuses on front-end processing. A two-dimensional feature space is proposed to search for "Low"-quality whispered utterances, and separate feature mapping functions are applied to vowels and consonants respectively in order to retain the speaker information shared between whispered and neutral speech. The other system focuses on speech-mode-independent model training. The proposed method generates pseudo-whispered features from neutral features by using the statistical information contained in a whispered Universal Background Model (UBM) trained on additional whispered data collected from non-target speakers. Four modeling methods are proposed for the transformation estimation used to generate the pseudo-whispered features. Both systems demonstrate a significant improvement over the baseline system on the evaluation data. This dissertation has therefore contributed a scientific understanding of the differences between whispered and neutral speech, as well as improved front-end processing and modeling methods for speaker identification of whispered speech. Such advancements will ultimately contribute to improving the robustness of speech processing systems.
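For context, a common baseline for the closed-set speaker identification task targeted here is to train one Gaussian mixture model per enrolled speaker on cepstral features and pick the speaker whose model gives the highest likelihood for a test utterance. The sketch below shows only that baseline, assuming MFCC feature matrices are already extracted per utterance; the dissertation's whisper-specific feature mapping and pseudo-whisper generation are not shown.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def enroll_speakers(features_by_speaker, n_components=16):
    """features_by_speaker: dict speaker_id -> (num_frames, num_ceps) MFCC matrix."""
    models = {}
    for spk, feats in features_by_speaker.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=0)
        gmm.fit(feats)
        models[spk] = gmm
    return models

def identify(models, test_feats):
    """Return the enrolled speaker whose GMM scores the test utterance highest."""
    scores = {spk: gmm.score(test_feats) for spk, gmm in models.items()}
    return max(scores, key=scores.get)
```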
ERIC Educational Resources Information Center
Horton, William S.
2007-01-01
In typical interactions, speakers frequently produce utterances that appear to reflect beliefs about the common ground shared with particular addressees. Horton and Gerrig (2005a) proposed that one important basis for audience design is the manner in which conversational partners serve as cues for the automatic retrieval of associated information…
Frequency of Use Leads to Automaticity of Production: Evidence from Repair in Conversation
ERIC Educational Resources Information Center
Kapatsinski, Vsevolod
2010-01-01
In spontaneous speech, speakers sometimes replace a word they have just produced or started producing by another word. The present study reports that in these replacement repairs, low-frequency replaced words are more likely to be interrupted prior to completion than high-frequency words, providing support to the hypothesis that the production of…
ERIC Educational Resources Information Center
Lansman, Marcy, Ed.; Hunt, Earl, Ed.
This technical report contains papers prepared by the 11 speakers at the 1980 Lake Wilderness (Seattle, Washington) Conference on Attention. The papers are divided into general models, physiological evidence, and visual attention categories. Topics of the papers include the following: (1) willed versus automatic control of behavior; (2) multiple…
Methods and apparatus for non-acoustic speech characterization and recognition
Holzrichter, John F.
1999-01-01
By simultaneously recording EM wave reflections and acoustic speech information, the positions and velocities of the speech organs as speech is articulated can be defined for each acoustic speech unit. Well-defined time frames and feature vectors describing the speech, to the degree required, can be formed. Such feature vectors can uniquely characterize the speech unit being articulated during each time frame. The onset of speech, rejection of external noise, vocalized pitch periods, articulator conditions, accurate timing, the identification of the speaker, acoustic speech unit recognition, and organ mechanical parameters can be determined.
Robust Speaker Authentication Based on Combined Speech and Voiceprint Recognition
NASA Astrophysics Data System (ADS)
Malcangi, Mario
2009-08-01
Personal authentication is becoming increasingly important in many applications that have to protect proprietary data. Passwords and personal identification numbers (PINs) prove not to be robust enough to ensure that unauthorized people do not use them. Biometric authentication technology may offer a secure, convenient, accurate solution but sometimes fails due to its intrinsically fuzzy nature. This research aims to demonstrate that combining two basic speech processing methods, voiceprint identification and speech recognition, can provide a very high degree of robustness, especially if fuzzy decision logic is used.
ERIC Educational Resources Information Center
Frye, Elizabeth M.; Gosky, Ross
2012-01-01
The present study investigated the relationship between rapid recognition of individual words (Word Recognition Test) and two measures of contextual reading: (1) grade-level Passage Reading Test (IRI passage) and (2) performance on standardized STAR Reading Test. To establish if time of presentation on the word recognition test was a factor in…
New technique for real-time distortion-invariant multiobject recognition and classification
NASA Astrophysics Data System (ADS)
Hong, Rutong; Li, Xiaoshun; Hong, En; Wang, Zuyi; Wei, Hongan
2001-04-01
A real-time hybrid distortion-invariant OPR system was established to perform 3D multiobject distortion-invariant automatic pattern recognition. A wavelet transform technique was used for digital preprocessing of the input scene, to suppress the noisy background and enhance the recognized object. A three-layer backpropagation artificial neural network was used in correlation signal post-processing to perform multiobject distortion-invariant recognition and classification. The C-80 and NOA real-time processing ability and multithread programming technology were used to perform high-speed parallel multitask processing and to speed up the post-processing rate for ROIs. The reference filter library was constructed for distorted versions of the 3D object model images, based on distortion parameter tolerance measurements for rotation, azimuth and scale. Real-time optical correlation recognition testing of this OPR system demonstrates that, by using the preprocessing, post-processing, the nonlinear algorithm of optimum filtering, the RFL construction technique and multithread programming technology, a high probability of recognition and a high recognition rate were obtained for the real-time multiobject distortion-invariant OPR system. The recognition reliability and rate were improved greatly. These techniques are very useful for automatic target recognition.
Voice reaction times with recognition for Commodore computers
NASA Technical Reports Server (NTRS)
Washburn, David A.; Putney, R. Thompson
1990-01-01
Hardware and software modifications are presented that allow for collection and recognition by a Commodore computer of spoken responses. Responses are timed with millisecond accuracy and automatically analyzed and scored. Accuracy data for this device from several experiments are presented. Potential applications and suggestions for improving recognition accuracy are also discussed.
Recognition of surface lithologic and topographic patterns in southwest Colorado with ADP techniques
NASA Technical Reports Server (NTRS)
Melhorn, W. N.; Sinnock, S.
1973-01-01
Analysis of ERTS-1 multispectral data by automatic pattern recognition procedures is applicable toward grappling with current and future resource stresses by providing a means for refining existing geologic maps. The procedures used in the current analysis already yield encouraging results toward the eventual machine recognition of extensive surface lithologic and topographic patterns. Automatic mapping of a series of hogbacks, strike valleys, and alluvial surfaces along the northwest flank of the San Juan Basin in Colorado can be obtained by minimal man-machine interaction. The determination of causes for separable spectral signatures is dependent upon extensive correlation of micro- and macro field based ground truth observations and aircraft underflight data with the satellite data.
Automatic Facial Expression Recognition and Operator Functional State
NASA Technical Reports Server (NTRS)
Blanson, Nina
2012-01-01
The prevalence of human error in safety-critical occupations remains a major challenge to mission success despite increasing automation in control processes. Although various methods have been proposed to prevent incidences of human error, none of these have been developed to employ the detection and regulation of Operator Functional State (OFS), or the optimal condition of the operator while performing a task, in work environments due to drawbacks such as obtrusiveness and impracticality. A video-based system with the ability to infer an individual's emotional state from facial feature patterning mitigates some of the problems associated with other methods of detecting OFS, like obtrusiveness and impracticality in integration with the mission environment. This paper explores the utility of facial expression recognition as a technology for inferring OFS by first expounding on the intricacies of OFS and the scientific background behind emotion and its relationship with an individual's state. Then, descriptions of the feedback loop and the emotion protocols proposed for the facial recognition program are explained. A basic version of the facial expression recognition program uses Haar classifiers and OpenCV libraries to automatically locate key facial landmarks during a live video stream. Various methods of creating facial expression recognition software are reviewed to guide future extensions of the program. The paper concludes with an examination of the steps necessary in the research of emotion and recommendations for the creation of an automatic facial expression recognition program for use in real-time, safety-critical missions.
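The detection stage described above (Haar classifiers plus OpenCV on a live video stream) can be sketched as follows. This is a minimal illustration, assuming OpenCV's bundled frontal-face cascade and a default webcam; it detects faces only, and the OFS feedback loop and emotion protocols from the paper are not included.

```python
import cv2

# Load one of OpenCV's bundled Haar cascades (frontal face as an example).
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

cap = cv2.VideoCapture(0)  # default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("faces", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```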
Automatic Facial Expression Recognition and Operator Functional State
NASA Technical Reports Server (NTRS)
Blanson, Nina
2011-01-01
The prevalence of human error in safety-critical occupations remains a major challenge to mission success despite increasing automation in control processes. Although various methods have been proposed to prevent incidences of human error, none of these have been developed to employ the detection and regulation of Operator Functional State (OFS), or the optimal condition of the operator while performing a task, in work environments due to drawbacks such as obtrusiveness and impracticality. A video-based system with the ability to infer an individual's emotional state from facial feature patterning mitigates some of the problems associated with other methods of detecting OFS, like obtrusiveness and impracticality in integration with the mission environment. This paper explores the utility of facial expression recognition as a technology for inferring OFS by first expounding on the intricacies of OFS and the scientific background behind emotion and its relationship with an individual's state. Then, descriptions of the feedback loop and the emotion protocols proposed for the facial recognition program are explained. A basic version of the facial expression recognition program uses Haar classifiers and OpenCV libraries to automatically locate key facial landmarks during a live video stream. Various methods of creating facial expression recognition software are reviewed to guide future extensions of the program. The paper concludes with an examination of the steps necessary in the research of emotion and recommendations for the creation of an automatic facial expression recognition program for use in real-time, safety-critical missions.
Automatic integration of social information in emotion recognition.
Mumenthaler, Christian; Sander, David
2015-04-01
This study investigated the automaticity of the influence of social inference on emotion recognition. Participants were asked to recognize dynamic facial expressions of emotion (fear or anger in Experiment 1 and blends of fear and surprise or of anger and disgust in Experiment 2) in a target face presented at the center of a screen while a subliminal contextual face appearing in the periphery expressed an emotion (fear or anger) or not (neutral) and either looked at the target face or not. Results of Experiment 1 revealed that recognition of the target emotion of fear was improved when a subliminal angry contextual face gazed toward-rather than away from-the fearful face. We replicated this effect in Experiment 2, in which facial expression blends of fear and surprise were more often and more rapidly categorized as expressing fear when the subliminal contextual face expressed anger and gazed toward-rather than away from-the target face. With the contextual face appearing for 30 ms in total, including only 10 ms of emotion expression, and being immediately masked, our data provide the first evidence that social influence on emotion recognition can occur automatically. (c) 2015 APA, all rights reserved.
Automatic image database generation from CAD for 3D object recognition
NASA Astrophysics Data System (ADS)
Sardana, Harish K.; Daemi, Mohammad F.; Ibrahim, Mohammad K.
1993-06-01
The development and evaluation of Multiple-View 3-D object recognition systems is based on a large set of model images. Due to the various advantages of using CAD, it is becoming more and more practical to use existing CAD data in computer vision systems. Current PC-level CAD systems are capable of providing physical image modelling and rendering involving positional variations in cameras, light sources, etc. We have formulated a modular scheme for the automatic generation of various aspects (views) of the objects in a model-based 3-D object recognition system. These views are generated at desired orientations on the unit Gaussian sphere. With a suitable network file system (NFS), the images can be stored directly in a database located on a file server. This paper presents the image modelling solutions using CAD in relation to the multiple-view approach. Our modular scheme for data conversion and automatic image database storage for such a system is discussed. We have used this approach in 3-D polyhedron recognition. An overview of the results, the advantages and limitations of using CAD data, and conclusions from using such a scheme are also presented.
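Generating model views "at desired orientations on the unit Gaussian sphere" amounts to choosing a set of camera directions on a sphere around the CAD model. One common way to obtain roughly uniform directions is a Fibonacci-lattice sampling, sketched below under that assumption; the rendering and database-storage steps are not shown, and `render_cad_model` and `save_to_database` are hypothetical placeholders, not functions from the paper.

```python
import numpy as np

def fibonacci_sphere(n_views):
    """Roughly uniform unit vectors on the sphere (candidate camera directions)."""
    i = np.arange(n_views)
    phi = np.pi * (3.0 - np.sqrt(5.0))          # golden angle
    z = 1.0 - 2.0 * (i + 0.5) / n_views
    r = np.sqrt(1.0 - z * z)
    return np.stack([r * np.cos(phi * i), r * np.sin(phi * i), z], axis=1)

# Hypothetical usage: render the CAD model from each direction and store the image.
# for view_id, direction in enumerate(fibonacci_sphere(64)):
#     image = render_cad_model(model, camera_direction=direction)  # placeholder
#     save_to_database(view_id, image)                             # placeholder
```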
Automatic lip reading by using multimodal visual features
NASA Astrophysics Data System (ADS)
Takahashi, Shohei; Ohya, Jun
2013-12-01
Speech recognition has been researched for a long time, but it does not work well in noisy places such as in a car or on a train. In addition, people who are hearing-impaired or have difficulty hearing cannot benefit from speech recognition. To recognize speech automatically, visual information is also important. People understand speech not only from audio information, but also from visual information such as temporal changes in lip shape. A vision-based speech recognition method could work well in noisy places, and could also be useful for people with hearing disabilities. In this paper, we propose an automatic lip-reading method for recognizing speech using multimodal visual information, without using any audio information. First, the ASM (Active Shape Model) is used to track and detect the face and lips in a video sequence. Second, the shape, optical flow and spatial frequencies of the lip features are extracted from the lips detected by the ASM. Next, the extracted multimodal features are ordered chronologically and a Support Vector Machine is trained to classify the spoken words. Experiments on classifying several words show promising results for the proposed method.
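The final classification step above (chronologically ordered multimodal lip features fed to a support vector machine) can be illustrated with scikit-learn. This is a minimal sketch under the assumption that each utterance has already been reduced to per-frame lip features resampled to a common length; ASM tracking and optical-flow extraction are not shown.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_word_classifier(sequences, labels):
    """sequences: list of (num_frames, num_features) arrays, one per utterance."""
    # Concatenate per-frame features in chronological order into one vector;
    # utterances are assumed to have been resampled to the same number of frames.
    X = np.vstack([seq.reshape(-1) for seq in sequences])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
    clf.fit(X, labels)
    return clf
```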
Neural Processing of Vocal Emotion and Identity
ERIC Educational Resources Information Center
Spreckelmeyer, Katja N.; Kutas, Marta; Urbach, Thomas; Altenmuller, Eckart; Munte, Thomas F.
2009-01-01
The voice is a marker of a person's identity which allows individual recognition even if the person is not in sight. Listening to a voice also affords inferences about the speaker's emotional state. Both these types of personal information are encoded in characteristic acoustic feature patterns analyzed within the auditory cortex. In the present…
Embedding speech into virtual realities
NASA Technical Reports Server (NTRS)
Bohn, Christian-Arved; Krueger, Wolfgang
1993-01-01
In this work a speaker-independent speech recognition system is presented which is suitable for implementation in Virtual Reality applications. The use of an artificial neural network in connection with a special compression of the acoustic input leads to a system which is robust, fast, easy to use and needs no additional hardware besides common VR equipment.
Speaker-dependent Multipitch Tracking Using Deep Neural Networks
2015-01-01
connections through time. Studies have shown that RNNs are good at modeling sequential data like handwriting [12] and speech [26]. We plan to explore RNNs in…
Perceiving and Remembering Events Cross-Linguistically: Evidence from Dual-Task Paradigms
ERIC Educational Resources Information Center
Trueswell, John C.; Papafragou, Anna
2010-01-01
What role does language play during attention allocation in perceiving and remembering events? We recorded adults' eye movements as they studied animated motion events for a later recognition task. We compared native speakers of two languages that use different means of expressing motion (Greek and English). In Experiment 1, eye movements revealed…
2010-12-01
discovered that the NSA is concerned about speaker recognition being vulnerable to man-in-the-middle (MITM) attacks. The professional could tailor an MITM …with the results of the test against the MITM threat. The Collective Acquisition framework comprises powerful search techniques found in the CRC
Effects of Speed of Word Processing on Semantic Access: The Case of Bilingualism
ERIC Educational Resources Information Center
Martin, Clara D.; Costa, Albert; Dering, Benjamin; Hoshino, Noriko; Wu, Yan Jing; Thierry, Guillaume
2012-01-01
Bilingual speakers generally manifest slower word recognition than monolinguals. We investigated the consequences of the word processing speed on semantic access in bilinguals. The paradigm involved a stream of English words and pseudowords presented in succession at a constant rate. English-Welsh bilinguals and English monolinguals were asked to…
Fast Morphological Effects in First and Second Language Word Recognition
ERIC Educational Resources Information Center
Diependaele, Kevin; Dunabeitia, Jon Andoni; Morris, Joanna; Keuleers, Emmanuel
2011-01-01
In three experiments we compared the performance of native English speakers to that of Spanish-English and Dutch-English bilinguals on a masked morphological priming lexical decision task. The results do not show significant differences across the three experiments. In line with recent meta-analyses, we observed a graded pattern of facilitation…
Chinese-Mandarin: Basic Course. Volume VII: Lessons 72-79.
ERIC Educational Resources Information Center
Defense Language Inst., Monterey, CA.
This is the seventh of 16 volumes of audiolingual classroom instruction in Mandarin Chinese. The course is designed to train native English speakers to Level 3 Foreign Service Institute proficiency in comprehension and speaking, and to Level 2 proficiency in reading and writing Mandarin. Facility in the use and recognition of Chinese characters is…
Chinese-Mandarin: Basic Course. Volume IX: Lessons 88-95.
ERIC Educational Resources Information Center
Defense Language Inst., Monterey, CA.
This is the ninth of 16 volumes of audiolingual classroom instruction in Mandarin Chinese. The course is designed to train native English speakers to Level 3 Foreign Service Institute proficiency in comprehension and speaking, and to Level 2 proficiency in reading and writing Mandarin. Facility in the use and recognition of Chinese characters is…
Chinese-Mandarin: Basic Course. Volume VIII: Lessons 80-87.
ERIC Educational Resources Information Center
Defense Language Inst., Monterey, CA.
This is the eighth of 16 volumes of audiolingual classroom instruction in Mandarin Chinese. The course is designed to train native English speakers to Level 3 Foreign Service Institute proficiency in comprehension and speaking, and to Level 2 proficiency in reading and writing Mandarin. Facility in the use and recognition of Chinese characters is…
Vignally, P; Fondi, G; Taggi, F; Pitidis, A
2011-03-31
In Italy the European Union Injury Database reports the involvement of chemical products in 0.9% of home and leisure accidents. The Emergency Department registry on domestic accidents in Italy and the Poison Control Centres record that 90% of cases of exposure to toxic substances occur in the home. It is not rare for the effects of chemical agents to be observed in hospitals, with a high potential risk of damage - the rate of this cause of hospital admission is double the domestic injury average. The aim of this study was to monitor the effects of injuries caused by caustic agents in Italy using automatic free-text recognition in Emergency Department medical databases. We created a Stata software program to automatically identify caustic or corrosive injury cases using an agent-specific list of keywords. We focused attention on the procedure's sensitivity and specificity. Ten hospitals in six regions of Italy participated in the study. The program identified 112 cases of injury by caustic or corrosive agents. Checking the cases by quality controls (based on manual reading of ED reports), we assessed 99 cases as true positive, i.e. 88.4% of the patients were automatically recognized by the software as being affected by caustic substances (99% CI: 80.6%-96.2%), that is to say 0.59% (99% CI: 0.45%-0.76%) of the whole sample of home injuries, a value almost three times as high as that expected (p < 0.0001) from European codified information. False positives were 11.6% of the recognized cases (99% CI: 5.1%-21.5%). Our automatic procedure for caustic agent identification proved to have excellent product recognition capacity with an acceptable level of excess sensitivity. Contrary to our a priori hypothesis, the automatic recognition system provided a level of identification of agents possessing caustic effects that was significantly greater than was predictable on the basis of the values from current codifications reported in the European Database.
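The automatic free-text recognition step was implemented by the authors in Stata; the same keyword-matching idea can be sketched in Python. This is an illustrative sketch only: the keyword list and record fields below are hypothetical placeholders, not the study's actual agent-specific list, and the second function computes the share of flagged cases confirmed by manual review (the 88.4% figure reported above).

```python
import re

# Hypothetical, abbreviated keyword list for caustic/corrosive agents.
KEYWORDS = ["caustic", "corrosive", "bleach", "lye", "sodium hydroxide", "acid burn"]
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)

def flag_caustic_cases(ed_reports):
    """ed_reports: list of (case_id, free_text) tuples; returns flagged case ids."""
    return [case_id for case_id, text in ed_reports if PATTERN.search(text)]

def positive_predictive_value(flagged, confirmed):
    """Fraction of automatically flagged cases confirmed as true positives."""
    flagged, confirmed = set(flagged), set(confirmed)
    return len(flagged & confirmed) / len(flagged) if flagged else float("nan")
```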
A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC
Clematide, Simon; Akhondi, Saber A; van Mulligen, Erik M; Rebholz-Schuhmann, Dietrich
2015-01-01
Objective To create a multilingual gold-standard corpus for biomedical concept recognition. Materials and methods We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified Medical Language System and covering a wide range of semantic groups. To reduce the annotation workload, automatically generated preannotations were provided. Individual annotations were automatically harmonized and then adjudicated, and cross-language consistency checks were carried out to arrive at the final annotations. Results The number of final annotations was 5530. Inter-annotator agreement scores indicate good agreement (median F-score 0.79), and are similar to those between individual annotators and the gold standard. The automatically generated harmonized annotation set for each language performed equally well as the best annotator for that language. Discussion The use of automatic preannotations, harmonized annotations, and parallel corpora helped to keep the manual annotation efforts manageable. The inter-annotator agreement scores provide a reference standard for gauging the performance of automatic annotation techniques. Conclusion To our knowledge, this is the first gold-standard corpus for biomedical concept recognition in languages other than English. Other distinguishing features are the wide variety of semantic groups that are being covered, and the diversity of text genres that were annotated. PMID:25948699
Sardelis, Stephanie; Drew, Joshua A.
2016-01-01
The scientific community faces numerous challenges in achieving gender equality among its participants. One method of highlighting the contributions made by female scientists is through their selection as featured speakers in symposia held at the conferences of professional societies. Because they are specially invited, symposia speakers obtain a prestigious platform from which to display their scientific research, which can elevate the recognition of female scientists. We investigated the number of female symposium speakers in two professional societies (the Society of Conservation Biology (SCB) from 1999 to 2015, and the American Society of Ichthyologists and Herpetologists (ASIH) from 2005 to 2015), in relation to the number of female symposium organizers. Overall, we found that 36.4% of symposia organizers and 31.7% of symposia speakers were women at the Society of Conservation Biology conferences, while 19.1% of organizers and 28% of speakers were women at the American Society of Ichthyologists and Herpetologists conferences. For each additional female organizer at the SCB and ASIH conferences, there was an average increase of 95% and 70% female speakers, respectively. As such, we found a significant positive relationship between the number of women organizing a symposium and the number of women speaking in that symposium. We did not, however, find a significant increase in the number of women speakers or organizers per symposium over time at either conference, suggesting a need for revitalized efforts to diversify our scientific societies. To further those ends, we suggest facilitating gender equality in professional societies by removing barriers to participation, including assisting with travel, making conferences child-friendly, and developing thorough, mandatory Codes of Conduct for all conferences. PMID:27467580
Automatic Recognition of Fetal Facial Standard Plane in Ultrasound Image via Fisher Vector.
Lei, Baiying; Tan, Ee-Leng; Chen, Siping; Zhuo, Liu; Li, Shengli; Ni, Dong; Wang, Tianfu
2015-01-01
Acquisition of the standard plane is the prerequisite of biometric measurement and diagnosis during the ultrasound (US) examination. In this paper, a new algorithm is developed for the automatic recognition of the fetal facial standard planes (FFSPs) such as the axial, coronal, and sagittal planes. Specifically, densely sampled root scale invariant feature transform (RootSIFT) features are extracted and then encoded by Fisher vector (FV). The Fisher network with multi-layer design is also developed to extract spatial information to boost the classification performance. Finally, automatic recognition of the FFSPs is implemented by support vector machine (SVM) classifier based on the stochastic dual coordinate ascent (SDCA) algorithm. Experimental results using our dataset demonstrate that the proposed method achieves an accuracy of 93.27% and a mean average precision (mAP) of 99.19% in recognizing different FFSPs. Furthermore, the comparative analyses reveal the superiority of the proposed method based on FV over the traditional methods.
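The RootSIFT descriptors used in the first stage of this pipeline are obtained from standard SIFT descriptors by L1-normalising each descriptor and taking the element-wise square root. A minimal sketch with OpenCV is given below; it uses keypoint-based rather than the paper's dense sampling, and the Fisher vector encoding, Fisher network and SDCA-based SVM are not reproduced.

```python
import cv2
import numpy as np

sift = cv2.SIFT_create()

def rootsift_descriptors(gray_image, eps=1e-7):
    """SIFT descriptors converted to RootSIFT (L1-normalise, then square root)."""
    keypoints, desc = sift.detectAndCompute(gray_image, None)
    if desc is None:
        return np.empty((0, 128), dtype=np.float32)
    desc = desc / (desc.sum(axis=1, keepdims=True) + eps)  # L1 normalisation
    return np.sqrt(desc)                                   # element-wise square root
```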
Key features for ATA / ATR database design in missile systems
NASA Astrophysics Data System (ADS)
Özertem, Kemal Arda
2017-05-01
Automatic target acquisition (ATA) and automatic target recognition (ATR) are two vital tasks for missile systems, and having a robust detection and recognition algorithm is crucial for overall system performance. In order to have a robust target detection and recognition algorithm, an extensive image database is required. Automatic target recognition algorithms use the image database in the training and testing steps of the algorithm. This directly affects recognition performance, since the training accuracy is driven by the quality of the image database. In addition, the performance of an automatic target detection algorithm can be measured effectively by using an image database. There are two main ways to design an ATA / ATR database. The first and easy way is to use a scene generator. A scene generator can model the objects by considering material information, atmospheric conditions, detector type and the territory. Designing an image database using a scene generator is inexpensive, and it allows many different scenarios to be created quickly and easily. However, the major drawback of using a scene generator is its low fidelity, since the images are created virtually. The second and difficult way is to design the database using real-world images. Designing an image database with real-world images is far more costly and time consuming; however, it offers high fidelity, which is critical for missile algorithms. In this paper, critical concepts in ATA / ATR database design with real-world images are discussed. Each concept is discussed from the perspective of ATA and ATR separately. For the implementation stage, some possible solutions and trade-offs for creating the database are proposed, and all proposed approaches are compared to each other with regard to their pros and cons.
Cost/benefit analysis of electronic license plates
DOT National Transportation Integrated Search
2008-06-01
The objective of this report is to determine whether electronic vehicle recognition systems (EVR) or automatic license plate recognition systems (ALPR) would be beneficial to the Arizona Department of Transportation (AZDOT). EVR uses radio frequency ...
An adaptive Hidden Markov Model for activity recognition based on a wearable multi-sensor device
USDA-ARS?s Scientific Manuscript database
Human activity recognition is important in the study of personal health, wellness and lifestyle. In order to acquire human activity information from the personal space, many wearable multi-sensor devices have been developed. In this paper, a novel technique for automatic activity recognition based o...
Tension between scientific certainty and meaning complicates communication of IPCC reports
NASA Astrophysics Data System (ADS)
Hollin, G. J. S.; Pearce, W.
2015-08-01
Here we demonstrate that speakers at the press conference for the publication of the IPCC's Fifth Assessment Report (Working Group 1) attempted to make the documented level of certainty of anthropogenic global warming (AGW) more meaningful to the public. Speakers attempted to communicate this through reference to short-term temperature increases. However, when journalists enquired about the similarly short 'pause' in global temperature increase, the speakers dismissed the relevance of such timescales, thus becoming incoherent as to 'what counts' as scientific evidence for AGW. We call this the 'IPCC's certainty trap'. This incoherence led to confusion within the press conference and subsequent condemnation in the media. The speakers were well intentioned in their attempts to communicate the public implications of the report, but these attempts threatened to erode their scientific credibility. In this instance, the certainty trap was the result of the speakers' failure to acknowledge the tensions between scientific and public meanings. Avoiding the certainty trap in the future will require a nuanced accommodation of uncertainties and a recognition that rightful demands for scientific credibility need to be balanced with public and political dialogue about the things we value and the actions we take to protect those things.
Vieira, Manuel; Fonseca, Paulo J; Amorim, M Clara P; Teixeira, Carlos J C
2015-12-01
The study of acoustic communication in animals often requires not only the recognition of species specific acoustic signals but also the identification of individual subjects, all in a complex acoustic background. Moreover, when very long recordings are to be analyzed, automatic recognition and identification processes are invaluable tools to extract the relevant biological information. A pattern recognition methodology based on hidden Markov models is presented inspired by successful results obtained in the most widely known and complex acoustical communication signal: human speech. This methodology was applied here for the first time to the detection and recognition of fish acoustic signals, specifically in a stream of round-the-clock recordings of Lusitanian toadfish (Halobatrachus didactylus) in their natural estuarine habitat. The results show that this methodology is able not only to detect the mating sounds (boatwhistles) but also to identify individual male toadfish, reaching an identification rate of ca. 95%. Moreover this method also proved to be a powerful tool to assess signal durations in large data sets. However, the system failed in recognizing other sound types.
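The hidden-Markov-model recognition stage can be illustrated with the hmmlearn and librosa packages: one HMM is trained per sound class (or per individual male) on frame-level spectral features, and an unknown call is assigned to the class whose model gives the highest log-likelihood. This is a minimal sketch under those assumptions, not the authors' implementation; it omits the detection/segmentation stage, and the sampling rate and model sizes are arbitrary choices.

```python
import numpy as np
import librosa
from hmmlearn import hmm

def mfcc_frames(path, sr=4000, n_mfcc=13):
    """Frame-level MFCCs; sr=4000 is an arbitrary choice for low-frequency calls."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, coeffs)

def train_models(training_files_by_class, n_states=5):
    """training_files_by_class: dict class_label -> list of wav paths."""
    models = {}
    for label, paths in training_files_by_class.items():
        frames = [mfcc_frames(p) for p in paths]
        X = np.vstack(frames)
        lengths = [f.shape[0] for f in frames]
        model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        model.fit(X, lengths)
        models[label] = model
    return models

def classify(models, wav_path):
    """Assign an unknown call to the class whose HMM scores it highest."""
    feats = mfcc_frames(wav_path)
    return max(models, key=lambda label: models[label].score(feats))
```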
ERIC Educational Resources Information Center
Adel, Annelie; Erman, Britt
2012-01-01
In order for discourse to be considered idiomatic, it needs to exhibit features like fluency and pragmatically appropriate language use. Advances in corpus linguistics make it possible to examine idiomaticity from the perspective of recurrent word combinations. One approach to capture such word combinations is by the automatic retrieval of lexical…
Google Home: smart speaker as environmental control unit.
Noda, Kenichiro
2017-08-23
Environmental Control Units (ECU) are devices or systems that allow a person to control appliances in their home or work environment. Such a system can be utilized by clients with physical and/or functional disability to enhance their ability to control their environment, to promote independence and to improve their quality of life. Over the last several years, there has been an emergence of several inexpensive, commercially available, voice-activated smart speakers on the market, such as Google Home and Amazon Echo. These smart speakers are equipped with far-field microphones that support voice recognition and allow complete hands-free operation for various purposes, including playing music, information retrieval and, most importantly, environmental control. Clients with disability could utilize these features to turn the unit into a simple ECU that is completely voice activated and wirelessly connected to appliances. Smart speakers, with their ease of setup, low cost and versatility, may be a more affordable and accessible alternative to the traditional ECU. Implications for Rehabilitation: Environmental Control Units (ECU) enable independence for physically and functionally disabled clients, and reduce the burden and frequency of demands on carers. Traditional ECUs can be costly and may require clients to learn specialized skills to use. Smart speakers have the potential to be used as a new-age ECU by overcoming these barriers, and can be used by a wider range of clients.
Prosody's Contribution to Fluency: An Examination of the Theory of Automatic Information Processing
ERIC Educational Resources Information Center
Schrauben, Julie E.
2010-01-01
LaBerge and Samuels' (1974) theory of automatic information processing in reading offers a model that explains how and where the processing of information occurs and the degree to which processing of information occurs. These processes are dependent upon two criteria: accurate word decoding and automatic word recognition. However, LaBerge and…
Adaptation to novel accents by toddlers
White, Katherine S.; Aslin, Richard N.
2010-01-01
Word recognition is a balancing act: listeners must be sensitive to phonetic detail to avoid confusing similar words, yet, at the same time, be flexible enough to adapt to phonetically variable pronunciations, such as those produced by speakers of different dialects or by non-native speakers. Recent work has demonstrated that young toddlers are sensitive to phonetic detail during word recognition; pronunciations that deviate from the typical phonological form lead to a disruption of processing. However, it is not known whether young word learners show the flexibility that is characteristic of adult word recognition. The present study explores whether toddlers can adapt to artificial accents in which there is a vowel category shift with respect to the native language. 18–20-month-olds heard mispronunciations of familiar words (e.g., vowels were shifted from [a] to [æ]: “dog” pronounced as “dag”). In test, toddlers were tolerant of mispronunciations if they had recently been exposed to the same vowel shift, but not if they had been exposed to standard pronunciations or other vowel shifts. The effects extended beyond particular items heard in exposure to words sharing the same vowels. These results indicate that, like adults, toddlers show flexibility in their interpretation of phonological detail. Moreover, they suggest that effects of top-down knowledge on the reinterpretation of phonological detail generalize across the phono-lexical system. PMID:21479106
Jung, Jaehoon; Yoon, Inhye; Paik, Joonki
2016-01-01
This paper presents an object occlusion detection algorithm using object depth information that is estimated by automatic camera calibration. The object occlusion problem is a major factor degrading the performance of object tracking and recognition. To detect an object occlusion, the proposed algorithm consists of three steps: (i) automatic camera calibration using both moving objects and a background structure; (ii) object depth estimation; and (iii) detection of occluded regions. The proposed algorithm estimates the depth of the object without extra sensors but with a generic red, green and blue (RGB) camera. As a result, the proposed algorithm can be applied to improve the performance of object tracking and object recognition algorithms for video surveillance systems. PMID:27347978
SAM: speech-aware applications in medicine to support structured data entry.
Wormek, A. K.; Ingenerf, J.; Orthner, H. F.
1997-01-01
In the last two years, improvement in speech recognition technology has directed the medical community's interest to porting and using such innovations in clinical systems. The acceptance of speech recognition systems in clinical domains increases with recognition speed, large medical vocabulary, high accuracy, continuous speech recognition, and speaker independence. Although some commercial speech engines approach these requirements, the greatest benefit can be achieved in adapting a speech recognizer to a specific medical application. The goals of our work are first, to develop a speech-aware core component which is able to establish connections to speech recognition engines of different vendors. This is realized in SAM. Second, with applications based on SAM we want to support the physician in his/her routine clinical care activities. Within the STAMP project (STAndardized Multimedia report generator in Pathology), we extend SAM by combining a structured data entry approach with speech recognition technology. Another speech-aware application in the field of Diabetes care is connected to a terminology server. The server delivers a controlled vocabulary which can be used for speech recognition. PMID:9357730
Photonic correlator pattern recognition: Application to autonomous docking
NASA Technical Reports Server (NTRS)
Sjolander, Gary W.
1991-01-01
Optical correlators for real-time automatic pattern recognition applications have recently become feasible due to advances in high speed devices and filter formulation concepts. The devices are discussed in the context of their use in autonomous docking.
Measurement Marker Recognition In A Time Sequence Of Infrared Images For Biomedical Applications
NASA Astrophysics Data System (ADS)
Fiorini, A. R.; Fumero, R.; Marchesi, R.
1986-03-01
In thermographic measurements, quantitative surface temperature evaluation is often uncertain. The main reason is the lack of available reference points in transient conditions. Reflective markers were used for automatic marker recognition and pixel coordinate computation. An algorithm selects marker icons to match marker references where particular luminance conditions are satisfied. Automatic marker recognition allows luminance compensation and temperature calibration of recorded infrared images. A biomedical application is presented: the dynamic behaviour of the surface temperature distributions is investigated in order to study the performance of two different pumping systems for extracorporeal circulation. Sequences of images are compared and results are discussed. Finally, the algorithm allows the experimental environment to be monitored and alerts to the presence of unusual experimental conditions.
The influence of ambient speech on adult speech productions through unintentional imitation.
Delvaux, Véronique; Soquet, Alain
2007-01-01
This paper deals with the influence of ambient speech on individual speech productions. A methodological framework is defined to gather the experimental data necessary to feed computer models simulating self-organisation in phonological systems. Two experiments were carried out. Experiment 1 was run on French native speakers from two regiolects of Belgium: two from Liège and two from Brussels. When exposed to the way of speaking of the other regiolect via loudspeakers, the speakers of one regiolect produced vowels that were significantly different from their typical realisations, and significantly closer to the way of speaking specific of the other regiolect. Experiment 2 achieved a replication of the results for 8 Mons speakers hearing a Liège speaker. A significant part of the imitative effect remained up to 10 min after the end of the exposure to the other regiolect productions. As a whole, the results suggest that: (i) imitation occurs automatically and unintentionally, (ii) the modified realisations leave a memory trace, in which case the mechanism may be better defined as 'mimesis' than as 'imitation'. The potential effects of multiple imitative speech interactions on sound change are discussed in this paper, as well as the implications for a general theory of phonetic implementation and phonetic representation.
Influence of encoding focus and stereotypes on source monitoring event-related-potentials.
Leynes, P Andrew; Nagovsky, Irina
2016-01-01
Source memory, memory for the origin of a memory, can be influenced by stereotypes and the information of focus during encoding processes. Participants studied words from two different speakers (male or female) using self-focus or other-focus encoding. Source judgments for the speaker's voice and Event-Related Potentials (ERPs) were recorded during test. Self-focus encoding increased dependence on stereotype information and the Late Posterior Negativity (LPN). The results link the LPN with an increase in systematic decision processes such as consulting prior knowledge to support an episodic memory judgment. In addition, other-focus encoding increased conditional source judgments and resulted in weaker old/new recognition relative to the self-focus encoding. The putative correlate of recollection (LPC) was absent during this condition and this was taken as evidence that recollection of partial information supported source judgments. Collectively, the results suggest that other-focus encoding changes source monitoring processing by altering the weight of specific memory features. Copyright © 2015 Elsevier B.V. All rights reserved.
Automatic speech recognition using a predictive echo state network classifier.
Skowronski, Mark D; Harris, John G
2007-04-01
We have combined an echo state network (ESN) with a competitive state machine framework to create a classification engine called the predictive ESN classifier. We derive the expressions for training the predictive ESN classifier and show that the model was significantly more noise robust compared to a hidden Markov model in noisy speech classification experiments by 8+/-1 dB signal-to-noise ratio. The simple training algorithm and noise robustness of the predictive ESN classifier make it an attractive classification engine for automatic speech recognition.
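For readers unfamiliar with echo state networks, the reservoir at the core of such a classifier can be written in a few lines of NumPy: a fixed random recurrent layer is driven by the input sequence, and only a linear readout on the reservoir states is trained. The sketch below shows the state update only, under assumed weight scalings; the competitive state machine and predictive classification framework of the paper are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_reservoir(n_inputs, n_reservoir=200, spectral_radius=0.9):
    """Fixed random input and recurrent weights; only a readout would be trained."""
    W_in = rng.uniform(-0.5, 0.5, size=(n_reservoir, n_inputs))
    W = rng.uniform(-0.5, 0.5, size=(n_reservoir, n_reservoir))
    W *= spectral_radius / max(abs(np.linalg.eigvals(W)))  # scale for the echo state property
    return W_in, W

def run_reservoir(W_in, W, inputs, leak=1.0):
    """inputs: (T, n_inputs) feature sequence; returns (T, n_reservoir) states."""
    x = np.zeros(W.shape[0])
    states = []
    for u in inputs:
        x = (1 - leak) * x + leak * np.tanh(W_in @ u + W @ x)
        states.append(x.copy())
    return np.array(states)
```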
Speech processing using maximum likelihood continuity mapping
Hogden, John E.
2000-01-01
Speech processing is obtained that, given a probabilistic mapping between static speech sounds and pseudo-articulator positions, allows sequences of speech sounds to be mapped to smooth sequences of pseudo-articulator positions. In addition, a method for learning a probabilistic mapping between static speech sounds and pseudo-articulator position is described. The method for learning the mapping between static speech sounds and pseudo-articulator position uses a set of training data composed only of speech sounds. The said speech processing can be applied to various speech analysis tasks, including speech recognition, speaker recognition, speech coding, speech synthesis, and voice mimicry.
Quantity Recognition among Speakers of an Anumeric Language
ERIC Educational Resources Information Center
Everett, Caleb; Madora, Keren
2012-01-01
Recent research has suggested that the Piraha, an Amazonian tribe with a number-less language, are able to match quantities greater than 3 if the matching task does not require recall or spatial transposition. This finding contravenes previous work among the Piraha. In this study, we re-tested the Pirahas' performance in the crucial one-to-one…
The Storage and Processing of Morphologically Complex Words in L2 Spanish
ERIC Educational Resources Information Center
Foote, Rebecca
2017-01-01
Research with native speakers indicates that, during word recognition, regularly inflected words undergo parsing that segments them into stems and affixes. In contrast, studies with learners suggest that this parsing may not take place in L2. This study's research questions are: Do L2 Spanish learners store and process regularly inflected,…
ERIC Educational Resources Information Center
Liu, Fang; Xu, Yi; Patel, Aniruddh D.; Francart, Tom; Jiang, Cunmei
2012-01-01
This study examined whether "melodic contour deafness" (insensitivity to the direction of pitch movement) in congenital amusia is associated with specific types of pitch patterns (discrete versus gliding pitches) or stimulus types (speech syllables versus complex tones). Thresholds for identification of pitch direction were obtained using discrete…
ERIC Educational Resources Information Center
Treurniet, William
A study applied artificial neural networks, trained with the back-propagation learning algorithm, to modelling phonemes extracted from the DARPA TIMIT multi-speaker, continuous speech data base. A number of proposed network architectures were applied to the phoneme classification task, ranging from the simple feedforward multilayer network to more…
American or British? L2 Speakers' Recognition and Evaluations of Accent Features in English
ERIC Educational Resources Information Center
Carrie, Erin; McKenzie, Robert M.
2018-01-01
Recent language attitude research has attended to the processes involved in identifying and evaluating spoken language varieties. This article investigates the ability of second-language learners of English in Spain (N = 71) to identify Received Pronunciation (RP) and General American (GenAm) speech and their perceptions of linguistic variation…
ERIC Educational Resources Information Center
Malins, Jeffrey G.; Joanisse, Marc F.
2012-01-01
We investigated the influences of phonological similarity on the time course of spoken word processing in Mandarin Chinese. Event related potentials were recorded while adult native speakers of Mandarin ("N" = 19) judged whether auditory words matched or mismatched visually presented pictures. Mismatching words were of the following…
Experimental Pragmatics and What Is Said: A Response to Gibbs and Moise.
ERIC Educational Resources Information Center
Nicolle, Steve; Clark, Billy
1999-01-01
Attempted replication of Gibbs and Moise (1997) experiments regarding the recognition of a distinction between what is said and what is implicated. Results showed that, under certain conditions, subjects selected implicatures when asked to select the paraphrase best reflecting what a speaker has said. Suggests that results can be explained with the…
Costs and Effects of Dual-Language Immersion in the Portland Public Schools
ERIC Educational Resources Information Center
Steele, Jennifer L.; Slater, Robert; Li, Jennifer; Zamarro, Gema; Miller, Trey
2015-01-01
Though it is estimated that about half of the world's population is bilingual, the estimate for the United States is well below 20% (Grosjean, 2010). Amid growing recognition of the need for second language skills to facilitate international commerce and national security and to enhance learning opportunities for non-native speakers of English,…
Thai Automatic Speech Recognition
2005-01-01
used in an external DARPA evaluation involving medical scenarios between an American Doctor and a naïve monolingual Thai patient. 2. Thai Language... dictionary generation more challenging, and (3) the lack of word segmentation, which calls for automatic segmentation approaches to make n-gram language...requires a dictionary and provides various segmentation algorithms to automatically select suitable segmentations. Here we used a maximal matching
Reviewing the connection between speech and obstructive sleep apnea.
Espinoza-Cuadros, Fernando; Fernández-Pozo, Rubén; Toledano, Doroteo T; Alcázar-Ramírez, José D; López-Gonzalo, Eduardo; Hernández-Gómez, Luis A
2016-02-20
Sleep apnea (OSA) is a common sleep disorder characterized by recurring breathing pauses during sleep caused by a blockage of the upper airway (UA). The altered UA structure or function in OSA speakers has led to the hypothesis that automatic analysis of speech could be used for OSA assessment. In this paper we critically review several approaches using speech analysis and machine learning techniques for OSA detection, and discuss the limitations that can arise when using machine learning techniques for diagnostic applications. A large speech database including 426 male Spanish speakers suspected to suffer from OSA and referred to a sleep disorders unit was used to study the clinical validity of several proposals using machine learning techniques to predict the apnea-hypopnea index (AHI) or classify individuals according to their OSA severity. The AHI describes the severity of the patient's condition. We first evaluate AHI prediction using state-of-the-art speaker recognition technologies: speech spectral information is modelled using supervector or i-vector techniques, and AHI is predicted through support vector regression (SVR). Using the same database we then critically review several OSA classification approaches previously proposed. The influence and possible interference of other clinical variables or characteristics available for our OSA population (age, height, weight, body mass index, and cervical perimeter) are also studied. The poor results obtained when estimating AHI using supervectors or i-vectors followed by SVR contrast with the positive results reported by previous research. This fact prompted us to carefully review those approaches, also testing some reported results over our database. Several methodological limitations and deficiencies were detected that may have led to overoptimistic results. The methodological deficiencies observed after critically reviewing previous research can be relevant examples of potential pitfalls when using machine learning techniques for diagnostic applications. We have found two common limitations that can explain the likelihood of false discovery in previous research: (1) the use of prediction models derived from sources, such as speech, which are also correlated with other patient characteristics (age, height, sex, …) that act as confounding factors; and (2) overfitting of feature selection and validation methods when working with a high number of variables compared to the number of cases. We hope this study can not only serve as a useful example of relevant issues when using machine learning for medical diagnosis, but also help guide further research on the connection between speech and OSA.
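One of the pitfalls identified above, overfitting of feature selection and hyper-parameters that are validated on the same data, is avoided by keeping all model selection inside the training folds. Below is a minimal sketch of such a nested cross-validation protocol with scikit-learn, assuming an utterance-level feature matrix X and AHI targets y are already available; it illustrates the validation protocol only, not the i-vector or supervector front ends.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold

def evaluate_ahi_regressor(X, y):
    """Nested cross-validation: hyper-parameters are tuned only inside each training fold."""
    pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
    param_grid = {"svr__C": [1, 10, 100], "svr__epsilon": [0.1, 1.0]}
    inner = KFold(n_splits=5, shuffle=True, random_state=0)
    outer = KFold(n_splits=5, shuffle=True, random_state=1)
    model = GridSearchCV(pipe, param_grid, cv=inner, scoring="neg_mean_absolute_error")
    scores = cross_val_score(model, X, y, cv=outer, scoring="neg_mean_absolute_error")
    return -scores.mean()  # mean absolute error in AHI units
```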
Chen, Yibing; Ogata, Taiki; Ueyama, Tsuyoshi; Takada, Toshiyuki; Ota, Jun
2018-01-01
Machine vision is playing an increasingly important role in industrial applications, and the automated design of image recognition systems has been a subject of intense research. This study proposes a system for automatically designing the field-of-view (FOV) of a camera, the illumination strength and the parameters of a recognition algorithm. We formulated the design problem as an optimisation problem and used an experiment based on a hierarchical algorithm to solve it. The evaluation experiments using translucent plastic objects showed that the proposed system produced an effective solution with a wide FOV, recognition of all objects and maximal positional and angular errors of 0.32 mm and 0.4° when all the RGB (red, green and blue) channels were used for illumination and the R channel image was used for recognition. Although all-RGB illumination with a grey-scale image also provided recognition of all the objects, only a narrow FOV was selected. Moreover, full recognition was not achieved by using only G illumination and a grey-scale image. The results showed that the proposed method can automatically design the FOV, illumination and parameters of the recognition algorithm and that tuning all the RGB illumination is desirable even when single-channel or grey-scale images are used for recognition. PMID:29786665
Emotion Analysis of Telephone Complaints from Customer Based on Affective Computing.
Gong, Shuangping; Dai, Yonghui; Ji, Jun; Wang, Jinzhao; Sun, Hai
2015-01-01
Customer complaints are an important form of feedback that modern enterprises use to improve product and service quality as well as customer loyalty. As one of the most common channels for customer complaints, telephone communication carries rich emotional information in speech, which provides valuable resources for gauging customer satisfaction and studying complaint-handling skills. This paper studies the characteristics of telephone complaint speech and proposes an analysis method based on affective computing technology, which can recognize the dynamic changes of customer emotions in the conversations between service staff and the customer. The recognition process includes speaker recognition, emotional feature parameter extraction, and dynamic emotion recognition. Experimental results show that this method is effective and achieves high recognition rates for happy and angry states. It has been successfully applied to operation quality and service administration in a telecom and Internet service company.
V2S: Voice to Sign Language Translation System for Malaysian Deaf People
NASA Astrophysics Data System (ADS)
Mean Foong, Oi; Low, Tang Jung; La, Wai Wan
The process of learning and understanding sign language may be cumbersome to some, so this paper proposes a solution to this problem by providing a voice (English language) to sign language translation system using speech and image processing techniques. Speech processing, which includes speech recognition, is the study of recognizing the words being spoken regardless of who the speaker is. This project uses template-based recognition as the main approach, in which the V2S system first needs to be trained with speech patterns based on a generic spectral parameter set. These spectral parameter sets are then stored as templates in a database. The system performs the recognition process by matching the parameter set of the input speech against the stored templates and finally displays the sign language in video format. Empirical results show that the system has an 80.3% recognition rate.
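As a rough illustration of template-based recognition of the kind described above (not the V2S implementation), the sketch below matches an input sequence of spectral feature vectors against stored word templates with dynamic time warping and returns the closest label.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two (frames x dims) feature sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])      # local frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def recognize(utterance, templates):
    """templates: dict mapping a word label to its stored feature sequence."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))
```

The recognized label would then index the corresponding sign-language video clip for display.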
Method for automatic detection of wheezing in lung sounds.
Riella, R J; Nohama, P; Maia, J M
2009-07-01
The present report describes the development of a technique for automatic wheezing recognition in digitally recorded lung sounds. The method is based on the extraction and processing of spectral information from the respiratory cycle and the use of these data for user feedback and automatic recognition. The respiratory cycle is first pre-processed to normalize its spectral information, and its spectrogram is then computed. After this procedure, the spectrogram image is processed by a two-dimensional convolution filter and a half-threshold in order to increase the contrast and isolate its highest-amplitude components, respectively. Then, to generate more compact data for automatic recognition, the spectral projection of the processed spectrogram is computed and stored as an array. The highest-magnitude values of the array and their respective spectral values are located and used as inputs to a multi-layer perceptron artificial neural network, which produces an automatic indication of the presence of wheezes. For validation of the methodology, lung sounds recorded from three different repositories were used. The results show that the proposed technique achieves 84.82% accuracy in the detection of wheezing for an isolated respiratory cycle and 92.86% accuracy when detection is carried out using groups of respiratory cycles obtained from the same person. The system also presents the original recorded sound and the post-processed spectrogram image so that users can draw their own conclusions from the data.
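A rough sketch of a front end in the spirit of the pipeline above, assuming a mono respiratory cycle sampled at 8 kHz; the filter size, threshold, number of peaks, and network shape are illustrative choices, not the paper's values.

```python
import numpy as np
from scipy.signal import spectrogram, convolve2d
from sklearn.neural_network import MLPClassifier

def spectral_projection_features(cycle, fs=8000, n_peaks=8):
    """Compute projection-based features for one respiratory cycle."""
    f, t, sxx = spectrogram(cycle, fs=fs, nperseg=256, noverlap=128)
    sxx = sxx / (sxx.max() + 1e-12)                            # normalise spectral energy
    sxx = convolve2d(sxx, np.ones((3, 3)) / 9.0, mode="same")  # 2-D smoothing filter
    sxx[sxx < 0.5 * sxx.max()] = 0.0                           # half-threshold
    proj = sxx.sum(axis=1)                                     # project over time
    peaks = np.argsort(proj)[-n_peaks:]                        # strongest spectral bins
    return np.concatenate([proj[peaks], f[peaks]])             # magnitudes and frequencies

# Features from labelled cycles would then train a small MLP, e.g.:
# clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000).fit(X_train, y_train)
```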
A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC.
Kors, Jan A; Clematide, Simon; Akhondi, Saber A; van Mulligen, Erik M; Rebholz-Schuhmann, Dietrich
2015-09-01
To create a multilingual gold-standard corpus for biomedical concept recognition. We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified Medical Language System and covering a wide range of semantic groups. To reduce the annotation workload, automatically generated preannotations were provided. Individual annotations were automatically harmonized and then adjudicated, and cross-language consistency checks were carried out to arrive at the final annotations. The number of final annotations was 5530. Inter-annotator agreement scores indicate good agreement (median F-score 0.79), and are similar to those between individual annotators and the gold standard. The automatically generated harmonized annotation set for each language performed equally well as the best annotator for that language. The use of automatic preannotations, harmonized annotations, and parallel corpora helped to keep the manual annotation efforts manageable. The inter-annotator agreement scores provide a reference standard for gauging the performance of automatic annotation techniques. To our knowledge, this is the first gold-standard corpus for biomedical concept recognition in languages other than English. Other distinguishing features are the wide variety of semantic groups that are being covered, and the diversity of text genres that were annotated. © The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association.
[Research on automatic external defibrillator based on DSP].
Jing, Jun; Ding, Jingyan; Zhang, Wei; Hong, Wenxue
2012-10-01
Electrical defibrillation is the most effective way to treat ventricular tachycardia (VT) and ventricular fibrillation (VF). An automatic external defibrillator based on DSP is introduced in this paper. The design consists of a signal collection module, a microprocessor control module, a display module, a defibrillation module, and an automatic recognition algorithm for VF and non-VF rhythms. The automatic external defibrillator achieves real-time ECG signal acquisition, synchronous ECG waveform display, data transfer to a USB flash drive, and automatic defibrillation when a shockable rhythm appears.
ERIC Educational Resources Information Center
Suzuki, Yuichi; DeKeyser, Robert
2017-01-01
Recent research has called for the use of fine-grained measures that distinguish implicit knowledge from automatized explicit knowledge. In the current study, such measures were used to determine how the two systems interact in a naturalistic second language (L2) acquisition context. One hundred advanced L2 speakers of Japanese living in Japan…
ERIC Educational Resources Information Center
Yee, Penny L.
This study investigates the role of specific inhibitory processes in lexical ambiguity resolution. An attentional view of inhibition and a view based on specific automatic inhibition between nodes predict different results when a neutral item is processed between an ambiguous word and a related target. Subjects were 32 English speakers with normal…
A System for Mailpiece ZIP Code Assignment through Contextual Analysis. Phase 2
1991-03-01
Segmentation Address Block Interpretation Automatic Feature Generation Word Recognition Feature Detection Word Verification Optical Character Recognition Directory...in the Phase III effort. 1.1 Motivation The United States Postal Service (USPS) deploys large numbers of optical character recognition (OCR) machines...4):208-218, November 1986. [2] Gronmeyer, L. K., Ruffin, B. W., Lybanon, M. A., Neely, P. L., and Pierce, S. E. An Overview of Optical Character Recognition (OCR
The development of cross-cultural recognition of vocal emotion during childhood and adolescence.
Chronaki, Georgia; Wigelsworth, Michael; Pell, Marc D; Kotz, Sonja A
2018-06-14
Humans have an innate set of emotions recognised universally. However, emotion recognition also depends on socio-cultural rules. Although adults recognise vocal emotions universally, they identify emotions more accurately in their native language. We examined developmental trajectories of universal vocal emotion recognition in children. Eighty native English speakers completed a vocal emotion recognition task in their native language (English) and in foreign languages (Spanish, Chinese, and Arabic) expressing anger, happiness, sadness, fear, and neutrality. Emotion recognition was compared across 8- to 10-year-olds, 11- to 13-year-olds, and adults. Measures of behavioural and emotional problems were also taken. Results showed that although emotion recognition was above chance for all languages, native English-speaking children were more accurate in recognising vocal emotions in their native language. There was a larger improvement in recognising vocal emotion from the native language during adolescence. Vocal anger recognition did not improve with age for the non-native languages. This is the first study to demonstrate the universality of vocal emotion recognition in children whilst supporting an "in-group advantage" for more accurate recognition in the native language. The findings highlight the role of experience in emotion recognition, have implications for child development in modern multicultural societies, and address important theoretical questions about the nature of emotions.
Ben Younes, Lassad; Nakajima, Yoshikazu; Saito, Toki
2014-03-01
Femur segmentation is well established and widely used in computer-assisted orthopedic surgery. However, most of the robust segmentation methods such as statistical shape models (SSM) require human intervention to provide an initial position for the SSM. In this paper, we propose to overcome this problem and provide a fully automatic femur segmentation method for CT images based on primitive shape recognition and SSM. Femur segmentation in CT scans was performed using primitive shape recognition based on a robust algorithm such as the Hough transform and RANdom SAmple Consensus. The proposed method is divided into 3 steps: (1) detection of the femoral head as sphere and the femoral shaft as cylinder in the SSM and the CT images, (2) rigid registration between primitives of SSM and CT image to initialize the SSM into the CT image, and (3) fitting of the SSM to the CT image edge using an affine transformation followed by a nonlinear fitting. The automated method provided good results even with a high number of outliers. The difference of segmentation error between the proposed automatic initialization method and a manual initialization method is less than 1 mm. The proposed method detects primitive shape position to initialize the SSM into the target image. Based on primitive shapes, this method overcomes the problem of inter-patient variability. Moreover, the results demonstrate that our method of primitive shape recognition can be used for 3D SSM initialization to achieve fully automatic segmentation of the femur.
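To make the primitive-shape idea concrete, below is a hedged sketch of detecting the femoral head as a sphere with a RANSAC loop over candidate surface points; the point source, iteration count, and tolerance are assumptions, not the paper's settings.

```python
import numpy as np

def fit_sphere(pts):
    """Least-squares sphere through >= 4 points; returns (center, radius)."""
    A = np.c_[2.0 * pts, np.ones(len(pts))]
    b = (pts ** 2).sum(axis=1)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    center, d = w[:3], w[3]
    return center, np.sqrt(d + center @ center)

def ransac_sphere(pts, n_iter=500, tol=1.0, seed=0):
    """Robust sphere fit to noisy surface points (tolerance in the same unit as pts)."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(pts), dtype=bool)
    for _ in range(n_iter):
        sample = pts[rng.choice(len(pts), 4, replace=False)]
        center, radius = fit_sphere(sample)
        inliers = np.abs(np.linalg.norm(pts - center, axis=1) - radius) < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return fit_sphere(pts[best_inliers])   # refit on the consensus set
```

A cylinder fit along the shaft axis can be handled analogously, and the two detected primitives together supply the rigid transform used to place the SSM in the CT volume.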
Automatic recognition of lactating sow behaviors through depth image processing
USDA-ARS?s Scientific Manuscript database
Manual observation and classification of animal behaviors is laborious, time-consuming, and of limited ability to process large amount of data. A computer vision-based system was developed that automatically recognizes sow behaviors (lying, sitting, standing, kneeling, feeding, drinking, and shiftin...
Weber, Kirsten; Luther, Lisa; Indefrey, Peter; Hagoort, Peter
2016-05-01
When we learn a second language later in life, do we integrate it with the established neural networks in place for the first language or is at least a partially new network recruited? While there is evidence that simple grammatical structures in a second language share a system with the native language, the story becomes more multifaceted for complex sentence structures. In this study, we investigated the underlying brain networks in native speakers compared with proficient second language users while processing complex sentences. As hypothesized, complex structures were processed by the same large-scale inferior frontal and middle temporal language networks of the brain in the second language, as seen in native speakers. These effects were seen both in activations and task-related connectivity patterns. Furthermore, the second language users showed increased task-related connectivity from inferior frontal to inferior parietal regions of the brain, regions related to attention and cognitive control, suggesting less automatic processing for these structures in a second language.
Three-dimensional model-based object recognition and segmentation in cluttered scenes.
Mian, Ajmal S; Bennamoun, Mohammed; Owens, Robyn
2006-10-01
Viewpoint independent recognition of free-form objects and their segmentation in the presence of clutter and occlusions is a challenging task. We present a novel 3D model-based algorithm which performs this task automatically and efficiently. A 3D model of an object is automatically constructed offline from its multiple unordered range images (views). These views are converted into multidimensional table representations (which we refer to as tensors). Correspondences are automatically established between these views by simultaneously matching the tensors of a view with those of the remaining views using a hash table-based voting scheme. This results in a graph of relative transformations used to register the views before they are integrated into a seamless 3D model. These models and their tensor representations constitute the model library. During online recognition, a tensor from the scene is simultaneously matched with those in the library by casting votes. Similarity measures are calculated for the model tensors which receive the most votes. The model with the highest similarity is transformed to the scene and, if it aligns accurately with an object in the scene, that object is declared as recognized and is segmented. This process is repeated until the scene is completely segmented. Experiments were performed on real and synthetic data comprised of 55 models and 610 scenes and an overall recognition rate of 95 percent was achieved. Comparison with the spin images revealed that our algorithm is superior in terms of recognition rate and efficiency.
Seeing a singer helps comprehension of the song's lyrics.
Jesse, Alexandra; Massaro, Dominic W
2010-06-01
When listening to speech, we often benefit when also seeing the speaker's face. If this advantage is not domain specific for speech, the recognition of sung lyrics should also benefit from seeing the singer's face. By independently varying the sight and sound of the lyrics, we found a substantial comprehension benefit of seeing a singer. This benefit was robust across participants, lyrics, and repetition of the test materials. This benefit was much larger than the benefit for sung lyrics obtained in previous research, which had not provided the visual information normally present in singing. Given that the comprehension of sung lyrics benefits from seeing the singer, just like speech comprehension benefits from seeing the speaker, both speech and music perception appear to be multisensory processes.
Affective processing in bilingual speakers: disembodied cognition?
Pavlenko, Aneta
2012-01-01
A recent study by Keysar, Hayakawa, and An (2012) suggests that "thinking in a foreign language" may reduce decision biases because a foreign language provides a greater emotional distance than a native tongue. The possibility of such "disembodied" cognition is of great interest for theories of affect and cognition and for many other areas of psychological theory and practice, from clinical and forensic psychology to marketing, but first this claim needs to be properly evaluated. The purpose of this review is to examine the findings of clinical, introspective, cognitive, psychophysiological, and neuroimaging studies of affective processing in bilingual speakers in order to identify converging patterns of results, to evaluate the claim about "disembodied cognition," and to outline directions for future inquiry. The findings to date reveal two interrelated processing effects. First-language (L1) advantage refers to increased automaticity of affective processing in the L1 and heightened electrodermal reactivity to L1 emotion-laden words. Second-language (L2) advantage refers to decreased automaticity of affective processing in the L2, which reduces interference effects and lowers electrodermal reactivity to negative emotional stimuli. The differences in L1 and L2 affective processing suggest that in some bilingual speakers, in particular late bilinguals and foreign language users, respective languages may be differentially embodied, with the later learned language processed semantically but not affectively. This difference accounts for the reduction of framing biases in L2 processing in the study by Keysar et al. (2012). The follow-up discussion identifies the limits of the findings to date in terms of participant populations, levels of processing, and types of stimuli, puts forth alternative explanations of the documented effects, and articulates predictions to be tested in future research.
The emotional impact of being myself: Emotions and foreign-language processing.
Ivaz, Lela; Costa, Albert; Duñabeitia, Jon Andoni
2016-03-01
Native languages are acquired in emotionally rich contexts, whereas foreign languages are typically acquired in emotionally neutral academic environments. As a consequence of this difference, it has been suggested that bilinguals' emotional reactivity in foreign-language contexts is reduced as compared with native-language contexts. In the current study, we investigated whether this emotional distance associated with foreign languages could modulate automatic responses to self-related linguistic stimuli. Self-related stimuli enhance performance by boosting memory, speed, and accuracy as compared with stimuli unrelated to the self (the so-called self-bias effect). We explored whether this effect depends on the language context by comparing self-biases in a native and a foreign language. Two experiments were conducted in which native Spanish speakers with a high level of English proficiency were asked to complete a perceptual matching task, associating simple geometric shapes (circles, squares, and triangles) with the labels "you," "friend," and "other" either in their native or their foreign language. Results showed a robust asymmetry in the self-bias in the native- and foreign-language contexts: a larger self-bias was found in the native than in the foreign language. An additional control experiment demonstrated that the same materials administered to a group of native English speakers yielded robust self-bias effects comparable in magnitude to the ones obtained with the Spanish speakers when tested in their native language (but not in their foreign language). We suggest that the emotional distance evoked by the foreign-language contexts caused these differential effects across language contexts. These results demonstrate that the foreign-language effects are pervasive enough to affect automatic stages of emotional processing. (c) 2016 APA, all rights reserved.
iFER: facial expression recognition using automatically selected geometric eye and eyebrow features
NASA Astrophysics Data System (ADS)
Oztel, Ismail; Yolcu, Gozde; Oz, Cemil; Kazan, Serap; Bunyak, Filiz
2018-03-01
Facial expressions have an important role in interpersonal communication and the estimation of emotional states or intentions. Automatic recognition of facial expressions has led to many practical applications and has become one of the important topics in computer vision. We present a facial expression recognition system that relies on geometry-based features extracted from the eye and eyebrow regions of the face. The proposed system detects keypoints on frontal face images and forms a feature set using geometric relationships among groups of detected keypoints. The obtained feature set is refined and reduced using the sequential forward selection (SFS) algorithm and fed to a support vector machine classifier to recognize five facial expression classes. The proposed system, iFER (eye-eyebrow only facial expression recognition), is robust to lower-face occlusions that may be caused by beards, mustaches, scarves, etc., and to lower-face motion during speech production. Preliminary experiments on benchmark datasets produced promising results, outperforming previous facial expression recognition studies that use partial face features and yielding results comparable to studies that use whole-face information, only ~2.5% lower than the best whole-face facial expression recognition system while using only ~1/3 of the facial region.
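A minimal sketch of the feature-refinement stage named above, using scikit-learn's sequential forward selection wrapped around an SVM; X_geom and y are placeholders for the extracted geometric features and expression labels, and the kernel, C, and number of selected features are illustrative.

```python
from sklearn.svm import SVC
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

svm = SVC(kernel="rbf", C=10.0)
# Forward selection keeps the subset of geometric features that best supports the SVM.
sfs = SequentialFeatureSelector(svm, n_features_to_select=20, direction="forward", cv=5)
pipeline = make_pipeline(StandardScaler(), sfs, svm)
# pipeline.fit(X_geom, y)            # y holds one of the five expression labels per sample
# predictions = pipeline.predict(X_geom_test)
```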
The Automatic Recognition of the Abnormal Sky-subtraction Spectra Based on Hadoop
NASA Astrophysics Data System (ADS)
An, An; Pan, Jingchang
2017-10-01
Skylines superimpose on the target spectrum as a major source of noise; if the spectrum still contains a large number of high-strength skylight residuals after sky-subtraction processing, subsequent analysis of the target spectrum is hampered. At the same time, LAMOST can observe a large quantity of spectroscopic data every night, so an efficient platform is needed to recognize large numbers of abnormal sky-subtraction spectra quickly. Hadoop, as a distributed parallel data computing platform, can handle large amounts of data effectively. In this paper, we first perform continuum normalization and then present a simple and effective method to automatically recognize abnormal sky-subtraction spectra on the Hadoop platform. Experiments show that the Hadoop platform performs the recognition with greater speed and efficiency, and that the simple method can effectively recognize abnormal sky-subtraction spectra and find abnormal skyline positions of different residual strengths; it can be applied to the automatic detection of abnormal sky-subtraction in large numbers of spectra.
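As an illustration only (not the authors' code), the sketch below flags a spectrum as an abnormal sky-subtraction when strong residuals remain near known night-sky emission lines after continuum normalisation; on Hadoop this check would run inside the map step over many spectra. The wavelength list, polynomial degree, and threshold are assumptions.

```python
import numpy as np

SKYLINES = [5577.3, 5889.9, 6300.3, 6363.8]   # common night-sky emission lines (Angstrom)

def normalize_continuum(wave, flux, deg=5):
    coeffs = np.polyfit(wave, flux, deg)       # crude polynomial continuum fit
    return flux / np.polyval(coeffs, wave)

def abnormal_skylines(wave, flux, window=3.0, threshold=5.0):
    """Return skyline positions whose residual exceeds `threshold` times the global scatter."""
    norm = normalize_continuum(wave, flux)
    sigma = np.std(norm)
    flagged = []
    for line in SKYLINES:
        region = np.abs(wave - line) < window
        if region.any() and np.max(np.abs(norm[region] - 1.0)) > threshold * sigma:
            flagged.append(line)
    return flagged
```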
Automatic detection and recognition of signs from natural scenes.
Chen, Xilin; Yang, Jie; Zhang, Jing; Waibel, Alex
2004-01-01
In this paper, we present an approach to automatic detection and recognition of signs from natural scenes, and its application to a sign translation task. The proposed approach embeds multiresolution and multiscale edge detection, adaptive searching, color analysis, and affine rectification in a hierarchical framework for sign detection, with different emphases at each phase to handle the text in different sizes, orientations, color distributions and backgrounds. We use affine rectification to recover deformation of the text regions caused by an inappropriate camera view angle. The procedure can significantly improve text detection rate and optical character recognition (OCR) accuracy. Instead of using binary information for OCR, we extract features from an intensity image directly. We propose a local intensity normalization method to effectively handle lighting variations, followed by a Gabor transform to obtain local features, and finally a linear discriminant analysis (LDA) method for feature selection. We have applied the approach in developing a Chinese sign translation system, which can automatically detect and recognize Chinese signs as input from a camera, and translate the recognized text into English.
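To make the character-feature pipeline more concrete, here is a hedged sketch of local intensity normalisation followed by Gabor filtering and LDA; the kernel sizes, orientations, and parameters are illustrative, not the values used in the paper.

```python
import cv2
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def normalize_local_intensity(gray, block=31):
    """Remove slowly varying illumination by subtracting and dividing a local mean."""
    g = gray.astype(np.float32)
    mean = cv2.blur(g, (block, block))
    return (g - mean) / (mean + 1e-6)

def gabor_features(patch, thetas=(0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Mean/std of Gabor responses at a few orientations as a simple feature vector."""
    feats = []
    for theta in thetas:
        kernel = cv2.getGaborKernel((15, 15), sigma=3.0, theta=theta, lambd=8.0, gamma=0.5)
        response = cv2.filter2D(patch, cv2.CV_32F, kernel)
        feats.extend([response.mean(), response.std()])
    return np.array(feats)

# lda = LinearDiscriminantAnalysis().fit(features_train, char_labels)
```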
Neural-network classifiers for automatic real-world aerial image recognition
NASA Astrophysics Data System (ADS)
Greenberg, Shlomo; Guterman, Hugo
1996-08-01
We describe the application of the multilayer perceptron (MLP) network and a version of the adaptive resonance theory version 2-A (ART 2-A) network to the problem of automatic aerial image recognition (AAIR). The classification of aerial images, independent of their positions and orientations, is required for automatic tracking and target recognition. Invariance is achieved by the use of different invariant feature spaces in combination with supervised and unsupervised neural networks. The performance of neural-network-based classifiers in conjunction with several types of invariant AAIR global features, such as the Fourier-transform space, Zernike moments, central moments, and polar transforms, are examined. The advantages of this approach are discussed. The performance of the MLP network is compared with that of a classical correlator. The MLP neural-network correlator outperformed the binary phase-only filter (BPOF) correlator. It was found that the ART 2-A distinguished itself with its speed and its low number of required training vectors. However, only the MLP classifier was able to deal with a combination of shift and rotation geometric distortions.
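As one concrete example of an invariant feature space of the kind listed above, the sketch below computes Hu's moment invariants (derived from central moments) and feeds them to a small MLP; it is a generic illustration rather than the study's implementation, which also evaluates Fourier, Zernike, and polar-transform features.

```python
import cv2
import numpy as np
from sklearn.neural_network import MLPClassifier

def hu_features(gray):
    """Seven log-scaled moment invariants, robust to shift, rotation, and scale."""
    hu = cv2.HuMoments(cv2.moments(gray)).flatten()
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)

# clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000)
# clf.fit(np.vstack([hu_features(img) for img in train_images]), train_labels)
```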
NASA Astrophysics Data System (ADS)
Sanger, Demas S.; Haneishi, Hideaki; Miyake, Yoichi
1995-08-01
This paper proposes a simple and automatic method for recognizing the light sources used with various color negative film brands by means of digital image processing. First, we stretch the image obtained from a negative based on standardized scaling factors, then extract the dominant color component among the red, green, and blue components of the stretched image. The dominant color component becomes the discriminator for the recognition. The experimental results verified that any one of the three techniques could recognize the light source from negatives of a single film brand and of all brands with greater than 93.2% and 96.6% correct recognition, respectively. This method is significant for the automation of color quality control in color reproduction from color negative film in mass processing and printing machines.
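A minimal sketch of the idea, assuming an RGB scan of the negative and per-brand scaling factors; the stretch and the decision rule shown here are placeholders for the standardized factors used in the paper.

```python
import numpy as np

def dominant_channel(negative_rgb, scale=(1.0, 1.0, 1.0)):
    """Stretch each channel, then return the dominant component ('R', 'G' or 'B')."""
    stretched = negative_rgb.astype(np.float32) * np.asarray(scale)
    stretched -= stretched.min(axis=(0, 1))
    stretched /= stretched.max(axis=(0, 1)) + 1e-6      # per-channel contrast stretch
    means = stretched.mean(axis=(0, 1))
    return "RGB"[int(np.argmax(means))]
```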
Fashioning the Face: Sensorimotor Simulation Contributes to Facial Expression Recognition.
Wood, Adrienne; Rychlowska, Magdalena; Korb, Sebastian; Niedenthal, Paula
2016-03-01
When we observe a facial expression of emotion, we often mimic it. This automatic mimicry reflects underlying sensorimotor simulation that supports accurate emotion recognition. Why this is so is becoming more obvious: emotions are patterns of expressive, behavioral, physiological, and subjective feeling responses. Activation of one component can therefore automatically activate other components. When people simulate a perceived facial expression, they partially activate the corresponding emotional state in themselves, which provides a basis for inferring the underlying emotion of the expresser. We integrate recent evidence in favor of a role for sensorimotor simulation in emotion recognition. We then connect this account to a domain-general understanding of how sensory information from multiple modalities is integrated to generate perceptual predictions in the brain. Copyright © 2016 Elsevier Ltd. All rights reserved.
Effect of speech-intrinsic variations on human and automatic recognition of spoken phonemes.
Meyer, Bernd T; Brand, Thomas; Kollmeier, Birger
2011-01-01
The aim of this study is to quantify the gap between the recognition performance of human listeners and an automatic speech recognition (ASR) system with special focus on intrinsic variations of speech, such as speaking rate and effort, altered pitch, and the presence of dialect and accent. Second, it is investigated if the most common ASR features contain all information required to recognize speech in noisy environments by using resynthesized ASR features in listening experiments. For the phoneme recognition task, the ASR system achieved the human performance level only when the signal-to-noise ratio (SNR) was increased by 15 dB, which is an estimate for the human-machine gap in terms of the SNR. The major part of this gap is attributed to the feature extraction stage, since human listeners achieve comparable recognition scores when the SNR difference between unaltered and resynthesized utterances is 10 dB. Intrinsic variabilities result in strong increases of error rates, both in human speech recognition (HSR) and ASR (with a relative increase of up to 120%). An analysis of phoneme duration and recognition rates indicates that human listeners are better able to identify temporal cues than the machine at low SNRs, which suggests incorporating information about the temporal dynamics of speech into ASR systems.
Hybrid neuro-fuzzy approach for automatic vehicle license plate recognition
NASA Astrophysics Data System (ADS)
Lee, Hsi-Chieh; Jong, Chung-Shi
1998-03-01
Most currently available vehicle identification systems use techniques such as RF, microwave, or infrared transmission to help identify the vehicle. Transponders are usually installed in the vehicle in order to transmit the corresponding information to the sensory system. It is expensive to install a transponder in each vehicle, and a malfunctioning transponder results in the failure of the vehicle identification system. In this study, a novel hybrid approach is proposed for automatic vehicle license plate recognition. A system prototype was built that can be used independently or in cooperation with a current vehicle identification system to identify a vehicle. The prototype consists of four major modules: a module for license plate region identification, a module for character extraction from the license plate, a module for character recognition, and a module for the SimNet neuro-fuzzy system. To test the performance of the proposed system, three hundred and eighty vehicle image samples were taken with a digital camera. The license plate recognition success rate of the prototype is approximately 91%, while the character recognition success rate is approximately 97%.
Watch what you say, your computer might be listening: A review of automated speech recognition
NASA Technical Reports Server (NTRS)
Degennaro, Stephen V.
1991-01-01
Spoken language is the most convenient and natural means by which people interact with each other and is, therefore, a promising candidate for human-machine interactions. Speech also offers an additional channel for hands-busy applications, complementing the use of motor output channels for control. Current speech recognition systems vary considerably across a number of important characteristics, including vocabulary size, speaking mode, training requirements for new speakers, robustness to acoustic environments, and accuracy. Algorithmically, these systems range from rule-based techniques through more probabilistic or self-learning approaches such as hidden Markov modeling and neural networks. This tutorial begins with a brief summary of the relevant features of current speech recognition systems and the strengths and weaknesses of the various algorithmic approaches.
NASA Astrophysics Data System (ADS)
El Bekri, Nadia; Angele, Susanne; Ruckhäberle, Martin; Peinsipp-Byma, Elisabeth; Haelke, Bruno
2015-10-01
This paper introduces an interactive recognition assistance system for imaging reconnaissance. The system supports aerial image analysts on missions in two main tasks: object recognition and infrastructure analysis. Object recognition concentrates on the classification of a single object. Infrastructure analysis deals with the description of the components of an infrastructure and the recognition of the infrastructure type (e.g. military airfield). Based on satellite or aerial images, aerial image analysts are able to extract single object features and thereby recognize different object types. This is one of the most challenging tasks in imaging reconnaissance. Currently, there are no high-potential ATR (automatic target recognition) applications available; as a consequence, the human observer cannot be replaced entirely. State-of-the-art ATR applications cannot match human perception and interpretation in equal measure. Why is this still such a critical issue? First, cluttered and noisy images make it difficult to automatically extract, classify and identify object types. Second, due to changed warfare and the rise of asymmetric threats, it is nearly impossible to create an underlying data set containing all features, objects or infrastructure types. Many other factors, such as environmental parameters or aspect angles, further complicate the application of ATR. Due to the lack of suitable ATR procedures, the human factor is still important and so far irreplaceable. In order to use the potential benefits of human perception and computational methods in a synergistic way, both are unified in an interactive assistance system. RecceMan® (Reconnaissance Manual) offers two different modes for aerial image analysts on missions: the object recognition mode and the infrastructure analysis mode. The aim of the object recognition mode is to recognize a certain object type based on the object features that originate from the image signatures. The infrastructure analysis mode pursues the goal of analyzing the function of the infrastructure. The image analyst visually extracts certain target object signatures, assigns them to corresponding object features, and is finally able to recognize the object type. The system offers the possibility of assigning the image signatures to features given by sample images. The underlying data set contains a wide range of object features and object types for different domains such as ships or land vehicles. Each domain has its own feature tree developed by expert aerial image analysts. By selecting the corresponding features, the possible solution set of objects is automatically reduced and matches only the objects that contain the selected features. Moreover, we give an outlook on current research in the field of ground target analysis, in which we deal with partly automated methods to extract image signatures and assign them to the corresponding features. This research includes methods for automatically determining the orientation of an object and geometric features such as the width and length of the object. This step makes it possible to automatically reduce the set of possible object types offered to the image analyst by the interactive recognition assistance system.
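For illustration, a toy sketch of the narrowing step described above: each object type is described by a set of features, and selecting image signatures prunes the candidate set to the types that contain all selected features. The library contents here are invented for illustration only.

```python
# Hypothetical feature library; real feature trees are built by image-analysis experts.
object_library = {
    "frigate":        {"superstructure amidships", "helicopter deck", "bow gun"},
    "container ship": {"deck containers", "superstructure aft"},
    "tank":           {"turret", "tracked chassis", "gun barrel"},
}

def candidates(selected_features, library=object_library):
    """Keep only object types whose feature set contains every selected feature."""
    return [name for name, feats in library.items() if selected_features <= feats]

print(candidates({"turret", "tracked chassis"}))   # -> ['tank']
```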
Nirme, Jens; Haake, Magnus; Lyberg Åhlander, Viveka; Brännström, Jonas; Sahlén, Birgitta
2018-04-05
Seeing a speaker's face facilitates speech recognition, particularly under noisy conditions. Evidence for how it might affect comprehension of the content of the speech is more sparse. We investigated how children's listening comprehension is affected by multi-talker babble noise, with or without presentation of a digitally animated virtual speaker, and whether successful comprehension is related to performance on a test of executive functioning. We performed a mixed-design experiment with 55 (34 female) participants (8- to 9-year-olds), recruited from Swedish elementary schools. The children were presented with four different narratives, each in one of four conditions: audio-only presentation in a quiet setting, audio-only presentation in noisy setting, audio-visual presentation in a quiet setting, and audio-visual presentation in a noisy setting. After each narrative, the children answered questions on the content and rated their perceived listening effort. Finally, they performed a test of executive functioning. We found significantly fewer correct answers to explicit content questions after listening in noise. This negative effect was only mitigated to a marginally significant degree by audio-visual presentation. Strong executive function only predicted more correct answers in quiet settings. Altogether, our results are inconclusive regarding how seeing a virtual speaker affects listening comprehension. We discuss how methodological adjustments, including modifications to our virtual speaker, can be used to discriminate between possible explanations to our results and contribute to understanding the listening conditions children face in a typical classroom.
Kharlamov, Viktor; Campbell, Kenneth; Kazanina, Nina
2011-11-01
Speech sounds are not always perceived in accordance with their acoustic-phonetic content. For example, an early and automatic process of perceptual repair, which ensures conformity of speech inputs to the listener's native language phonology, applies to individual input segments that do not exist in the native inventory or to sound sequences that are illicit according to the native phonotactic restrictions on sound co-occurrences. The present study with Russian and Canadian English speakers shows that listeners may perceive phonetically distinct and licit sound sequences as equivalent when the native language system provides robust evidence for mapping multiple phonetic forms onto a single phonological representation. In Russian, due to an optional but productive t-deletion process that affects /stn/ clusters, the surface forms [sn] and [stn] may be phonologically equivalent and map to a single phonological form /stn/. In contrast, [sn] and [stn] clusters are usually phonologically distinct in (Canadian) English. Behavioral data from identification and discrimination tasks indicated that [sn] and [stn] clusters were more confusable for Russian than for English speakers. The EEG experiment employed an oddball paradigm with nonwords [asna] and [astna] used as the standard and deviant stimuli. A reliable mismatch negativity response was elicited approximately 100 msec postchange in the English group but not in the Russian group. These findings point to a perceptual repair mechanism that is engaged automatically at a prelexical level to ensure immediate encoding of speech inputs in phonological terms, which in turn enables efficient access to the meaning of a spoken utterance.
Automatically Detecting Likely Edits in Clinical Notes Created Using Automatic Speech Recognition
Lybarger, Kevin; Ostendorf, Mari; Yetisgen, Meliha
2017-01-01
The use of automatic speech recognition (ASR) to create clinical notes has the potential to reduce the costs associated with note creation for electronic medical records, but at current system accuracy levels, post-editing by practitioners is needed to ensure note quality. Aiming to reduce the time required to edit ASR transcripts, this paper investigates novel methods for automatic detection of edit regions within the transcripts, including both putative ASR errors and regions that are targets for cleanup or rephrasing. We create detection models using logistic regression and conditional random field models, exploring a variety of text-based features that consider the structure of clinical notes and exploit the medical context. Different medical text resources are used to improve feature extraction. Experimental results on a large corpus of practitioner-edited clinical notes show that 67% of sentence-level edits and 45% of word-level edits can be detected with a false detection rate of 15%. PMID:29854187
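In the spirit of the word-level detection models described above, here is a hedged sketch of a logistic-regression detector over simple token features; the feature set and data handling are illustrative and do not reproduce the paper's features or its CRF variant.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def token_features(tokens, i):
    """Simple lexical/context features for the i-th ASR token of a sentence."""
    w = tokens[i]
    return {
        "word": w.lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
        "is_digit": w.isdigit(),
        "length": len(w),
    }

detector = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
# X = [token_features(sent, i) for sent in asr_sentences for i in range(len(sent))]
# y = [...]   # 1 if the token falls inside a practitioner edit, else 0
# detector.fit(X, y)
```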
Neural networks: Alternatives to conventional techniques for automatic docking
NASA Technical Reports Server (NTRS)
Vinz, Bradley L.
1994-01-01
Automatic docking of orbiting spacecraft is a crucial operation involving the identification of vehicle orientation as well as complex approach dynamics. The chaser spacecraft must be able to recognize the target spacecraft within a scene and achieve accurate closing maneuvers. In a video-based system, a target scene must be captured and transformed into a pattern of pixels. Successful recognition lies in the interpretation of this pattern. Due to their powerful pattern recognition capabilities, artificial neural networks offer a potential role in interpretation and automatic docking processes. Neural networks can reduce the computational time required by existing image processing and control software. In addition, neural networks are capable of recognizing and adapting to changes in their dynamic environment, enabling enhanced performance, redundancy, and fault tolerance. Most neural networks are robust to failure, capable of continued operation with a slight degradation in performance after minor failures. This paper discusses the particular automatic docking tasks neural networks can perform as viable alternatives to conventional techniques.
"That Sounds So Cooool": Entanglements of Children, Digital Tools, and Literacy Practices
ERIC Educational Resources Information Center
Toohey, Kelleen; Dagenais, Diane; Fodor, Andreea; Hof, Linda; Nuñez, Omar; Singh, Angelpreet; Schulze, Liz
2015-01-01
Many observers have argued that minority language speakers often have difficulty with school-based literacy and that the poorer school achievement of such learners occurs at least partly as a result of these difficulties. At the same time, many have argued for a recognition of the multiple literacies required for citizens in a 21st century world.…
Pitch-Based Segregation of Reverberant Speech
2005-02-01
speaker recognition in real environments, audio information retrieval and hearing prosthesis. Second, although binaural listening improves the...intelligibility of target speech under anechoic conditions (Bronkhorst, 2000), this binaural advantage is largely eliminated by reverberation (Plomp, 1976...Brown and Cooke, 1994; Wang and Brown, 1999; Hu and Wang, 2004) as well as in binaural separation (e.g., Roman et al., 2003; Palomaki et al., 2004
Facilitating Comprehension of Non-Native English Speakers during Lectures in English with STR-Texts
ERIC Educational Resources Information Center
Shadiev, Rustam; Wu, Ting-Ting; Huang, Yueh-Min
2018-01-01
We provided texts generated by speech-to-text recognition (STR) technology for non-native English-speaking students during lectures in English in order to test whether STR-texts were useful for enhancing students' comprehension of lectures. To this end, we carried out an experiment in which 60 participants were randomly assigned to a control group…
ERIC Educational Resources Information Center
Scott Instruments Corp., Denton, TX.
This project was designed to develop techniques for adding low-cost speech synthesis to educational software. Four tasks were identified for the study: (1) select a microcomputer with a built-in analog-to-digital converter that is currently being used in educational environments; (2) determine the feasibility of implementing expansion and playback…
STS-41 Voice Command System Flight Experiment Report
NASA Technical Reports Server (NTRS)
Salazar, George A.
1981-01-01
This report presents the results of the Voice Command System (VCS) flight experiment on the five-day STS-41 mission. Two mission specialists, Bill Shepherd and Bruce Melnick, used the speaker-dependent system to evaluate the operational effectiveness of using voice to control a spacecraft system. In addition, data were gathered to analyze the effects of microgravity on speech recognition performance.
Crossmodal and Incremental Perception of Audiovisual Cues to Emotional Speech
ERIC Educational Resources Information Center
Barkhuysen, Pashiera; Krahmer, Emiel; Swerts, Marc
2010-01-01
In this article we report on two experiments about the perception of audiovisual cues to emotional speech. The article addresses two questions: (1) how do visual cues from a speaker's face to emotion relate to auditory cues, and (2) what is the recognition speed for various facial cues to emotion? Both experiments reported below are based on tests…
Transitivity, Space, and Hand: The Spatial Grounding of Syntax
ERIC Educational Resources Information Center
Boiteau, Timothy W.; Almor, Amit
2017-01-01
Previous research has linked the concept of number and other ordinal series to space via a spatially oriented mental number line. In addition, it has been shown that in visual scene recognition and production, speakers of a language with a left-to-right orthography respond faster to and tend to draw images in which the agent of an action is…
ERIC Educational Resources Information Center
Sachtleben, Annette; Denny, Heather
2012-01-01
Following the recent interest in the teaching of pragmatics and the recognition of its importance for both cross-cultural communication and new speakers of an additional language, the authors carried out an action research project to evaluate the effectiveness of a new approach to the teaching of pragmatics. This involved the use of semiauthentic…
Automatic recognition of fundamental tissues on histology images of the human cardiovascular system.
Mazo, Claudia; Trujillo, Maria; Alegre, Enrique; Salazar, Liliana
2016-10-01
Cardiovascular disease is the leading cause of death worldwide. Therefore, techniques for improving diagnosis and treatment in this field have become key areas for research. In particular, approaches to tissue image processing may support both education and medical practice. In this paper, an approach to automatic recognition and classification of fundamental tissues using morphological information is presented. Taking a 40× or 10× histological image as input, three clusters are created with the k-means algorithm using a structure tensor and the red and green channels. Loose connective tissue, light regions, and cell nuclei are recognised in 40× images. Then, the cell nuclei's features (shape and spatial projection) and the light regions are used to recognise epithelial cells and classify epithelial tissue into flat, cubic, and cylindrical. In a similar way, light regions, loose connective tissue, and muscle tissue are recognised in 10× images. Finally, the tissue's function and composition are used to refine muscle tissue recognition. Experimental validation was carried out by histologists following expert criteria, along with manually annotated images used as ground truth. The results revealed that the proposed approach classified the fundamental tissues in a similar way to the conventional method employed by histologists. The proposed automatic recognition approach provides, for epithelial tissues, a sensitivity of 0.79 for cubic, 0.85 for cylindrical, and 0.91 for flat. Furthermore, the experts gave our method an average score of 4.85 out of 5 in the recognition of loose connective tissue and 4.82 out of 5 for muscle tissue recognition. Copyright © 2016 Elsevier Ltd. All rights reserved.
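A rough sketch of the clustering step under stated assumptions: the per-pixel feature combines the red and green channels with a simplified structure-tensor response (here the smoothed gradient-energy trace), and three k-means clusters are computed; parameter values are illustrative, not the paper's.

```python
import numpy as np
from scipy import ndimage
from sklearn.cluster import KMeans

def tissue_clusters(rgb, sigma=2.0):
    """Return a 3-cluster label map from red/green channels plus a texture measure."""
    r, g = rgb[..., 0].astype(float), rgb[..., 1].astype(float)
    gx = ndimage.sobel(g, axis=1)
    gy = ndimage.sobel(g, axis=0)
    tensor = ndimage.gaussian_filter(gx * gx + gy * gy, sigma)   # smoothed gradient energy
    features = np.stack([r.ravel(), g.ravel(), tensor.ravel()], axis=1)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
    return labels.reshape(r.shape)   # clusters roughly separating nuclei, light regions, connective tissue
```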
Facilitator control as automatic behavior: A verbal behavior analysis
Hall, Genae A.
1993-01-01
Several studies of facilitated communication have demonstrated that the facilitators were controlling and directing the typing, although they appeared to be unaware of doing so. Such results shift the focus of analysis to the facilitator's behavior and raise questions regarding the controlling variables for that behavior. This paper analyzes facilitator behavior as an instance of automatic verbal behavior, from the perspective of Skinner's (1957) book Verbal Behavior. Verbal behavior is automatic when the speaker or writer is not stimulated by the behavior at the time of emission, the behavior is not edited, the products of behavior differ from what the person would produce normally, and the behavior is attributed to an outside source. All of these characteristics appear to be present in facilitator behavior. Other variables seem to account for the thematic content of the typed messages. These variables also are discussed. PMID:22477083
Terrain type recognition using ERTS-1 MSS images
NASA Technical Reports Server (NTRS)
Gramenopoulos, N.
1973-01-01
For the automatic recognition of earth resources from ERTS-1 digital tapes, both multispectral and spatial pattern recognition techniques are important. Recognition of terrain types is based on spatial signatures that become evident by processing small portions of an image through selected algorithms. An investigation of spatial signatures that are applicable to ERTS-1 MSS images is described. Artifacts in the spatial signatures seem to be related to the multispectral scanner. A method for suppressing such artifacts is presented. Finally, results of terrain type recognition for one ERTS-1 image are presented.
Automatic anatomy recognition via multiobject oriented active shape models.
Chen, Xinjian; Udupa, Jayaram K; Alavi, Abass; Torigian, Drew A
2010-12-01
This paper studies the feasibility of developing an automatic anatomy recognition (AAR) system in clinical radiology and demonstrates its operation on clinical 2D images. The anatomy recognition method described here consists of two main components: (a) a multiobject generalization of the oriented active shape model (OASM) and (b) object recognition strategies. The OASM algorithm is generalized to multiple objects by including a model for each object and assigning a cost structure specific to each object in the spirit of live wire. The delineation of multiobject boundaries is done in MOASM via a three-level dynamic programming algorithm, wherein the first level, at the pixel level, aims to find optimal oriented boundary segments between successive landmarks; the second level, at the landmark level, aims to find optimal locations for the landmarks; and the third level, at the object level, aims to find an optimal arrangement of object boundaries over all objects. The object recognition strategy attempts to find the pose vector (consisting of translation, rotation, and scale components) for the multiobject model that yields the smallest total boundary cost over all objects. The delineation and recognition accuracies were evaluated separately using routine clinical chest CT, abdominal CT, and foot MRI data sets. The delineation accuracy was evaluated in terms of true and false positive volume fractions (TPVF and FPVF). The recognition accuracy was assessed (1) in terms of the size of the space of pose vectors for the model assembly that yielded high delineation accuracy, (2) as a function of the number of objects and the objects' distribution and size in the model, (3) in terms of the interdependence between delineation and recognition, and (4) in terms of the closeness of the optimum recognition result to the global optimum. When multiple objects are included in the model, the delineation accuracy in terms of TPVF can be improved to 97%-98% with a low FPVF of 0.1%-0.2%. Typically, a recognition accuracy of ≥90% yielded a TPVF ≥95% and an FPVF ≤0.5%. Over the three data sets and over all tested objects, in 97% of the cases the optimal solutions found by the proposed method constituted the true global optimum. The experimental results showed the feasibility and efficacy of the proposed automatic anatomy recognition system. Increasing the number of objects in the model can significantly improve both recognition and delineation accuracy. A more spread-out arrangement of objects in the model can lead to improved recognition and delineation accuracy. Including larger objects in the model also improved recognition and delineation. The proposed method almost always finds globally optimum solutions.
1988-12-15
University of California ...readers. Experimental Aging Research, 7, 253-268. McLeod, B., & McLaughlin, B. (1986). Restructuring or automatization? Reading in a second language
NASA Technical Reports Server (NTRS)
1973-01-01
The development, construction, and test of a 100-word vocabulary near real time word recognition system are reported. Included are reasonable replacement of any one or all 100 words in the vocabulary, rapid learning of a new speaker, storage and retrieval of training sets, verbal or manual single word deletion, continuous adaptation with verbal or manual error correction, on-line verification of vocabulary as spoken, system modes selectable via verification display keyboard, relationship of classified word to neighboring word, and a versatile input/output interface to accommodate a variety of applications.
Puzzle test: A tool for non-analytical clinical reasoning assessment.
Monajemi, Alireza; Yaghmaei, Minoo
2016-01-01
Most contemporary clinical reasoning tests typically assess non-automatic thinking. Therefore, a test is needed to measure automatic reasoning or pattern recognition, which has been largely neglected in clinical reasoning tests. The Puzzle Test (PT) is dedicated to assessing automatic clinical reasoning in routine situations. This test was first introduced in 2009 by Monajemi et al. in the Olympiad for Medical Sciences Students. PT is an item format that has gained acceptance in medical education, but no detailed guidelines exist for this test's format, construction, and scoring. In this article, a format is described and the steps to prepare and administer valid and reliable PTs are presented. PT examines a specific clinical reasoning task: pattern recognition. PT does not replace other clinical reasoning assessment tools. However, it complements them in strategies for assessing comprehensive clinical reasoning.
Some effects of stress on users of a voice recognition system: A preliminary inquiry
NASA Astrophysics Data System (ADS)
French, B. A.
1983-03-01
Recent work with Automatic Speech Recognition has focused on applications and productivity considerations in the man-machine interface. This thesis is an attempt to see if placing users of such equipment under time-induced stress has an effect on their percent correct recognition rates. Subjects were given a message-handling task of fixed length and allowed progressively shorter times to attempt to complete it. Questionnaire responses indicate stress levels increased with decreased time-allowance; recognition rates decreased as time was reduced.
Assessing the impact of graphical quality on automatic text recognition in digital maps
NASA Astrophysics Data System (ADS)
Chiang, Yao-Yi; Leyk, Stefan; Honarvar Nazari, Narges; Moghaddam, Sima; Tan, Tian Xiang
2016-08-01
Converting geographic features (e.g., place names) in map images into a vector format is the first step for incorporating cartographic information into a geographic information system (GIS). With the advancement in computational power and algorithm design, map processing systems have been considerably improved over the last decade. However, the fundamental map processing techniques such as color image segmentation, (map) layer separation, and object recognition are sensitive to minor variations in graphical properties of the input image (e.g., scanning resolution). As a result, most map processing results would not meet user expectations if the user does not "properly" scan the map of interest, pre-process the map image (e.g., using compression or not), and train the processing system, accordingly. These issues could slow down the further advancement of map processing techniques as such unsuccessful attempts create a discouraged user community, and less sophisticated tools would be perceived as more viable solutions. Thus, it is important to understand what kinds of maps are suitable for automatic map processing and what types of results and process-related errors can be expected. In this paper, we shed light on these questions by using a typical map processing task, text recognition, to discuss a number of map instances that vary in suitability for automatic processing. We also present an extensive experiment on a diverse set of scanned historical maps to provide measures of baseline performance of a standard text recognition tool under varying map conditions (graphical quality) and text representations (that can vary even within the same map sheet). Our experimental results help the user understand what to expect when a fully or semi-automatic map processing system is used to process a scanned map with certain (varying) graphical properties and complexities in map content.
Human Activity Recognition in AAL Environments Using Random Projections.
Damaševičius, Robertas; Vasiljevas, Mindaugas; Šalkevičius, Justas; Woźniak, Marcin
2016-01-01
Automatic human activity recognition systems aim to capture the state of the user and their environment by exploiting heterogeneous sensors attached to the subject's body, and permit continuous monitoring of numerous physiological signals reflecting the state of human actions. Successful identification of human activities can be immensely useful in healthcare applications for Ambient Assisted Living (AAL), and for automatic and intelligent activity monitoring systems developed for elderly and disabled people. In this paper, we propose a method for activity recognition and subject identification based on random projections from a high-dimensional feature space to a low-dimensional projection space, where the classes are separated using the Jaccard distance between probability density functions of the projected data. Two HAR domain tasks are considered: activity identification and subject identification. Experimental results using the proposed method with the Human Activity Dataset (HAD) are presented.
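As an illustration of the pipeline sketched in the abstract above, the snippet below is a rough sketch that assumes synthetic data in place of real sensor features: features are projected with a random Gaussian matrix, per-class densities of the projected data are estimated with histograms, and classes are compared with a Jaccard-style distance between the discretized densities. The bin counts, projection dimension, and Gaussian projection are assumptions, not the authors' exact choices.

```python
# Illustrative sketch (not the authors' code): random projection followed by a
# Jaccard-style distance between discretized probability density functions.
import numpy as np

rng = np.random.default_rng(0)

def random_projection(X, out_dim=2):
    """Project rows of X (n_samples x n_features) to out_dim dimensions."""
    R = rng.standard_normal((X.shape[1], out_dim)) / np.sqrt(out_dim)
    return X @ R

def density(values, bins):
    hist, _ = np.histogram(values, bins=bins, density=True)
    return hist / (hist.sum() + 1e-12)

def jaccard_distance(p, q):
    """Jaccard distance between two discretized densities."""
    return 1.0 - np.minimum(p, q).sum() / (np.maximum(p, q).sum() + 1e-12)

# Synthetic data standing in for feature vectors of two activity classes.
X_walk = rng.normal(0.0, 1.0, size=(500, 50))
X_sit = rng.normal(0.5, 1.2, size=(500, 50))
bins = np.linspace(-5, 5, 40)
p = density(random_projection(X_walk)[:, 0], bins)
q = density(random_projection(X_sit)[:, 0], bins)
print(jaccard_distance(p, q))
```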
Integrated approach for automatic target recognition using a network of collaborative sensors.
Mahalanobis, Abhijit; Van Nevel, Alan
2006-10-01
We introduce what is believed to be a novel concept by which several sensors with automatic target recognition (ATR) capability collaborate to recognize objects. Such an approach would be suitable for netted systems in which the sensors and platforms can coordinate to optimize end-to-end performance. We use correlation filtering techniques to facilitate the development of the concept, although other ATR algorithms may be easily substituted. Essentially, a self-configuring geometry of netted platforms is proposed that positions the sensors optimally with respect to each other, and takes into account the interactions among the sensor, the recognition algorithms, and the classes of the objects to be recognized. We show how such a paradigm optimizes overall performance, and illustrate the collaborative ATR scheme for recognizing targets in synthetic aperture radar imagery by using viewing position as a sensor parameter.
A beat-to-beat calculator for the diastolic pressure time index and the tension time index.
Nose, Y; Tajimi, T; Watanabe, Y; Yokota, M; Akazawa, K; Nakamura, M
1987-01-01
We have developed a beat-to-beat calculator which can calculate in real time the ratio of the diastolic pressure time index (DPTI) to the tension time index (TTI) as an index of the myocardial oxygen supply/demand balance. Physicians set a presumed value for the left ventricular end-diastolic pressure, a search area for the dicrotic notch, a threshold for the onset of the up-slope, and the corresponding value of the calibration signal on the digital switches of the calculator. Next, the arterial pressure analog signal is input into the calculator. The calculator searches automatically for both the onset of the up-slope and the dicrotic notch. The arterial pressure curve is displayed beat by beat on the CRT, with the recognized onset and dicrotic notch, to be confirmed by physicians. When physicians do not agree with the automatic recognition, they can adjust it to match their observation: if the recognition of the onset is inadequate, the threshold can be re-adjusted to trigger the onset; if recognition of the dicrotic notch is inadequate, the physician can adjust the search area. Therefore, physicians who operate the calculator can rely on the calculated DPTI/TTI. This calculator can continuously monitor the myocardial oxygen supply/demand balance in patients with acute myocardial infarction or just after open-heart surgery.
Electrophysiological Evidence of Automatic Early Semantic Processing
ERIC Educational Resources Information Center
Hinojosa, Jose A.; Martin-Loeches, Manuel; Munoz, Francisco; Casado, Pilar; Pozo, Miguel A.
2004-01-01
This study investigates the automatic-controlled nature of early semantic processing by means of the Recognition Potential (RP), an event-related potential response that reflects lexical selection processes. For this purpose tasks differing in their processing requirements were used. Half of the participants performed a physical task involving a…
Attempting to "Increase Intake from the Input": Attention and Word Learning in Children with Autism.
Tenenbaum, Elena J; Amso, Dima; Righi, Giulia; Sheinkopf, Stephen J
2017-06-01
Previous work has demonstrated that social attention is related to early language abilities. We explored whether we can facilitate word learning among children with autism by directing attention to areas of the scene that have been demonstrated as relevant for successful word learning. We tracked eye movements to faces and objects while children watched videos of a woman teaching them new words. Test trials measured participants' recognition of these novel word-object pairings. Results indicate that for children with autism and typically developing children, pointing to the speaker's mouth while labeling a novel object impaired performance, likely because it distracted participants from the target object. In contrast, for children with autism, holding the object close to the speaker's mouth improved performance.
Material recognition based on thermal cues: Mechanisms and applications.
Ho, Hsin-Ni
2018-01-01
Some materials feel colder to the touch than others, and we can use this difference in perceived coldness for material recognition. This review focuses on the mechanisms underlying material recognition based on thermal cues. It provides an overview of the physical, perceptual, and cognitive processes involved in material recognition. It also describes engineering domains in which material recognition based on thermal cues has been applied. These include haptic interfaces that seek to reproduce the sensations associated with contact in virtual environments, and tactile sensors that aim at automatic material recognition. The review concludes by considering the contributions of this line of research in both science and engineering.
Automatically Log Off Upon Disappearance of Facial Image
2005-03-01
This report describes a capability to automatically log off a PC when the user's face disappears for an adjustable time interval. A brief overview of face detection technologies is provided, with attention to a particular neural network-based face recognition approach, to help ensure that the user logging onto the system is the same person.
Signal recognition and parameter estimation of BPSK-LFM combined modulation
NASA Astrophysics Data System (ADS)
Long, Chao; Zhang, Lin; Liu, Yu
2015-07-01
Intra-pulse analysis plays an important role in electronic warfare. Intra-pulse feature abstraction focuses on primary parameters such as instantaneous frequency, modulation, and symbol rate. In this paper, automatic modulation recognition and feature extraction for combined BPSK-LFM modulation signals based on a decision-theoretic approach are studied. The simulation results show a good recognition effect and high estimation precision, and the system is easy to realize.
NASA Astrophysics Data System (ADS)
Kuznetsov, Michael V.
2006-05-01
For reliable interworking of various automatic telecommunication systems, including the transmission systems of optical communication networks, dependable recognition of one- or two-frequency service signalling is necessary. Analysis of the time parameters of a received signal makes it possible to increase the reliability of detection and recognition of the service signalling system against a background of speech.
ERIC Educational Resources Information Center
Stinson, Michael; Elliot, Lisa; McKee, Barbara; Coyne, Gina
This report discusses a project that adapted new automatic speech recognition (ASR) technology to provide real-time speech-to-text transcription as a support service for students who are deaf and hard of hearing (D/HH). In this system, as the teacher speaks, a hearing intermediary, or captionist, dictates into the speech recognition system in a…
Chun, Hong-Woo; Tsuruoka, Yoshimasa; Kim, Jin-Dong; Shiba, Rie; Nagata, Naoki; Hishiki, Teruyoshi; Tsujii, Jun'ichi
2006-01-01
Background: Automatic recognition of relations between a specific disease term and its relevant genes or protein terms is an important practice in bioinformatics. Considering the utility of the results of this approach, we identified prostate cancer and gene terms with the ID tags of public biomedical databases. Moreover, considering that genetics experts will use our results, we classified them based on six topics that can be used to analyze the type of prostate cancers, genes, and their relations. Methods: We developed a maximum entropy-based named entity recognizer and a relation recognizer and applied them to a corpus-based approach. We collected prostate cancer-related abstracts from MEDLINE, and constructed an annotated corpus of gene and prostate cancer relations based on six topics by biologists. We used it to train the maximum entropy-based named entity recognizer and relation recognizer. Results: Topic-classified relation recognition achieved 92.1% precision for the relation (an increase of 11.0% from that obtained in a baseline experiment). For all topics, the precision was between 67.6 and 88.1%. Conclusion: A series of experimental results revealed two important findings: a carefully designed relation recognition system using named entity recognition can improve the performance of relation recognition, and topic-classified relation recognition can be effectively addressed through a corpus-based approach using manual annotation and machine learning techniques. PMID:17134477
A model of traffic signs recognition with convolutional neural network
NASA Astrophysics Data System (ADS)
Hu, Haihe; Li, Yujian; Zhang, Ting; Huo, Yi; Kuang, Wenqing
2016-10-01
In real traffic scenes, the quality of captured images is generally low due to factors such as lighting conditions and occlusion. All of these factors make automated recognition of traffic signs challenging. Deep learning has recently provided a new way to solve this kind of problem. A deep network can automatically learn features from a large number of data samples and obtain excellent recognition performance. We therefore approach the task of recognizing traffic signs as a general vision problem, with few assumptions related to road signs. We propose a Convolutional Neural Network (CNN) model and apply it to the task of traffic sign recognition. The proposed model adopts a deep CNN as the supervised learning model, directly takes the collected traffic sign images as input, alternates convolutional and subsampling layers, and automatically extracts the features for recognizing the traffic sign images. The proposed model includes an input layer, three convolutional layers, three subsampling layers, a fully connected layer, and an output layer. To validate the proposed model, experiments were carried out on the public dataset of the China competition of fuzzy image processing. Experimental results show that the proposed model produces a recognition accuracy of 99.01% on the training dataset and 92% in the preliminary contest, placing within the top four.
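The layer counts in the abstract (three convolutional layers, three subsampling layers, one fully connected layer, and an output layer) can be illustrated with a minimal PyTorch sketch. The filter sizes, channel counts, 48x48 input resolution, and 43 output classes are assumptions for illustration and are not taken from the paper.

```python
# Minimal sketch of a CNN with three conv + subsampling stages, a fully
# connected layer, and an output layer; hyperparameters are assumptions.
import torch
import torch.nn as nn

class TrafficSignCNN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 6 * 6, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):          # x: (batch, 3, 48, 48)
        return self.classifier(self.features(x))

model = TrafficSignCNN(num_classes=43)
print(model(torch.randn(1, 3, 48, 48)).shape)  # torch.Size([1, 43])
```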
Automatic anatomy recognition on CT images with pathology
NASA Astrophysics Data System (ADS)
Huang, Lidong; Udupa, Jayaram K.; Tong, Yubing; Odhner, Dewey; Torigian, Drew A.
2016-03-01
Body-wide anatomy recognition on CT images with pathology becomes crucial for quantifying body-wide disease burden. This, however, is a challenging problem because various diseases result in various abnormalities of objects such as shape and intensity patterns. We previously developed an automatic anatomy recognition (AAR) system [1] whose applicability was demonstrated on near normal diagnostic CT images in different body regions on 35 organs. The aim of this paper is to investigate strategies for adapting the previous AAR system to diagnostic CT images of patients with various pathologies as a first step toward automated body-wide disease quantification. The AAR approach consists of three main steps - model building, object recognition, and object delineation. In this paper, within the broader AAR framework, we describe a new strategy for object recognition to handle abnormal images. In the model building stage an optimal threshold interval is learned from near-normal training images for each object. This threshold is optimally tuned to the pathological manifestation of the object in the test image. Recognition is performed following a hierarchical representation of the objects. Experimental results for the abdominal body region based on 50 near-normal images used for model building and 20 abnormal images used for object recognition show that object localization accuracy within 2 voxels for liver and spleen and 3 voxels for kidney can be achieved with the new strategy.
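The idea of learning a per-object intensity threshold interval from near-normal training images can be illustrated with the hypothetical sketch below; the percentile-based interval is an assumed stand-in for the optimal-threshold learning actually used in the AAR system, and the data are synthetic.

```python
# Hypothetical illustration: learn a per-object intensity interval from
# training images with object masks, then threshold a test image with it.
import numpy as np

def learn_threshold_interval(training_images, training_masks, lo_pct=2, hi_pct=98):
    """Collect intensities inside the object's mask and keep a central interval."""
    values = np.concatenate([img[mask > 0].ravel()
                             for img, mask in zip(training_images, training_masks)])
    return np.percentile(values, lo_pct), np.percentile(values, hi_pct)

def threshold_object(test_image, interval):
    lo, hi = interval
    return (test_image >= lo) & (test_image <= hi)

rng = np.random.default_rng(0)
imgs = [rng.normal(100, 10, (64, 64)) for _ in range(5)]   # synthetic "organs"
masks = [np.ones((64, 64)) for _ in range(5)]
print(learn_threshold_interval(imgs, masks))
```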
U.S. Air Forces Escape and Evasion Society Recognition Act of 2014
Rep. Tsongas, Niki [D-MA-3
2014-05-20
House - 05/20/2014 Referred to the Committee on Financial Services, and in addition to the Committee on House Administration, for a period to be subsequently determined by the Speaker, in each case for consideration of such provisions as fall within the jurisdiction of the committee...
ERIC Educational Resources Information Center
Segal, Osnat; Kishon-Rabin, Liat
2017-01-01
Purpose: The stressed word in a sentence (narrow focus [NF]) conveys information about the intent of the speaker and is therefore important for processing spoken language and in social interactions. The ability of participants with severe-to-profound prelingual hearing loss to comprehend NF has rarely been investigated. The purpose of this study…
Spanish as the Second National Language of the United States: Fact, Future, Fiction, or Hope?
ERIC Educational Resources Information Center
Macías, Reynaldo F.
2014-01-01
The status of a language is very often described and measured by different factors, including the length of time it has been in use in a particular territory, the official recognition it has been given by governmental units, and the number and proportion of speakers. Spanish has a unique history and, some argue, a unique status in the contemporary…
Cross-cultural emotional prosody recognition: evidence from Chinese and British listeners.
Paulmann, Silke; Uskul, Ayse K
2014-01-01
This cross-cultural study of emotional tone of voice recognition tests the in-group advantage hypothesis (Elfenbein & Ambady, 2002) employing a quasi-balanced design. Individuals of Chinese and British background were asked to recognise pseudosentences produced by Chinese and British native speakers, displaying one of seven emotions (anger, disgust, fear, happy, neutral tone of voice, sad, and surprise). Findings reveal that emotional displays were recognised at rates higher than predicted by chance; however, members of each cultural group were more accurate in recognising the displays communicated by a member of their own cultural group than a member of the other cultural group. Moreover, the evaluation of error matrices indicates that both culture groups relied on similar mechanisms when recognising emotional displays from the voice. Overall, the study reveals evidence for both universal and culture-specific principles in vocal emotion recognition.
Audiovisual speech facilitates voice learning.
Sheffert, Sonya M; Olson, Elizabeth
2004-02-01
In this research, we investigated the effects of voice and face information on the perceptual learning of talkers and on long-term memory for spoken words. In the first phase, listeners were trained over several days to identify voices from words presented auditorily or audiovisually. The training data showed that visual information about speakers enhanced voice learning, revealing cross-modal connections in talker processing akin to those observed in speech processing. In the second phase, the listeners completed an auditory or audiovisual word recognition memory test in which equal numbers of words were spoken by familiar and unfamiliar talkers. The data showed that words presented by familiar talkers were more likely to be retrieved from episodic memory, regardless of modality. Together, these findings provide new information about the representational code underlying familiar talker recognition and the role of stimulus familiarity in episodic word recognition.
ERIC Educational Resources Information Center
Richler, Jennifer J.; Gauthier, Isabel; Palmeri, Thomas J.
2011-01-01
Are there consequences of calling objects by their names? Lupyan (2008) suggested that overtly labeling objects impairs subsequent recognition memory because labeling shifts stored memory representations of objects toward the category prototype (representational shift hypothesis). In Experiment 1, we show that processing objects at the basic…
Variogram-based feature extraction for neural network recognition of logos
NASA Astrophysics Data System (ADS)
Pham, Tuan D.
2003-03-01
This paper presents a new approach for extracting spatial features of images based on the theory of regionalized variables. These features can be effectively used for automatic recognition of logo images using neural networks. Experimental results on a public-domain logo database show the effectiveness of the proposed approach.
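The variogram-based features described above can be illustrated with a short sketch: the empirical semivariance of pixel intensities at a handful of lags forms a small spatial feature vector that could be fed to a neural network classifier. The specific lags and the purely horizontal lag direction are simplifying assumptions.

```python
# Sketch of an empirical variogram on a grayscale logo image (numpy array);
# the semivariances at several lags serve as a spatial feature vector.
import numpy as np

def semivariance(image, lag):
    """0.5 * mean squared difference between pixels separated by `lag` columns."""
    img = np.asarray(image, dtype=float)
    diffs = img[:, lag:] - img[:, :-lag]
    return 0.5 * np.mean(diffs ** 2)

def variogram_features(image, lags=(1, 2, 4, 8, 16)):
    return np.array([semivariance(image, h) for h in lags])

rng = np.random.default_rng(0)
logo = rng.integers(0, 256, size=(64, 64))       # stand-in for a logo image
print(variogram_features(logo))
```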
Separating Speed from Accuracy in Beginning Reading Development
ERIC Educational Resources Information Center
Juul, Holger; Poulsen, Mads; Elbro, Carsten
2014-01-01
Phoneme awareness, letter knowledge, and rapid automatized naming (RAN) are well-known kindergarten predictors of later word recognition skills, but it is not clear whether they predict developments in accuracy or speed, or both. The present longitudinal study of 172 Danish beginning readers found that speed of word recognition mainly developed…
Working memory capacity may influence perceived effort during aided speech recognition in noise.
Rudner, Mary; Lunner, Thomas; Behrens, Thomas; Thorén, Elisabet Sundewall; Rönnberg, Jerker
2012-09-01
Recently there has been interest in using subjective ratings as a measure of perceived effort during speech recognition in noise. Perceived effort may be an indicator of cognitive load. Thus, subjective effort ratings during speech recognition in noise may covary both with signal-to-noise ratio (SNR) and individual cognitive capacity. The present study investigated the relation between subjective ratings of the effort involved in listening to speech in noise, speech recognition performance, and individual working memory (WM) capacity in hearing impaired hearing aid users. In two experiments, participants with hearing loss rated perceived effort during aided speech perception in noise. Noise type and SNR were manipulated in both experiments, and in the second experiment hearing aid compression release settings were also manipulated. Speech recognition performance was measured along with WM capacity. There were 46 participants in all with bilateral mild to moderate sloping hearing loss. In Experiment 1 there were 16 native Danish speakers (eight women and eight men) with a mean age of 63.5 yr (SD = 12.1) and average pure tone (PT) threshold of 47.6 dB (SD = 9.8). In Experiment 2 there were 30 native Swedish speakers (19 women and 11 men) with a mean age of 70 yr (SD = 7.8) and average PT threshold of 45.8 dB (SD = 6.6). A visual analog scale (VAS) was used for effort rating in both experiments. In Experiment 1, effort was rated at individually adapted SNRs while in Experiment 2 it was rated at fixed SNRs. Speech recognition in noise performance was measured using adaptive procedures in both experiments with Dantale II sentences in Experiment 1 and Hagerman sentences in Experiment 2. WM capacity was measured using a letter-monitoring task in Experiment 1 and the reading span task in Experiment 2. In both experiments, there was a strong and significant relation between rated effort and SNR that was independent of individual WM capacity, whereas the relation between rated effort and noise type seemed to be influenced by individual WM capacity. Experiment 2 showed that hearing aid compression setting influenced rated effort. Subjective ratings of the effort involved in speech recognition in noise reflect SNRs, and individual cognitive capacity seems to influence relative rating of noise type. American Academy of Audiology.
Recognizing speech in a novel accent: the motor theory of speech perception reframed.
Moulin-Frier, Clément; Arbib, Michael A
2013-08-01
The motor theory of speech perception holds that we perceive the speech of another in terms of a motor representation of that speech. However, when we have learned to recognize a foreign accent, it seems plausible that recognition of a word rarely involves reconstruction of the speech gestures of the speaker rather than the listener. To better assess the motor theory and this observation, we proceed in three stages. Part 1 places the motor theory of speech perception in a larger framework based on our earlier models of the adaptive formation of mirror neurons for grasping, and for viewing extensions of that mirror system as part of a larger system for neuro-linguistic processing, augmented by the present consideration of recognizing speech in a novel accent. Part 2 then offers a novel computational model of how a listener comes to understand the speech of someone speaking the listener's native language with a foreign accent. The core tenet of the model is that the listener uses hypotheses about the word the speaker is currently uttering to update probabilities linking the sound produced by the speaker to phonemes in the native language repertoire of the listener. This, on average, improves the recognition of later words. This model is neutral regarding the nature of the representations it uses (motor vs. auditory). It serves as a reference point for the discussion in Part 3, which proposes a dual-stream neuro-linguistic architecture to revisit claims for and against the motor theory of speech perception and the relevance of mirror neurons, and extracts some implications for the reframing of the motor theory.
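The core tenet described in Part 2 can be caricatured with a small count-based sketch: hypothesized words supply phoneme labels for the sounds just heard, and those pairings update the probabilities linking the speaker's sounds to the listener's native phonemes. The discrete "sound" symbols, the add-one prior, and the toy example are assumptions; the published model is more elaborate.

```python
# Simplified, hypothetical sketch of updating sound-to-phoneme probabilities
# from word hypotheses; not the authors' computational model.
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(lambda: 1.0))  # counts[sound][phoneme]

def update_from_word_hypothesis(observed_sounds, hypothesized_phonemes):
    """Align observed sounds with the phonemes of the hypothesized word."""
    for sound, phoneme in zip(observed_sounds, hypothesized_phonemes):
        counts[sound][phoneme] += 1.0

def p_phoneme_given_sound(sound, phoneme):
    numerator = counts[sound][phoneme]
    total = sum(counts[sound].values())
    return numerator / total

# Prior: the accented sound "d~" is initially mapped mostly to native /d/.
counts["d~"]["d"] = 8.0
counts["d~"]["th"] = 2.0
# Recognized words repeatedly imply the sound was actually /th/.
for _ in range(5):
    update_from_word_hypothesis(["d~"], ["th"])
print(p_phoneme_given_sound("d~", "th"))  # rises from 0.2 toward ~0.47
```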
Model-based vision using geometric hashing
NASA Astrophysics Data System (ADS)
Akerman, Alexander, III; Patton, Ronald
1991-04-01
The Geometric Hashing technique developed by the NYU Courant Institute has been applied to various automatic target recognition applications. In particular, I-MATH has extended the hashing algorithm to perform automatic target recognition of synthetic aperture radar (SAR) imagery. For this application, the hashing is performed upon the geometric locations of dominant scatterers. In addition to being a robust model-based matching algorithm -- invariant under translation, scale, and 3D rotations of the target -- hashing is of particular utility because it can still perform effective matching when the target is partially obscured. Moreover, hashing is very amenable to a SIMD parallel processing architecture, and thus potentially implementable in real time.
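A compact sketch of the basic geometric hashing idea on 2D scatterer locations is given below: each ordered point pair of a model defines a basis, the remaining points are stored in a hash table in basis-relative coordinates, and at recognition time a scene basis votes for models whose entries match. The quantization step and the use of a single scene basis are simplifications, and this is not the I-MATH implementation.

```python
# Simplified geometric hashing on 2D point sets (illustrative only).
import numpy as np
from collections import defaultdict

def to_basis(p, b0, b1):
    """Express point p in the frame defined by basis points b0, b1."""
    e1 = b1 - b0
    e2 = np.array([-e1[1], e1[0]])          # perpendicular vector
    scale = np.dot(e1, e1) + 1e-12
    d = p - b0
    return np.dot(d, e1) / scale, np.dot(d, e2) / scale

def quantize(u, v, q=0.1):
    return round(u / q), round(v / q)

def build_table(models):
    """models: dict name -> (N, 2) array of points."""
    table = defaultdict(list)
    for name, pts in models.items():
        for i in range(len(pts)):
            for j in range(len(pts)):
                if i == j:
                    continue
                for k, p in enumerate(pts):
                    if k in (i, j):
                        continue
                    table[quantize(*to_basis(p, pts[i], pts[j]))].append(name)
    return table

def recognize(scene_pts, table):
    votes = defaultdict(int)
    b0, b1 = scene_pts[0], scene_pts[1]     # try one scene basis for brevity
    for p in scene_pts[2:]:
        for name in table.get(quantize(*to_basis(p, b0, b1)), []):
            votes[name] += 1
    return max(votes, key=votes.get) if votes else None

models = {"A": np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]]),
          "B": np.array([[0., 0.], [2., 0.], [1., 2.], [3., 1.]])}
table = build_table(models)
print(recognize(models["A"] + 0.01, table))   # expected to vote for "A"
```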
Probst, Yasmine; Nguyen, Duc Thanh; Tran, Minh Khoi; Li, Wanqing
2015-07-27
Dietary assessment, while traditionally based on pen-and-paper, is rapidly moving towards automatic approaches. This study describes an Australian automatic food record method and its prototype for dietary assessment via the use of a mobile phone and techniques of image processing and pattern recognition. Common visual features including scale invariant feature transformation (SIFT), local binary patterns (LBP), and colour are used for describing food images. The popular bag-of-words (BoW) model is employed for recognizing the images taken by a mobile phone for dietary assessment. Technical details are provided together with discussions on the issues and future work.
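The bag-of-words (BoW) model mentioned above can be sketched as follows, with random vectors standing in for SIFT/LBP/colour descriptors: local descriptors are clustered into a visual vocabulary, each image becomes a histogram of visual-word assignments, and a standard classifier is trained on the histograms. The descriptor dimensionality, vocabulary size, and classifier choice are assumptions.

```python
# Schematic bag-of-words pipeline; synthetic descriptors stand in for
# SIFT/LBP/colour features extracted from food images.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def bow_histogram(descriptors, codebook):
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-12)

# Stand-in data: per-image descriptor sets for two food classes.
train_descs = [rng.normal(c, 1.0, size=(200, 64)) for c in (0.0, 1.0) for _ in range(10)]
labels = [0] * 10 + [1] * 10

codebook = KMeans(n_clusters=32, n_init=10, random_state=0).fit(np.vstack(train_descs))
X = np.array([bow_histogram(d, codebook) for d in train_descs])
clf = LinearSVC().fit(X, labels)
print(clf.score(X, labels))
```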
Automatic Speech Recognition in Air Traffic Control: a Human Factors Perspective
NASA Technical Reports Server (NTRS)
Karlsson, Joakim
1990-01-01
The introduction of Automatic Speech Recognition (ASR) technology into the Air Traffic Control (ATC) system has the potential to improve overall safety and efficiency. However, because ASR technology is inherently a part of the man-machine interface between the user and the system, the human factors issues involved must be addressed. Here, some of the human factors problems are identified and related methods of investigation are presented. Research at M.I.T.'s Flight Transportation Laboratory is being conducted from a human factors perspective, focusing on intelligent parser design, presentation of feedback, error correction strategy design, and optimal choice of input modalities.
Rep. Connolly, Gerald E. [D-VA-11
2013-02-13
House - 02/13/2013 Referred to the Committee on House Administration, and in addition to the Committee on Oversight and Government Reform, for a period to be subsequently determined by the Speaker, in each case for consideration of such provisions as fall within the jurisdiction of...
Landwehr, Markus; Fürstenberg, Dirk; Walger, Martin; von Wedel, Hasso; Meister, Hartmut
2014-01-01
Advances in speech coding strategies and electrode array designs for cochlear implants (CIs) predominantly aim at improving speech perception. Current efforts are also directed at transmitting appropriate cues of the fundamental frequency (F0) to the auditory nerve with respect to speech quality, prosody, and music perception. The aim of this study was to examine the effects of various electrode configurations and coding strategies on speech intonation identification, speaker gender identification, and music quality rating. In six MED-EL CI users electrodes were selectively deactivated in order to simulate different insertion depths and inter-electrode distances when using the high definition continuous interleaved sampling (HDCIS) and fine structure processing (FSP) speech coding strategies. Identification of intonation and speaker gender was determined and music quality rating was assessed. For intonation identification HDCIS was robust against the different electrode configurations, whereas fine structure processing showed significantly worse results when a short electrode depth was simulated. In contrast, speaker gender recognition was not affected by electrode configuration or speech coding strategy. Music quality rating was sensitive to electrode configuration. In conclusion, the three experiments revealed different outcomes, even though they all addressed the reception of F0 cues. Rapid changes in F0, as seen with intonation, were the most sensitive to electrode configurations and coding strategies. In contrast, electrode configurations and coding strategies did not show large effects when F0 information was available over a longer time period, as seen with speaker gender. Music quality relies on additional spectral cues other than F0, and was poorest when a shallow insertion was simulated.
Calandruccio, Lauren; Zhou, Haibo
2014-01-01
Purpose: To examine whether improved speech recognition during linguistically mismatched target–masker experiments is due to linguistic unfamiliarity of the masker speech or linguistic dissimilarity between the target and masker speech. Method: Monolingual English speakers (n = 20) and English–Greek simultaneous bilinguals (n = 20) listened to English sentences in the presence of competing English and Greek speech. Data were analyzed using mixed-effects regression models to determine differences in English recognition performance between the 2 groups and 2 masker conditions. Results: Results indicated that English sentence recognition for monolinguals and simultaneous English–Greek bilinguals improved when the masker speech changed from competing English to competing Greek speech. Conclusion: The improvement in speech recognition that has been observed for linguistically mismatched target–masker experiments cannot be simply explained by the masker language being linguistically unknown or unfamiliar to the listeners. Listeners can improve their speech recognition in linguistically mismatched target–masker experiments even when the listener is able to obtain meaningful linguistic information from the masker speech. PMID:24167230
Contour matching for a fish recognition and migration-monitoring system
NASA Astrophysics Data System (ADS)
Lee, Dah-Jye; Schoenberger, Robert B.; Shiozawa, Dennis; Xu, Xiaoqian; Zhan, Pengcheng
2004-12-01
Fish migration is being monitored year round to provide valuable information for the study of behavioral responses of fish to environmental variations. However, currently all monitoring is done by human observers. An automatic fish recognition and migration monitoring system is more efficient and can provide more accurate data. Such a system includes automatic fish image acquisition, contour extraction, fish categorization, and data storage. Shape is a very important characteristic and shape analysis and shape matching are studied for fish recognition. Previous work focused on finding critical landmark points on fish shape using curvature function analysis. Fish recognition based on landmark points has shown satisfying results. However, the main difficulty of this approach is that landmark points sometimes cannot be located very accurately. Whole shape matching is used for fish recognition in this paper. Several shape descriptors, such as Fourier descriptors, polygon approximation and line segments, are tested. A power cepstrum technique has been developed in order to improve the categorization speed using contours represented in tangent space with normalized length. Design and integration including image acquisition, contour extraction and fish categorization are discussed in this paper. Fish categorization results based on shape analysis and shape matching are also included.
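One of the whole-shape descriptors tested above, Fourier descriptors, can be sketched briefly: the closed contour is treated as a complex sequence, and low-frequency Fourier coefficient magnitudes, normalized for translation, scale, and rotation, are compared between shapes. The number of coefficients kept is an assumption.

```python
# Sketch of whole-shape matching with Fourier descriptors (illustrative).
import numpy as np

def fourier_descriptors(contour_xy, n_coeffs=16):
    """contour_xy: (N, 2) array of boundary points of a closed contour."""
    z = contour_xy[:, 0] + 1j * contour_xy[:, 1]
    coeffs = np.fft.fft(z)
    coeffs[0] = 0.0                                  # drop DC term: translation invariance
    coeffs = coeffs / (np.abs(coeffs[1]) + 1e-12)    # scale invariance
    return np.abs(coeffs[1:n_coeffs + 1])            # magnitudes: rotation/start invariance

def shape_distance(contour_a, contour_b):
    return np.linalg.norm(fourier_descriptors(contour_a) - fourier_descriptors(contour_b))

t = np.linspace(0, 2 * np.pi, 128, endpoint=False)
circle = np.column_stack([np.cos(t), np.sin(t)])
print(shape_distance(circle, 3.0 * circle))          # ~0: scale-invariant
```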
Howell, Peter; Sackin, Stevie; Glenn, Kazan
2007-01-01
This program of work is intended to develop automatic recognition procedures to locate and assess stuttered dysfluencies. This article and the following one together develop and test recognizers for repetitions and prolongations. The automatic recognizers classify the speech in two stages: in the first, the speech is segmented, and in the second, the segments are categorized. The units that are segmented are words. Here, assessments by human judges on the speech of 12 children who stutter are described using a corresponding procedure. The accuracy of word boundary placement across judges, the categorization of the words as fluent, repetition, or prolongation, and the duration of the different fluency categories are reported. These measures allow reliable instances of repetitions and prolongations to be selected for training and assessing the recognizers in the subsequent paper. PMID:9328878
Fine grained recognition of masonry walls for built heritage assessment
NASA Astrophysics Data System (ADS)
Oses, N.; Dornaika, F.; Moujahid, A.
2015-01-01
This paper presents the ground work carried out to achieve automatic fine grained recognition of stone masonry. This is a necessary first step in the development of the analysis tool. The built heritage that will be assessed consists of stone masonry constructions and many of the features analysed can be characterized according to the geometry and arrangement of the stones. Much of the assessment is carried out through visual inspection. Thus, we apply image processing on digital images of the elements under inspection. The main contribution of the paper is the performance evaluation of the automatic categorization of masonry walls from a set of extracted straight line segments. The element chosen to perform this evaluation is the stone arrangement of masonry walls. The validity of the proposed framework is assessed on real images of masonry walls using machine learning paradigms. These include classifiers as well as automatic feature selection.
Boost OCR accuracy using iVector based system combination approach
NASA Astrophysics Data System (ADS)
Peng, Xujun; Cao, Huaigu; Natarajan, Prem
2015-01-01
Optical character recognition (OCR) is a challenging task because most existing preprocessing approaches are sensitive to writing style, writing material, noises and image resolution. Thus, a single recognition system cannot address all factors of real document images. In this paper, we describe an approach to combine diverse recognition systems by using iVector based features, which is a newly developed method in the field of speaker verification. Prior to system combination, document images are preprocessed and text line images are extracted with different approaches for each system, where iVector is transformed from a high-dimensional supervector of each text line and is used to predict the accuracy of OCR. We merge hypotheses from multiple recognition systems according to the overlap ratio and the predicted OCR score of text line images. We present evaluation results on an Arabic document database where the proposed method is compared against the single best OCR system using word error rate (WER) metric.
NASA Astrophysics Data System (ADS)
Sheng, Yehua; Zhang, Ka; Ye, Chun; Liang, Cheng; Li, Jian
2008-04-01
Considering the problem of automatic traffic sign detection and recognition in stereo images captured under motion conditions, a new algorithm for traffic sign detection and recognition based on features and probabilistic neural networks (PNN) is proposed in this paper. Firstly, global statistical color features of left image are computed based on statistics theory. Then for red, yellow and blue traffic signs, left image is segmented to three binary images by self-adaptive color segmentation method. Secondly, gray-value projection and shape analysis are used to confirm traffic sign regions in left image. Then stereo image matching is used to locate the homonymy traffic signs in right image. Thirdly, self-adaptive image segmentation is used to extract binary inner core shapes of detected traffic signs. One-dimensional feature vectors of inner core shapes are computed by central projection transformation. Fourthly, these vectors are input to the trained probabilistic neural networks for traffic sign recognition. Lastly, recognition results in left image are compared with recognition results in right image. If results in stereo images are identical, these results are confirmed as final recognition results. The new algorithm is applied to 220 real images of natural scenes taken by the vehicle-borne mobile photogrammetry system in Nanjing at different time. Experimental results show a detection and recognition rate of over 92%. So the algorithm is not only simple, but also reliable and high-speed on real traffic sign detection and recognition. Furthermore, it can obtain geometrical information of traffic signs at the same time of recognizing their types.
Improved Techniques for Automatic Chord Recognition from Music Audio Signals
ERIC Educational Resources Information Center
Cho, Taemin
2014-01-01
This thesis is concerned with the development of techniques that facilitate the effective implementation of capable automatic chord transcription from music audio signals. Since chord transcriptions can capture many important aspects of music, they are useful for a wide variety of music applications and also useful for people who learn and perform…
Automatic Cataloguing and Searching for Retrospective Data by Use of OCR Text.
ERIC Educational Resources Information Center
Tseng, Yuen-Hsien
2001-01-01
Describes efforts in supporting information retrieval from OCR (optical character recognition) degraded text. Reports on approaches used in an automatic cataloging and searching contest for books in multiple languages, including a vector space retrieval model, an n-gram indexing method, and a weighting scheme; and discusses problems of Asian…
RFID: A Revolution in Automatic Data Recognition
ERIC Educational Resources Information Center
Deal, Walter F., III
2004-01-01
Radio frequency identification, or RFID, is a generic term for technologies that use radio waves to automatically identify people or objects. There are several methods of identification, but the most common is to store a serial number that identifies a person or object, and perhaps other information, on a microchip that is attached to an antenna…
38 CFR 51.31 - Automatic recognition.
Code of Federal Regulations, 2012 CFR
2012-07-01
...) PER DIEM FOR NURSING HOME CARE OF VETERANS IN STATE HOMES Obtaining Per Diem for Nursing Home Care in... that already is recognized by VA as a State home for nursing home care at the time this part becomes effective, automatically will continue to be recognized as a State home for nursing home care but will be...
38 CFR 51.31 - Automatic recognition.
Code of Federal Regulations, 2011 CFR
2011-07-01
...) PER DIEM FOR NURSING HOME CARE OF VETERANS IN STATE HOMES Obtaining Per Diem for Nursing Home Care in... that already is recognized by VA as a State home for nursing home care at the time this part becomes effective, automatically will continue to be recognized as a State home for nursing home care but will be...
38 CFR 51.31 - Automatic recognition.
Code of Federal Regulations, 2013 CFR
2013-07-01
...) PER DIEM FOR NURSING HOME CARE OF VETERANS IN STATE HOMES Obtaining Per Diem for Nursing Home Care in... that already is recognized by VA as a State home for nursing home care at the time this part becomes effective, automatically will continue to be recognized as a State home for nursing home care but will be...
38 CFR 51.31 - Automatic recognition.
Code of Federal Regulations, 2014 CFR
2014-07-01
...) PER DIEM FOR NURSING HOME CARE OF VETERANS IN STATE HOMES Obtaining Per Diem for Nursing Home Care in... that already is recognized by VA as a State home for nursing home care at the time this part becomes effective, automatically will continue to be recognized as a State home for nursing home care but will be...
38 CFR 51.31 - Automatic recognition.
Code of Federal Regulations, 2010 CFR
2010-07-01
...) PER DIEM FOR NURSING HOME CARE OF VETERANS IN STATE HOMES Obtaining Per Diem for Nursing Home Care in... that already is recognized by VA as a State home for nursing home care at the time this part becomes effective, automatically will continue to be recognized as a State home for nursing home care but will be...
Investigating Prompt Difficulty in an Automatically Scored Speaking Performance Assessment
ERIC Educational Resources Information Center
Cox, Troy L.
2013-01-01
Speaking assessments for second language learners have traditionally been expensive to administer because of the cost of rating the speech samples. To reduce the cost, many researchers are investigating the potential of using automatic speech recognition (ASR) as a means to score examinee responses to open-ended prompts. This study examined the…
Computer-Aided Authoring System (AUTHOR) User's Guide. Volume I. Final Report.
ERIC Educational Resources Information Center
Guitard, Charles R.
This user's guide for AUTHOR, an automatic authoring system which produces programmed texts for teaching symbol recognition, provides detailed instructions to help the user construct and enter the information needed to create the programmed text, run the AUTHOR program, and edit the automatically composed paper. Major sections describe steps in…
Psychopaths lack the automatic avoidance of social threat: relation to instrumental aggression.
Louise von Borries, Anna Katinka; Volman, Inge; de Bruijn, Ellen Rosalia Aloïs; Bulten, Berend Hendrik; Verkes, Robbert Jan; Roelofs, Karin
2012-12-30
Psychopathy (PP) is associated with marked abnormalities in social emotional behaviour, such as high instrumental aggression (IA). A crucial but largely ignored question is whether automatic social approach-avoidance tendencies may underlie this condition. We tested whether offenders with PP show lack of automatic avoidance tendencies, usually activated when (healthy) individuals are confronted with social threat stimuli (angry faces). We applied a computerized approach-avoidance task (AAT), where participants pushed or pulled pictures of emotional faces using a joystick, upon which the faces decreased or increased in size, respectively. Furthermore, participants completed an emotion recognition task which was used to control for differences in recognition of facial emotions. In contrast to healthy controls (HC), PP patients showed total absence of avoidance tendencies towards angry faces. Interestingly, those responses were related to levels of instrumental aggression and the (in)ability to experience personal distress (PD). These findings suggest that social performance in psychopaths is disturbed on a basic level of automatic action tendencies. The lack of implicit threat avoidance tendencies may underlie their aggressive behaviour. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.
Pakulak, Eric; Neville, Helen J.
2010-01-01
While anecdotally there appear to be differences in the way native speakers use and comprehend their native language, most empirical investigations of language processing study university students and none have studied differences in language proficiency which may be independent of resource limitations such as working memory span. We examined differences in language proficiency in adult monolingual native speakers of English using an event-related potential (ERP) paradigm. ERPs were recorded to insertion phrase structure violations in naturally spoken English sentences. Participants recruited from a wide spectrum of society were given standardized measures of English language proficiency, and two complementary ERP analyses were performed. In between-groups analyses, participants were divided, based on standardized proficiency scores, into Lower Proficiency (LP) and Higher Proficiency (HP) groups. Compared to LP participants, HP participants showed an early anterior negativity that was more focal, both spatially and temporally, and a larger and more widely distributed positivity (P600) to violations. In correlational analyses, we utilized a wide spectrum of proficiency scores to examine the degree to which individual proficiency scores correlated with individual neural responses to syntactic violations in regions and time windows identified in the between-group analyses. This approach also employed partial correlation analyses to control for possible confounding variables. These analyses provided evidence for the effects of proficiency that converged with the between-groups analyses. These results suggest that adult monolingual native speakers of English who vary in language proficiency differ in the recruitment of syntactic processes that are hypothesized to be at least in part automatic as well as of those thought to be more controlled. These results also suggest that in order to fully characterize neural organization for language in native speakers it is necessary to include participants of varying proficiency. PMID:19925188
Razza, Sergio; Zaccone, Monica; Meli, Aannalisa; Cristofari, Eliana
2017-12-01
Children affected by hearing loss can experience difficulties in challenging and noisy environments even when deafness is corrected by cochlear implant (CI) devices. These patients have a selective attention deficit in multiple listening conditions. At present, the most effective ways to improve the performance of speech recognition in noise consist of providing CI processors with noise reduction algorithms and of providing patients with bilateral CIs. The aim of this study was to compare speech performance in noise, across increasing noise levels, in CI recipients using two kinds of wireless remote-microphone radio systems that use digital radio frequency transmission: the Roger Inspiro accessory and the Cochlear Wireless Mini Microphone accessory. Eleven young users of the Nucleus Cochlear CP910 CI were studied. The signal/noise ratio, at a speech reception threshold (SRT) value of 50%, was measured in different conditions for each patient: with the CI only, with the Roger accessory, or with the MiniMic accessory. The effect of the application of the SNR-noise reduction algorithm in each of these conditions was also assessed. The tests were performed with the subject positioned in front of the main speaker, at a distance of 2.5 m. Another two speakers were positioned at 3.50 m. The main speaker presented disyllabic words at 65 dB. A babble noise signal was delivered through the other speakers, with variable intensity. The use of both wireless remote microphones improved the SRT results. Both systems yielded a gain in speech performance. The gain was higher with the Mini Mic system (SRT = -4.76) than with the Roger system (SRT = -3.01). The addition of the NR algorithm did not statistically further improve the results. There is significant improvement in speech recognition results with both wireless digital remote microphone accessories, in particular with the Mini Mic system when used with the CP910 processor. The use of a remote microphone accessory surpasses the benefit of applying the NR algorithm. Copyright © 2017. Published by Elsevier B.V.
Voice-processing technologies--their application in telecommunications.
Wilpon, J G
1995-01-01
As the telecommunications industry evolves over the next decade to provide the products and services that people will desire, several key technologies will become commonplace. Two of these, automatic speech recognition and text-to-speech synthesis, will provide users with more freedom on when, where, and how they access information. While these technologies are currently in their infancy, their capabilities are rapidly increasing and their deployment in today's telephone network is expanding. The economic impact of just one application, the automation of operator services, is well over $100 million per year. Yet there still are many technical challenges that must be resolved before these technologies can be deployed ubiquitously in products and services throughout the worldwide telephone network. These challenges include: (i) High level of accuracy. The technology must be perceived by the user as highly accurate, robust, and reliable. (ii) Easy to use. Speech is only one of several possible input/output modalities for conveying information between a human and a machine, much like a computer terminal or Touch-Tone pad on a telephone. It is not the final product. Therefore, speech technologies must be hidden from the user. That is, the burden of using the technology must be on the technology itself. (iii) Quick prototyping and development of new products and services. The technology must support the creation of new products and services based on speech in an efficient and timely fashion. In this paper I present a vision of the voice-processing industry with a focus on the areas with the broadest base of user penetration: speech recognition, text-to-speech synthesis, natural language processing, and speaker recognition technologies. The current and future applications of these technologies in the telecommunications industry will be examined in terms of their strengths, limitations, and the degree to which user needs have been or have yet to be met. Although noteworthy gains have been made in areas with potentially small user bases and in the more mature speech-coding technologies, these subjects are outside the scope of this paper. PMID:7479815
Higgins, Eleanor L; Raskind, Marshall H
2004-12-01
This study was conducted to assess the effectiveness of two programs developed by the Frostig Center Research Department to improve the reading and spelling of students with learning disabilities (LD): a computer Speech Recognition-based Program (SRBP) and a computer and text-based Automaticity Program (AP). Twenty-eight LD students with reading and spelling difficulties (aged 8 to 18) received each program for 17 weeks and were compared with 16 students in a contrast group who did not receive either program. After adjusting for age and IQ, both the SRBP and AP groups showed significant differences over the contrast group in improving word recognition and reading comprehension. Neither program showed significant differences over contrasts in spelling. The SRBP also improved the performance of the target group when compared with the contrast group on phonological elision and nonword reading efficiency tasks. The AP showed significant differences in all process and reading efficiency measures.
The Automaticity of Emotional Face-Context Integration
Aviezer, Hillel; Dudarev, Veronica; Bentin, Shlomo; Hassin, Ran R.
2011-01-01
Recent studies have demonstrated that context can dramatically influence the recognition of basic facial expressions, yet the nature of this phenomenon is largely unknown. In the present paper we begin to characterize the underlying process of face-context integration. Specifically, we examine whether it is a relatively controlled or automatic process. In Experiment 1 participants were motivated and instructed to avoid using the context while categorizing contextualized facial expressions, or they were led to believe that the context was irrelevant. Nevertheless, they were unable to disregard the context, which exerted a strong effect on their emotion recognition. In Experiment 2, participants categorized contextualized facial expressions while engaged in a concurrent working memory task. Despite the load, the context exerted a strong influence on their recognition of facial expressions. These results suggest that facial expressions and their body contexts are integrated in an unintentional, uncontrollable, and relatively effortless manner. PMID:21707150
ATR applications of minimax entropy models of texture and shape
NASA Astrophysics Data System (ADS)
Zhu, Song-Chun; Yuille, Alan L.; Lanterman, Aaron D.
2001-10-01
Concepts from information theory have recently found favor in both the mainstream computer vision community and the military automatic target recognition community. In the computer vision literature, the principles of minimax entropy learning theory have been used to generate rich probabilistic models of texture and shape. In addition, the method of types and large deviation theory has permitted the difficulty of various texture and shape recognition tasks to be characterized by 'order parameters' that determine how fundamentally vexing a task is, independent of the particular algorithm used. These information-theoretic techniques have been demonstrated using traditional visual imagery in applications such as simulating cheetah skin textures and finding roads in aerial imagery. We discuss their application to problems in the specific application domain of automatic target recognition using infrared imagery. We also review recent theoretical and algorithmic developments which permit learning minimax entropy texture models for infrared textures in reasonable timeframes.
NASA Astrophysics Data System (ADS)
Chen, Dan; Guo, Lin-yuan; Wang, Chen-hao; Ke, Xi-zheng
2017-07-01
Equalization can compensate for channel distortion caused by multipath effects and effectively improve the convergence of the modulation constellation diagram in an optical wireless system. In this paper, the subspace blind equalization algorithm is used to preprocess M-ary phase shift keying (MPSK) subcarrier modulation signals in the receiver. Mountain clustering is adopted to obtain the clustering centers of the MPSK modulation constellation diagram, and the modulation order is automatically identified through a k-nearest neighbor (KNN) classifier. Experiments were conducted under four different weather conditions. The results show that the convergence of the constellation diagram is effectively improved after using the subspace blind equalization algorithm, which increases the accuracy of modulation recognition. The correct recognition rate of 16PSK can reach 85% in every weather condition considered in the paper; the correct recognition rate is highest in cloudy conditions and lowest in heavy rain.
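The final classification step can be illustrated with a short, hedged sketch. The subspace blind-equalization and mountain-clustering stages of the paper are not reproduced here; the phase-histogram feature, the signal model, and every numeric setting below are assumptions introduced only to show how a k-nearest-neighbor classifier can identify MPSK modulation order.

```python
# Hedged sketch: identifying MPSK modulation order with a k-NN classifier.
# Not the authors' pipeline; only the classification idea is illustrated.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def mpsk_symbols(order, n, snr_db, rng):
    """Generate n noisy MPSK symbols of the given order (toy channel)."""
    phases = 2 * np.pi * rng.integers(0, order, n) / order
    symbols = np.exp(1j * phases)
    noise_std = 10 ** (-snr_db / 20) / np.sqrt(2)
    return symbols + noise_std * (rng.standard_normal(n) + 1j * rng.standard_normal(n))

def phase_histogram(symbols, bins=64):
    """Feature vector: normalized histogram of symbol phases."""
    h, _ = np.histogram(np.angle(symbols), bins=bins, range=(-np.pi, np.pi))
    return h / h.sum()

rng = np.random.default_rng(0)
orders = [2, 4, 8, 16]
X_train = [phase_histogram(mpsk_symbols(m, 2000, 15, rng)) for m in orders for _ in range(20)]
y_train = [m for m in orders for _ in range(20)]

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
test = phase_histogram(mpsk_symbols(16, 2000, 15, rng))
print("predicted modulation order:", clf.predict([test])[0])
```

In the paper itself, the features would come from the equalized, clustered constellation rather than from a raw phase histogram.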
Pärkkä, Juha; Cluitmans, Luc; Ermes, Miikka
2010-09-01
An inactive and sedentary lifestyle is a major problem in many industrialized countries today. Automatic recognition of the type of physical activity can be used to show the user the distribution of his daily activities and to motivate him towards a more active lifestyle. In this study, an automatic activity-recognition system consisting of wireless motion bands and a PDA is evaluated. The system classifies raw sensor data into activity types online. It uses a decision tree classifier, which has low computational cost and low battery consumption. The classifier parameters can be personalized online by performing a short bout of an activity and by telling the system which activity is being performed. Data were collected with seven volunteers during five everyday activities: lying, sitting/standing, walking, running, and cycling. The online system can detect these activities with 86.6% overall accuracy and with 94.0% accuracy after classifier personalization.
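A minimal sketch of the kind of low-cost decision-tree activity classifier described above is given below; the window length, sampling rate, per-window features, and synthetic accelerometer signals are illustrative assumptions, not the authors' actual feature set or data.

```python
# Hedged sketch: decision-tree activity recognition from accelerometer windows.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

FS = 50  # assumed sampling rate in Hz

def window_features(acc_mag):
    """Summarize one window of acceleration-magnitude samples."""
    spectrum = np.abs(np.fft.rfft(acc_mag - acc_mag.mean()))
    dominant_hz = np.argmax(spectrum) * FS / len(acc_mag)
    return [acc_mag.mean(), acc_mag.std(), dominant_hz]

def synth_window(activity, rng):
    """Crude synthetic acceleration-magnitude window (2 s) for one activity."""
    t = np.arange(2 * FS) / FS
    freq = {"lying": 0.0, "walking": 2.0, "running": 3.0}[activity]
    amp = {"lying": 0.01, "walking": 0.3, "running": 0.8}[activity]
    return 1.0 + amp * np.sin(2 * np.pi * freq * t) + 0.05 * rng.standard_normal(len(t))

rng = np.random.default_rng(1)
activities = ["lying", "walking", "running"]
X = [window_features(synth_window(a, rng)) for a in activities for _ in range(50)]
y = [a for a in activities for _ in range(50)]

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(tree.predict([window_features(synth_window("running", rng))]))
```

The shallow tree keeps both the per-classification cost and the battery drain low, which is the design point the study emphasizes.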
Activity Recognition for Personal Time Management
NASA Astrophysics Data System (ADS)
Prekopcsák, Zoltán; Soha, Sugárka; Henk, Tamás; Gáspár-Papanek, Csaba
We describe an accelerometer-based activity recognition system for mobile phones with a special focus on personal time management. We compare several data mining algorithms for the automatic recognition task in single-user and multi-user scenarios, and improve accuracy with heuristics and advanced data mining methods. The results show that daily activities can be recognized with high accuracy and that integration with the RescueTime software can give good insights for personal time management.
Performance of a Working Face Recognition Machine using Cortical Thought Theory
1984-12-04
been considered (2). Recommendations from Bledsoe's study included research on facial-recognition systems that are "completely automatic (remove the... C. L. Location of some facial features. Palo Alto: Panoramic Research, Aug 1966. 2. Bledsoe, W. W. Man-machine facial recognition: Is... image?" It would seem that the location and size of the features left in this contrast-expanded image contain the essential information of facial
Facial recognition in education system
NASA Astrophysics Data System (ADS)
Krithika, L. B.; Venkatesh, K.; Rathore, S.; Kumar, M. Harish
2017-11-01
Human beings make extensive use of emotions for conveying messages and resolving them. Emotion detection and face recognition can provide an interface between individuals and technologies. The most successful application of recognition analysis is the recognition of faces. Many different techniques have been used to recognize facial expressions and to detect emotion under varying poses. In this paper, we propose an efficient method to recognize facial expressions by tracking facial points and distances. The method can automatically identify an observer's face movements and facial expression in an image, capturing different aspects of emotion and facial expression.
Lee, Kichol; Casali, John G
2016-01-01
To investigate the effect of controlled low-speed wind noise on the auditory situation awareness performance afforded by military hearing protection/enhancement devices (HPED) and tactical communication and protective systems (TCAPS). Recognition/identification and pass-through communications tasks were separately conducted under three wind conditions (0, 5, and 10 mph). Subjects wore two in-ear-type TCAPS, one earmuff-type TCAPS, a Combat Arms Earplug in its 'open' or pass-through setting, and an EB-15LE electronic earplug. Devices with electronic gain systems were tested under two gain settings: 'unity' and 'max'. Testing without any device (open ear) was conducted as a control. Ten subjects were recruited from the student population at Virginia Tech. Audiometric requirements were 25 dBHL or better at 500, 1000, 2000, 4000, and 8000 Hz in both ears. Performance on the communication task-by-device interaction was significantly different only at the 0 mph wind speed. The between-device performance differences varied with azimuthal speaker locations. It is evident from this study that stable (non-gusting) wind speeds up to 10 mph did not significantly degrade recognition/identification task performance and pass-through communication performance of the group of HPEDs and TCAPS tested. However, the various devices performed differently as the test sound signal speaker location was varied, and it appears that physical as well as electronic features may have contributed to this directional result.
[Study on the automatic parameters identification of water pipe network model].
Jia, Hai-Feng; Zhao, Qi-Feng
2010-01-01
Based on an analysis of the problems in the development and application of water pipe network models, the automatic identification of model parameters is regarded as a key bottleneck for applying such models in water supply enterprises. A methodology for automatic identification of water pipe network model parameters based on GIS and SCADA databases is proposed. The kernel algorithm of automatic parameter identification is then studied: RSA (Regionalized Sensitivity Analysis) is used for automatic recognition of sensitive parameters, and MCS (Monte-Carlo Sampling) is used for automatic identification of parameters; a detailed technical route based on RSA and MCS is presented. A module for automatic identification of water pipe network model parameters is developed. Finally, a typical water pipe network is selected as a case, a case study on automatic identification of model parameters is conducted, and satisfactory results are achieved.
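The MCS idea can be made concrete with a small, hedged sketch: candidate parameter values are sampled at random, a hydraulic model is run for each, and the samples whose predictions best match SCADA observations are retained. The single-pipe head-loss model and all numbers below are illustrative assumptions, not the paper's network model.

```python
# Hedged sketch: Monte-Carlo sampling for pipe-roughness identification.
import numpy as np

def simulated_head_loss(roughness_c, flow_lps=30.0):
    """Toy Hazen-Williams-style head loss for one 500 m, 0.2 m pipe (illustrative)."""
    return 10.67 * (flow_lps / 1000) ** 1.852 / (roughness_c ** 1.852 * 0.2 ** 4.87) * 500

rng = np.random.default_rng(42)
# pretend SCADA measurement generated with a "true" roughness C of 110
observed = simulated_head_loss(110.0) + rng.normal(0, 0.05)

candidates = rng.uniform(80, 150, size=5000)          # Monte-Carlo samples of roughness C
errors = np.abs([simulated_head_loss(c) - observed for c in candidates])
best = candidates[np.argsort(errors)[:100]]           # keep the best-fitting samples

print(f"identified roughness C ~ {best.mean():.1f} +/- {best.std():.1f}")
```

In the paper, the sampled parameters feed a full network model and the RSA step first narrows the search to the parameters the observations are actually sensitive to.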
Speech information retrieval: a review
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hafen, Ryan P.; Henry, Michael J.
Audio is an information-rich component of multimedia. Information can be extracted from audio in a number of different ways, and thus there are several established audio signal analysis research fields. These fields include speech recognition, speaker recognition, audio segmentation and classification, and audio fingerprinting. The information that can be extracted from tools and methods developed in these fields can greatly enhance multimedia systems. In this paper, we present the current state of research in each of the major audio analysis fields. The goal is to introduce enough background for someone new in the field to quickly gain high-level understanding and to provide direction for further study.
Transcribe Your Class: Using Speech Recognition to Improve Access for At-Risk Students
ERIC Educational Resources Information Center
Bain, Keith; Lund-Lucas, Eunice; Stevens, Janice
2012-01-01
Through a project supported by Canada's Social Development Partnerships Program, a team of leading National Disability Organizations, universities, and industry partners are piloting a prototype Hosted Transcription Service that uses speech recognition to automatically create multimedia transcripts that can be used by students for study purposes.…
ERIC Educational Resources Information Center
Cordier, Deborah
2009-01-01
A renewed focus on foreign language (FL) learning and speech for communication has resulted in computer-assisted language learning (CALL) software developed with Automatic Speech Recognition (ASR). ASR features for FL pronunciation (Lafford, 2004) are functional components of CALL designs used for FL teaching and learning. The ASR features…
ERIC Educational Resources Information Center
Spironelli, Chiara; Penolazzi, Barbara; Vio, Claudio; Angrilli, Alessandro
2010-01-01
Brain plasticity was investigated in 14 Italian children affected by developmental dyslexia after 6 months of phonological training. The means used to measure language reorganization was the recognition potential, an early wave, also called N150, elicited by automatic word recognition. This component peaks over the left temporo-occipital cortex…
Higher-order neural network software for distortion invariant object recognition
NASA Technical Reports Server (NTRS)
Reid, Max B.; Spirkovska, Lilly
1991-01-01
The state-of-the-art in pattern recognition for such applications as automatic target recognition and industrial robotic vision relies on digital image processing. We present a higher-order neural network model and software which performs the complete feature extraction-pattern classification paradigm required for automatic pattern recognition. Using a third-order neural network, we demonstrate complete, 100 percent accurate invariance to distortions of scale, position, and in-plane rotation. In a higher-order neural network, feature extraction is built into the network, and does not have to be learned. Only the relatively simple classification step must be learned. This is key to achieving very rapid training. The training set is much smaller than with standard neural network software because the higher-order network only has to be shown one view of each object to be learned, not every possible view. The software and graphical user interface run on any Sun workstation. Results of the use of the neural software in autonomous robotic vision systems are presented. Such a system could have extensive application in robotic manufacturing.
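The built-in invariance of such a third-order network can be illustrated with a hedged sketch: for every triple of active pixels, the interior angles of the triangle they form are unchanged by translation, scaling, and in-plane rotation, so a histogram over quantized angle triples is an invariant signature. This is a simplified reduction of the idea, not the authors' implementation.

```python
# Hedged sketch: triangle-angle signatures as translation/scale/rotation-invariant features.
import itertools
import numpy as np

def triangle_angles(a, b, c):
    """Sorted interior angles of the triangle through three 2-D points."""
    a, b, c = map(np.asarray, (a, b, c))
    def ang(p, q, r):  # angle at vertex p
        u, v = q - p, r - p
        cosang = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
        return np.arccos(np.clip(cosang, -1.0, 1.0))
    return sorted([ang(a, b, c), ang(b, a, c), ang(c, a, b)])

def signature(points, bins=6):
    """Normalized histogram over quantized angle triples of all point triples."""
    counts = np.zeros((bins, bins, bins))
    for tri in itertools.combinations(points, 3):
        q = [min(int(t / np.pi * bins), bins - 1) for t in triangle_angles(*tri)]
        counts[tuple(q)] += 1
    return counts / max(counts.sum(), 1)

pts = [(0, 0), (4, 1), (1, 3), (5, 5), (2, 7)]
theta = np.pi / 5
rot = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
transformed = [tuple(3 * rot @ np.array(p) + 10) for p in pts]  # rotate, scale, translate
print("max signature difference:", np.abs(signature(pts) - signature(transformed)).max())
```

The printed difference is essentially zero, which is the sense in which the invariant feature extraction is "built in" rather than learned.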
3D automatic anatomy recognition based on iterative graph-cut-ASM
NASA Astrophysics Data System (ADS)
Chen, Xinjian; Udupa, Jayaram K.; Bagci, Ulas; Alavi, Abass; Torigian, Drew A.
2010-02-01
We call the computerized assistive process of recognizing, delineating, and quantifying organs and tissue regions in medical imaging, occurring automatically during clinical image interpretation, automatic anatomy recognition (AAR). The AAR system we are developing includes five main parts: model building, object recognition, object delineation, pathology detection, and organ system quantification. In this paper, we focus on the delineation part. For the modeling part, we employ the active shape model (ASM) strategy. For recognition and delineation, we integrate several hybrid strategies that combine purely image-based methods with ASM. In this paper, an iterative Graph-Cut ASM (IGCASM) method is proposed for object delineation. An algorithm called GC-ASM, which attempted to combine ASM and GC synergistically, was presented at this symposium last year for object delineation in 2D images. Here, we extend this method to 3D medical image delineation. The IGCASM method effectively combines the rich statistical shape information embodied in ASM with the globally optimal delineation capability of the GC method. We propose a new GC cost function, which effectively integrates the specific image information with the ASM shape model information. The proposed methods are tested on a clinical abdominal CT data set. The preliminary results show that: (a) it is feasible to explicitly bring prior 3D statistical shape information into the GC framework; (b) the 3D IGCASM delineation method improves on ASM and GC and can provide practical operational time on clinical images.
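A hedged sketch of how an image term and an ASM shape prior can be blended into a single labeling cost is shown below. The per-voxel intensity model, the signed-distance shape representation, and the weighting are assumptions for illustration; the paper's actual GC cost function and the graph-cut optimizer are not reproduced.

```python
# Hedged sketch: blending an image (unary) term with an ASM shape-prior term.
import numpy as np

def unary_cost(intensity, label, mean_obj=120.0, mean_bkg=40.0, sigma=25.0):
    """Image term: penalty for labeling a voxel object/background given its intensity."""
    mean = mean_obj if label == 1 else mean_bkg
    return (intensity - mean) ** 2 / (2 * sigma ** 2)

def shape_cost(signed_dist, label, lam=0.5):
    """Shape term: penalty for labeling against the ASM prior, represented as a
    signed distance map (negative inside the model shape, positive outside)."""
    return lam * max(0.0, signed_dist if label == 1 else -signed_dist)

def total_cost(intensities, signed_dists, labels):
    """Total labeling cost over a set of voxels (pairwise terms omitted)."""
    return sum(unary_cost(i, l) + shape_cost(d, l)
               for i, d, l in zip(intensities, signed_dists, labels))

# toy example: three voxels, prior says the first two are inside the shape
print(total_cost([130, 110, 45], [-2.0, -0.5, 3.0], [1, 1, 0]))
```

A real implementation would add pairwise smoothness terms between neighboring voxels and minimize the total energy with a min-cut/max-flow solver, iterating with ASM updates as the paper describes.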
ERIC Educational Resources Information Center
Nickerson, Catherine
2015-01-01
The impact of globalisation in the last 20 years has led to an overwhelming increase in the use of English as the medium through which many business people get their work done. As a result, the linguistic landscape within which we now operate as researchers and teachers has changed both rapidly and beyond all recognition. In the discussion below,…
Limited connected speech experiment
NASA Astrophysics Data System (ADS)
Landell, P. B.
1983-03-01
The purpose of this contract was to demonstrate that Connected Speech Recognition (CSR) can be performed in real time on a vocabulary of one hundred words and to test the performance of the CSR system for twenty-five male and twenty-five female speakers. This report describes the contractor's real-time laboratory CSR system, the database and training software developed in accordance with the contract, and the results of the performance tests.
Combining Multiple Knowledge Sources for Speech Recognition
1988-09-15
Thus, the first is able to clarify the pronunciation (TASSEAJ for the acronym TASA!). best adaptation sentence, the second sentence, when added... 10 rapid adaptation sentences, and 15 spell-mode phrases. resource management SPEAKER-DEPENDENT DATABASE sentences were randomly... combining the smoothed phoneme models with the detailed context models. The system was tested on a standard database. BYBLOS makes maximal use
ERIC Educational Resources Information Center
Ashwell, Tim; Elam, Jesse R.
2017-01-01
The ultimate aim of our research project was to use the Google Web Speech API to automate scoring of elicited imitation (EI) tests. However, in order to achieve this goal, we had to take a number of preparatory steps. We needed to assess how accurate this speech recognition tool is in recognizing native speakers' production of the test items; we…
Gisting Technique Development.
1981-12-01
furnished tapes (" Stonehenge " database) which were used for previous contracts. Recognition results for English male and female speakers are presented in...independent " Stonehenge " test data. A variety of options in generating word arrays were tried; the results below describe the most successful of these. The...time to carry out any quantitative tests, ............. Page 22 even the obvious one of retraining the " Stonehenge " English vocabulary on-line, we
Early-stage chunking of finger tapping sequences by persons who stutter and fluent speakers.
Smits-Bandstra, Sarah; De Nil, Luc F
2013-01-01
This research note explored the hypothesis that chunking differences underlie the slow finger-tap sequencing performance reported in the literature for persons who stutter (PWS) relative to fluent speakers (PNS). Early-stage chunking was defined as an immediate and spontaneous tendency to organize a long sequence into pauses, for motor planning, and chunks of fluent motor performance. A previously published study in which 12 PWS and 12 matched PNS practised a 10-item finger tapping sequence 30 times was examined. Both groups significantly decreased the duration of between-chunk intervals (BCIs) and within-chunk intervals (WCIs) over practice. PNS had significantly shorter WCIs relative to PWS, but minimal differences between groups were found for the number of, or duration of, BCI. Results imply that sequencing differences found between PNS and PWS may be due to differences in automatizing movements within chunks or retrieving chunks from memory rather than chunking per se.
Crossmodal plasticity in the fusiform gyrus of late blind individuals during voice recognition.
Hölig, Cordula; Föcker, Julia; Best, Anna; Röder, Brigitte; Büchel, Christian
2014-12-01
Blind individuals are trained in identifying other people through voices. In congenitally blind adults the anterior fusiform gyrus has been shown to be active during voice recognition. Such crossmodal changes have been associated with a superiority of blind adults in voice perception. The key question of the present functional magnetic resonance imaging (fMRI) study was whether visual deprivation that occurs in adulthood is followed by similar adaptive changes of the voice identification system. Late blind individuals and matched sighted participants were tested in a priming paradigm, in which two voice stimuli were subsequently presented. The prime (S1) and the target (S2) were either from the same speaker (person-congruent voices) or from two different speakers (person-incongruent voices). Participants had to classify the S2 as either coming from an old or a young person. Only in late blind but not in matched sighted controls, the activation in the anterior fusiform gyrus was modulated by voice identity: late blind volunteers showed an increase of the BOLD signal in response to person-incongruent compared with person-congruent trials. These results suggest that the fusiform gyrus adapts to input of a new modality even in the mature brain and thus demonstrate an adult type of crossmodal plasticity. Copyright © 2014 Elsevier Inc. All rights reserved.
SNR-adaptive stream weighting for audio-MES ASR.
Lee, Ki-Seung
2008-08-01
Myoelectric signals (MESs) from the speaker's mouth region have been shown to improve the noise robustness of automatic speech recognizers (ASRs), thus promising to extend their usability in noise-robust ASR. In the recognition system presented herein, extracted audio and facial MES features were integrated by a decision fusion method, where the likelihood score of the audio-MES observation vector was given by a linear combination of the class-conditional observation log-likelihoods of two classifiers, using appropriate weights. We developed a weighting process adaptive to SNRs. The main objective of the paper is to determine the optimal SNR classification boundaries and to construct a set of optimum stream weights for each SNR class. These two parameters were determined by a method based on a maximum mutual information criterion. Acoustic and facial MES data were collected from five subjects, using a 60-word vocabulary. Four types of acoustic noise, including babble, car, aircraft, and white noise, were acoustically added to clean speech signals with SNR ranging from -14 to 31 dB. The classification accuracy of the audio ASR was as low as 25.5%, whereas the classification accuracy of the MES ASR was 85.2%. The classification accuracy could be further improved by employing the proposed audio-MES weighting method, reaching 89.4% in the case of babble noise. A similar result was also found for the other types of noise.
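A hedged sketch of SNR-adaptive decision fusion is given below; the SNR class boundaries and stream weights are illustrative placeholders rather than the values the paper learns with its maximum-mutual-information criterion.

```python
# Hedged sketch: SNR-dependent linear fusion of audio and MES log-likelihoods.
import numpy as np

SNR_BOUNDARIES = [0.0, 15.0]            # assumed boundaries splitting SNR into 3 classes
AUDIO_WEIGHTS = [0.2, 0.5, 0.8]         # assumed audio weight per SNR class (MES gets 1 - w)

def snr_class(snr_db):
    """Index of the SNR class the current utterance falls into."""
    return sum(snr_db >= b for b in SNR_BOUNDARIES)

def fused_scores(audio_loglik, mes_loglik, snr_db):
    """Linear combination of per-class log-likelihoods from the two streams."""
    w = AUDIO_WEIGHTS[snr_class(snr_db)]
    return w * np.asarray(audio_loglik) + (1 - w) * np.asarray(mes_loglik)

# toy 3-word vocabulary: noisy audio favors word 2, MES favors word 0
audio = np.array([-12.0, -9.0, -7.5])
mes = np.array([-4.0, -8.0, -9.0])
print("decision at -5 dB:", int(np.argmax(fused_scores(audio, mes, -5.0))))
print("decision at 25 dB:", int(np.argmax(fused_scores(audio, mes, 25.0))))
```

At low SNR the MES stream dominates the decision, and at high SNR the audio stream does, which is exactly the trade-off the adaptive weighting is meant to capture.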
A comparison of algorithms for inference and learning in probabilistic graphical models.
Frey, Brendan J; Jojic, Nebojsa
2005-09-01
Research into methods for reasoning under uncertainty is currently one of the most exciting areas of artificial intelligence, largely because it has recently become possible to record, store, and process large amounts of data. While impressive achievements have been made in pattern classification problems such as handwritten character recognition, face detection, speaker identification, and prediction of gene function, it is even more exciting that researchers are on the verge of introducing systems that can perform large-scale combinatorial analyses of data, decomposing the data into interacting components. For example, computational methods for automatic scene analysis are now emerging in the computer vision community. These methods decompose an input image into its constituent objects, lighting conditions, motion patterns, etc. Two of the main challenges are finding effective representations and models in specific applications and finding efficient algorithms for inference and learning in these models. In this paper, we advocate the use of graph-based probability models and their associated inference and learning algorithms. We review exact techniques and various approximate, computationally efficient techniques, including iterated conditional modes, the expectation maximization (EM) algorithm, Gibbs sampling, the mean field method, variational techniques, structured variational techniques and the sum-product algorithm ("loopy" belief propagation). We describe how each technique can be applied in a vision model of multiple, occluding objects and contrast the behaviors and performances of the techniques using a unifying cost function, free energy.
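As a concrete illustration of one of the reviewed techniques, the sketch below runs the EM algorithm on the simplest possible latent-variable model, a two-component one-dimensional Gaussian mixture; the vision models discussed in the paper are far richer, and this example only makes the E-step/M-step structure explicit.

```python
# Hedged sketch: EM for a two-component 1-D Gaussian mixture.
import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])

mu = np.array([-1.0, 1.0])      # initial means
var = np.array([1.0, 1.0])      # initial variances
pi = np.array([0.5, 0.5])       # initial mixing proportions

for _ in range(50):
    # E-step: posterior responsibility of each component for each data point
    lik = pi * np.exp(-(data[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    resp = lik / lik.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibilities
    n_k = resp.sum(axis=0)
    mu = (resp * data[:, None]).sum(axis=0) / n_k
    var = (resp * (data[:, None] - mu) ** 2).sum(axis=0) / n_k
    pi = n_k / len(data)

print("estimated means:", np.round(mu, 2))
```

The same alternation between inferring hidden variables and re-estimating parameters underlies the richer scene-decomposition models the review discusses.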
NASA Astrophysics Data System (ADS)
Hassanat, Ahmad B. A.; Jassim, Sabah
2010-04-01
In this paper, the automatic lip reading problem is investigated, and an innovative approach to solving it is proposed. This new VSR approach depends on the signature of the word itself, obtained from a hybrid feature extraction method based on geometric, appearance, and image transform features. The proposed VSR approach is termed "visual words". The visual words approach consists of two main parts: 1) feature extraction/selection, and 2) visual speech feature recognition. After localizing the face and lips, several visual features of the lips were extracted, such as the height and width of the mouth, the mutual information and quality measurement between the DWT of the current ROI and the DWT of the previous ROI, the ratio of vertical to horizontal features taken from the DWT of the ROI, the ratio of vertical edges to horizontal edges of the ROI, the appearance of the tongue, and the appearance of teeth. Each spoken word is represented by 8 signals, one for each feature. These signals preserve the dynamics of the spoken word, which carry a substantial portion of the information. The system is then trained on these features using KNN and DTW. This approach has been evaluated using a large database of different people and large experiment sets. The evaluation has demonstrated the efficiency of the visual words approach and shown that VSR is a speaker-dependent problem.
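The recognition back end (a DTW distance between per-word feature signals followed by a nearest-neighbour decision) can be sketched as follows; the lip feature extraction is not shown, and the 8-channel signals used here are random placeholders rather than the geometric/appearance/transform features of the paper.

```python
# Hedged sketch: DTW distance plus 1-NN classification of word feature signals.
import numpy as np

def dtw_distance(a, b):
    """DTW between two sequences of feature vectors (frames x features)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def classify(query, templates):
    """1-NN decision: return the label of the closest word template under DTW."""
    return min(templates, key=lambda label: dtw_distance(query, templates[label]))

rng = np.random.default_rng(0)
# placeholder templates: 8-channel feature signals of different lengths per word
templates = {word: rng.random((20 + 5 * i, 8)) for i, word in enumerate(["yes", "no", "stop"])}
query = templates["no"] + 0.05 * rng.random(templates["no"].shape)  # noisy repetition
print(classify(query, templates))  # expected: "no"
```

DTW absorbs differences in speaking rate between repetitions of the same word, which is why it pairs naturally with the per-word signal representation described in the abstract.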
Information Processing Techniques Program. Volume 1. Packet Speech Systems Technology
1980-03-31
DMA transfer is enabled from the 2652 serial I/O device to the buffer memory. This enables automatic reception of an incoming packet without CPU... conference speaker. Producing multiple copies at the source wastes network bandwidth and is likely to cause local overload conditions for a large... wasted. If the setup fails because ST can find no route with sufficient capacity, the phone will have rung and possibly been answered but the call will
Performance of wavelet analysis and neural networks for pathological voices identification
NASA Astrophysics Data System (ADS)
Salhi, Lotfi; Talbi, Mourad; Abid, Sabeur; Cherif, Adnane
2011-09-01
Within the medical environment, diverse techniques exist to assess the state of a patient's voice. The inspection technique is inconvenient for a number of reasons, such as its high cost, the duration of the inspection, and, above all, the fact that it is an invasive technique. This study focuses on a robust, rapid and accurate system for automatic identification of pathological voices. This system employs a non-invasive, inexpensive and fully automated method based on a hybrid approach: wavelet transform analysis and a neural network classifier. First, we present the results obtained in our previous study using classic feature parameters. These results allow visual identification of pathological voices. Second, quantified parameters derived from the wavelet analysis are proposed to characterise the speech sample. In addition, a system of multilayer neural networks (MNNs) has been developed which carries out the automatic detection of pathological voices. The developed method was evaluated using a voice database composed of recorded voice samples (continuous speech) from normophonic or dysphonic speakers. The dysphonic speakers were patients of the National Hospital 'RABTA' of Tunis, Tunisia, and a University Hospital in Brussels, Belgium. Experimental results indicate a success rate ranging between 75% and 98.61% for discrimination of normal and pathological voices using the proposed parameters and neural network classifier. We also compared the average classification rate based on the MNN, Gaussian mixture model and support vector machines.
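A hedged sketch of the hybrid pipeline (wavelet sub-band energies as features, a multilayer neural network as the classifier) is shown below; the wavelet family, decomposition depth, and the synthetic "voices" are illustrative assumptions, not the study's data or settings.

```python
# Hedged sketch: wavelet sub-band log-energies + MLP classifier for voice pathology.
import numpy as np
import pywt
from sklearn.neural_network import MLPClassifier

def wavelet_energies(signal, wavelet="db4", level=5):
    """Log-energy of each sub-band of a discrete wavelet decomposition."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    return [np.log(np.sum(c ** 2) + 1e-12) for c in coeffs]

def synth_voice(pathological, rng, n=4096):
    """Toy stand-in for a voice sample: pathological ones get extra noise and jitter."""
    t = np.arange(n) / 8000.0
    jitter = 0.02 * rng.standard_normal(n) if pathological else 0.0
    noise = (0.3 if pathological else 0.05) * rng.standard_normal(n)
    return np.sin(2 * np.pi * 150 * t * (1 + jitter)) + noise

rng = np.random.default_rng(0)
X = [wavelet_energies(synth_voice(p, rng)) for p in (0, 1) for _ in range(40)]
y = [p for p in (0, 1) for _ in range(40)]

mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, y)
print("predicted class:", mlp.predict([wavelet_energies(synth_voice(1, rng))])[0])
```

In the study the inputs are real continuous-speech recordings and the comparison is extended to GMM and SVM classifiers; the sketch only shows how the wavelet features feed the neural network.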
Li, Wenbo; Zhao, Sheng; Wu, Nan; Zhong, Junwen; Wang, Bo; Lin, Shizhe; Chen, Shuwen; Yuan, Fang; Jiang, Hulin; Xiao, Yongjun; Hu, Bin; Zhou, Jun
2017-07-19
Wearable active sensors have extensive applications in mobile biosensing and human-machine interaction but require good flexibility, high sensitivity, excellent stability, and a self-powered feature. In this work, cellular polypropylene (PP) piezoelectret was chosen as the core material of a sensitivity-enhanced wearable active voiceprint sensor (SWAVS) to realize voiceprint recognition. By virtue of the dipole orientation control method, the air layers in the piezoelectret were efficiently utilized, and the current sensitivity was enhanced (from 1.98 pA/Hz to 5.81 pA/Hz at 115 dB). The SWAVS exhibited the advantages of high sensitivity, accurate frequency response, and excellent stability. The voiceprint recognition system could react correctly to human voices by verifying both the password and the speaker. This study presents a voiceprint sensor with potential applications in noncontact biometric recognition and safety guarantee systems, promoting the progress of wearable sensor networks.
Han, Guanghui; Liu, Xiabi; Zheng, Guangyuan; Wang, Murong; Huang, Shan
2018-06-06
Ground-glass opacity (GGO) is a common CT imaging sign on high-resolution CT, which means the lesion is more likely to be malignant compared to common solid lung nodules. The automatic recognition of GGO CT imaging signs is of great importance for early diagnosis and possible cure of lung cancers. Existing GGO recognition methods employ traditional low-level features, and system performance is improving slowly. Considering the high performance of CNN models in the computer vision field, we propose an automatic recognition method for 3D GGO CT imaging signs through the fusion of hybrid resampling and layer-wise fine-tuned CNN models. Our hybrid resampling is performed on multi-views and multi-receptive fields, which reduces the risk of missing small or large GGOs by adopting representative sampling panels and processing GGOs at multiple scales simultaneously. The layer-wise fine-tuning strategy has the ability to obtain the optimal fine-tuning model. The multi-CNN-model fusion strategy obtains better performance than any single trained model. We evaluated our method on the GGO nodule samples in the publicly available LIDC-IDRI dataset of chest CT scans. The experimental results show that our method yields excellent results with 96.64% sensitivity, 71.43% specificity, and 0.83 F1 score. Our method is a promising approach to applying deep learning to computer-aided analysis of specific CT imaging signs with insufficient labeled images.
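Layer-wise fine-tuning can be sketched as progressively unfreezing parameter groups of a pretrained backbone and keeping the depth that validates best; the torchvision ResNet-18 backbone and two-class head below are assumptions for illustration and are not the CNN architecture or training pipeline used in the paper.

```python
# Hedged sketch: layer-wise fine-tuning by unfreezing increasing depths of a CNN.
import torch
import torch.nn as nn
from torchvision import models

def build_finetune_model(unfreeze_from: int) -> nn.Module:
    """Freeze all parameters, then unfreeze the last `unfreeze_from` tensors."""
    model = models.resnet18(weights=None)          # a pretrained checkpoint would be loaded in practice
    model.fc = nn.Linear(model.fc.in_features, 2)  # assumed 2-class head (GGO vs. non-GGO)
    params = list(model.parameters())
    for p in params:
        p.requires_grad = False
    for p in params[len(params) - unfreeze_from:]:
        p.requires_grad = True
    return model

# candidate fine-tuning depths; in practice each is trained, validated, and the
# best-performing depths are kept and fused, as the abstract describes
for depth in (2, 10, 20):
    m = build_finetune_model(depth)
    trainable = sum(p.numel() for p in m.parameters() if p.requires_grad)
    print(f"unfreezing last {depth} parameter tensors -> {trainable} trainable weights")
```

Freezing most of the network keeps the number of trainable weights small, which is what makes fine-tuning viable when labeled GGO images are scarce.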
Automatic micropropagation of plants--the vision-system: graph rewriting as pattern recognition
NASA Astrophysics Data System (ADS)
Schwanke, Joerg; Megnet, Roland; Jensch, Peter F.
1993-03-01
The automation of plant-micropropagation is necessary to produce high amounts of biomass. Plants have to be dissected on particular cutting-points. A vision-system is needed for the recognition of the cutting-points on the plants. With this background, this contribution is directed to the underlying formalism to determine cutting-points on abstract-plant models. We show the usefulness of pattern recognition by graph-rewriting along with some examples in this context.
Probst, Yasmine; Nguyen, Duc Thanh; Tran, Minh Khoi; Li, Wanqing
2015-01-01
Dietary assessment, while traditionally based on pen-and-paper, is rapidly moving towards automatic approaches. This study describes an Australian automatic food record method and its prototype for dietary assessment using a mobile phone and techniques of image processing and pattern recognition. Common visual features including the scale-invariant feature transform (SIFT), local binary patterns (LBP), and colour are used for describing food images. The popular bag-of-words (BoW) model is employed for recognizing the images taken by a mobile phone for dietary assessment. Technical details are provided together with discussions on the issues and future work. PMID:26225994
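A hedged sketch of the bag-of-words structure (local descriptors, a learned codebook, per-image word histograms, and a classifier) follows; for brevity the local descriptors are raw image patches rather than SIFT/LBP/colour features, and the "food images" are synthetic.

```python
# Hedged sketch: bag-of-words image classification with a k-means codebook.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def patches(img, size=8, step=8):
    """Extract flattened local patches from a grayscale image."""
    return np.array([img[i:i + size, j:j + size].ravel()
                     for i in range(0, img.shape[0] - size + 1, step)
                     for j in range(0, img.shape[1] - size + 1, step)])

def bow_histogram(img, codebook):
    """Assign each patch to its nearest visual word and count occurrences."""
    words = codebook.predict(patches(img))
    return np.bincount(words, minlength=codebook.n_clusters) / len(words)

def synth_image(klass, rng):
    """Toy stand-ins for two food classes with different textures."""
    img = rng.random((64, 64))
    if klass == 1:
        img[::4, :] += 1.0   # horizontal striping distinguishes one class
    return img

rng = np.random.default_rng(0)
train_imgs = [synth_image(k, rng) for k in (0, 1) for _ in range(20)]
train_labels = [k for k in (0, 1) for _ in range(20)]

codebook = KMeans(n_clusters=32, n_init=4, random_state=0).fit(
    np.vstack([patches(im) for im in train_imgs]))
X = [bow_histogram(im, codebook) for im in train_imgs]
svm = LinearSVC().fit(X, train_labels)
print("predicted class:", svm.predict([bow_histogram(synth_image(1, rng), codebook)])[0])
```

Swapping the patch descriptors for SIFT, LBP, and colour features, as the study does, leaves the codebook-histogram-classifier structure unchanged.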
Error Rates in Users of Automatic Face Recognition Software
White, David; Dunn, James D.; Schmid, Alexandra C.; Kemp, Richard I.
2015-01-01
In recent years, wide deployment of automatic face recognition systems has been accompanied by substantial gains in algorithm performance. However, benchmarking tests designed to evaluate these systems do not account for the errors of human operators, who are often an integral part of face recognition solutions in forensic and security settings. This causes a mismatch between evaluation tests and operational accuracy. We address this by measuring user performance in a face recognition system used to screen passport applications for identity fraud. Experiment 1 measured target detection accuracy in algorithm-generated ‘candidate lists’ selected from a large database of passport images. Accuracy was notably poorer than in previous studies of unfamiliar face matching: participants made over 50% errors for adult target faces, and over 60% when matching images of children. Experiment 2 then compared performance of student participants to trained passport officers–who use the system in their daily work–and found equivalent performance in these groups. Encouragingly, a group of highly trained and experienced “facial examiners” outperformed these groups by 20 percentage points. We conclude that human performance curtails accuracy of face recognition systems–potentially reducing benchmark estimates by 50% in operational settings. Mere practise does not attenuate these limits, but superior performance of trained examiners suggests that recruitment and selection of human operators, in combination with effective training and mentorship, can improve the operational accuracy of face recognition systems. PMID:26465631
Pinheiro, Ana P; Rezaii, Neguine; Nestor, Paul G; Rauber, Andréia; Spencer, Kevin M; Niznikiewicz, Margaret
2016-02-01
During speech comprehension, multiple cues need to be integrated at a millisecond speed, including semantic information, as well as voice identity and affect cues. A processing advantage has been demonstrated for self-related stimuli when compared with non-self stimuli, and for emotional relative to neutral stimuli. However, very few studies investigated self-other speech discrimination and, in particular, how emotional valence and voice identity interactively modulate speech processing. In the present study we probed how the processing of words' semantic valence is modulated by speaker's identity (self vs. non-self voice). Sixteen healthy subjects listened to 420 prerecorded adjectives differing in voice identity (self vs. non-self) and semantic valence (neutral, positive and negative), while electroencephalographic data were recorded. Participants were instructed to decide whether the speech they heard was their own (self-speech condition), someone else's (non-self speech), or if they were unsure. The ERP results demonstrated interactive effects of speaker's identity and emotional valence on both early (N1, P2) and late (Late Positive Potential - LPP) processing stages: compared with non-self speech, self-speech with neutral valence elicited more negative N1 amplitude, self-speech with positive valence elicited more positive P2 amplitude, and self-speech with both positive and negative valence elicited more positive LPP. ERP differences between self and non-self speech occurred in spite of similar accuracy in the recognition of both types of stimuli. Together, these findings suggest that emotion and speaker's identity interact during speech processing, in line with observations of partially dependent processing of speech and speaker information. Copyright © 2016. Published by Elsevier Inc.
Recognition of plant parts with problem-specific algorithms
NASA Astrophysics Data System (ADS)
Schwanke, Joerg; Brendel, Thorsten; Jensch, Peter F.; Megnet, Roland
1994-06-01
Automatic micropropagation is necessary to produce cost-effective high amounts of biomass. Juvenile plants are dissected in clean-room environment on particular points on the stem or the leaves. A vision-system detects possible cutting points and controls a specialized robot. This contribution is directed to the pattern-recognition algorithms to detect structural parts of the plant.
User Experience of a Mobile Speaking Application with Automatic Speech Recognition for EFL Learning
ERIC Educational Resources Information Center
Ahn, Tae youn; Lee, Sangmin-Michelle
2016-01-01
With the spread of mobile devices, mobile phones have enormous potential regarding their pedagogical use in language education. The goal of this study is to analyse user experience of a mobile-based learning system that is enhanced by speech recognition technology for the improvement of EFL (English as a foreign language) learners' speaking…
2011-11-04
environmental lighting conditions that one can actually come across. L7 and L8 are also cases of low illumination intensity. To produce our experimental... Graphics (Proceedings of ACM SIGGRAPH), 26(3). [9] Riklin-Raviv T., Shashua A. (1999). The quotient image: class-based recognition and synthesis under
ERIC Educational Resources Information Center
Franco, Horacio; Bratt, Harry; Rossier, Romain; Rao Gadde, Venkata; Shriberg, Elizabeth; Abrash, Victor; Precoda, Kristin
2010-01-01
SRI International's EduSpeak[R] system is a software development toolkit that enables developers of interactive language education software to use state-of-the-art speech recognition and pronunciation scoring technology. Automatic pronunciation scoring allows the computer to provide feedback on the overall quality of pronunciation and to point to…
Optimal pattern synthesis for speech recognition based on principal component analysis
NASA Astrophysics Data System (ADS)
Korsun, O. N.; Poliyev, A. V.
2018-02-01
The algorithm for building an optimal pattern for the purpose of automatic speech recognition, which increases the probability of correct recognition, is developed and presented in this work. The optimal pattern is formed by decomposing an initial pattern into principal components, which makes it possible to reduce the dimensionality of the multi-parameter optimization problem. In the next step, training samples are introduced and optimal estimates of the principal-component decomposition coefficients are obtained by a numerical parameter-optimization algorithm. Finally, we present experimental results that show the improvement in speech recognition achieved by the proposed optimization algorithm.
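The dimensionality-reduction idea can be sketched as follows: a reference pattern is expressed in a handful of principal-component coefficients, and the optimization runs over those coefficients instead of over the full feature vector. The distance-based recognizer and the margin objective below are illustrative stand-ins, not the paper's criterion.

```python
# Hedged sketch: optimizing a reference pattern in PCA-coefficient space.
import numpy as np
from scipy.optimize import minimize
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
dim, n_per_class = 40, 30
class_means = rng.normal(0, 3, size=(3, dim))                  # 3 toy word classes
samples = np.vstack([m + rng.normal(0, 1, size=(n_per_class, dim)) for m in class_means])

pca = PCA(n_components=5).fit(samples)                         # 40-D problem -> 5 coefficients

def margin_loss(coeffs, target=0):
    """Negative margin of the target class for a pattern built from PCA coefficients."""
    pattern = pca.inverse_transform(coeffs.reshape(1, -1))[0]
    dists = np.linalg.norm(class_means - pattern, axis=1)
    others = np.delete(dists, target)
    return dists[target] - others.min()                        # minimizing enlarges the margin

x0 = pca.transform(class_means[0].reshape(1, -1))[0]           # start from the class-0 mean
result = minimize(margin_loss, x0, method="Nelder-Mead")
print("margin before/after:", -margin_loss(x0), -result.fun)
```

Because only five coefficients are optimized rather than forty raw features, the numerical search stays tractable, which is the benefit the abstract attributes to the PCA decomposition.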
The SRI NIST 2010 Speaker Recognition Evaluation System (PREPRINT)
2011-01-01
of several subsystems with the use of adequate side information gives a 35% improvement on the standard telephone condition. We also show that a... ratio and amount of detected speech as side information. The SRI submissions were among the best-performing systems in SRE10.
ERIC Educational Resources Information Center
Houston, Thomas Rappe, Jr.
A homophone is a word having the same pronunciation as another English word, but a different spelling. A list of 7,300 English homophones was compiled and used to construct two tests. Scores were obtained in these and on reference tests for J. P. Guilford's factors CMU, CSU, DMU, and DSU for 70 native speakers of midwestern American English from a…