Mitchel Weintraub - Fremont CA Francoise Beaufays - Palo Alto CA
Assignee:
Nuance Communications - Menlo Park CA
International Classification:
G10L 15/20
US Classification:
704226, 704233, 381 943
Abstract:
A method and apparatus for generating a noise-reduced feature vector representing human speech are provided. Speech data representing an input speech waveform are first input and filtered. Spectral energies of the filtered speech data are determined, and a noise reduction process is then performed. In the noise reduction process, a spectral magnitude is computed for a frequency index of multiple frequency indexes. A noise magnitude estimate is then determined for the frequency index by updating a histogram of spectral magnitude, and then determining the noise magnitude estimate as a predetermined percentile of the histogram. A signal-to-noise ratio is then determined for the frequency index. A scale factor is computed for the frequency index, as a function of the signal-to-noise ratio and the noise magnitude estimate. The noise magnitude estimate is then scaled by the scale factor.
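The noise-estimation step in this abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the class name `HistogramNoiseTracker`, the bin layout, the 25th-percentile default, and the form of `scale_factor` are all assumptions chosen for readability. The abstract says only that the scale factor is a function of the signal-to-noise ratio and the noise magnitude estimate.

```python
class HistogramNoiseTracker:
    """Hypothetical per-frequency-index histogram of spectral magnitudes."""

    def __init__(self, num_bins=10, max_magnitude=10.0, percentile=0.25):
        self.num_bins = num_bins
        self.max_magnitude = max_magnitude
        self.percentile = percentile          # assumed value, not from the source
        self.bin_width = max_magnitude / num_bins
        self.counts = [0] * num_bins

    def update(self, magnitude):
        # Clamp magnitudes above the histogram range into the last bin.
        idx = min(int(magnitude / self.bin_width), self.num_bins - 1)
        self.counts[idx] += 1

    def noise_estimate(self):
        # Return the centre of the bin where the cumulative count
        # first reaches the chosen percentile of all observations.
        total = sum(self.counts)
        if total == 0:
            return 0.0
        target = self.percentile * total
        cumulative = 0
        for i, count in enumerate(self.counts):
            cumulative += count
            if cumulative >= target:
                return (i + 0.5) * self.bin_width
        return self.max_magnitude


def scale_factor(snr, noise, max_scale=2.0):
    # Assumed form: scale the noise estimate more heavily at low SNR.
    # The noise argument is used here only to guard against an empty estimate.
    if noise <= 0.0:
        return 0.0
    return max_scale / (1.0 + snr)


def scaled_noise(magnitude, tracker):
    """Update the histogram, then return the scaled noise magnitude estimate."""
    tracker.update(magnitude)
    noise = tracker.noise_estimate()
    if noise <= 0.0:
        return 0.0
    snr = magnitude / noise
    return scale_factor(snr, noise) * noise
```

One tracker would be kept per frequency index, with `scaled_noise` called once per frame for each index.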
Method And System For Learning Linguistically Valid Word Pronunciations From Acoustic Data
Francoise Beaufays - Mountain View CA, US Ananth Sankar - Palo Alto CA, US Mitchel Weintraub - Cupertino CA, US Shaun Williams - San Jose CA, US
Assignee:
Nuance Communications, Inc. - Menlo Park CA
International Classification:
G10L 15/06 G10L 15/10
US Classification:
704236, 704240, 704243
Abstract:
A computerized pronunciation system is provided for generating pronunciations for words and storing the pronunciations in a pronunciation dictionary. The system includes a word list including at least one word; transcribed acoustic data including at least one waveform for the word and transcribed text associated with the waveform; a pronunciation-learning module configured to accept as input the word list and the transcribed acoustic data, the pronunciation-learning module including: sets of initial pronunciations of the word, a scoring module configured to score pronunciations and to generate phone probabilities, and a set of alternate pronunciations of the word, wherein the set of alternate pronunciations includes a highest-scoring set of initial pronunciations with a highest-scoring substitute phone substituted for a lowest-probability phone; and a pronunciation dictionary configured to receive the highest-scoring set of initial pronunciations and the set of alternate pronunciations.
Method For Learning Linguistically Valid Word Pronunciations From Acoustic Data
Francoise Beaufays - Mountain View CA, US Ananth Sankar - Palo Alto CA, US Mitchel Weintraub - Cupertino CA, US Shaun Williams - San Jose CA, US
Assignee:
Nuance Communications, Inc. - Menlo Park CA
International Classification:
G10L 15/06 G10L 15/10
US Classification:
704236, 704240, 704243
Abstract:
A computerized method is provided for generating pronunciations for words and storing the pronunciations in a pronunciation dictionary. The method includes graphing sets of initial pronunciations; thereafter in an ASR subsystem determining a highest-scoring set of initial pronunciations; generating sets of alternate pronunciations, wherein each set of alternate pronunciations includes the highest-scoring set of initial pronunciations with a lowest-probability phone of the highest-scoring initial pronunciation substituted with a unique-substitute phone; graphing the sets of alternate pronunciations; determining in the ASR subsystem a highest-scoring set of alternate pronunciations; and adding to a pronunciation dictionary the highest-scoring set of alternate pronunciations.
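The substitution step described in this abstract might look like the sketch below. It assumes the phone probabilities for the highest-scoring pronunciation are already available; the function names, the phone inventory, and `score_fn` (a stand-in for the ASR subsystem's scoring) are hypothetical.

```python
def weakest_phone_index(phone_probs):
    """Index of the lowest-probability phone in a pronunciation."""
    return min(range(len(phone_probs)), key=lambda i: phone_probs[i])


def generate_alternates(phones, phone_probs, inventory):
    """Build one alternate per substitute phone, each replacing the weakest phone."""
    idx = weakest_phone_index(phone_probs)
    alternates = []
    for substitute in inventory:
        if substitute != phones[idx]:
            alternate = list(phones)
            alternate[idx] = substitute
            alternates.append(alternate)
    return alternates


def pick_best(candidates, score_fn):
    """Return the candidate with the highest score under the supplied scorer."""
    return max(candidates, key=score_fn)
```

In a full system, `score_fn` would align each candidate against the transcribed waveforms, and the winner would be added to the pronunciation dictionary.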
Training An Automatic Speech Recognition System Using Compressed Word Frequencies
Respective word frequencies may be determined from a corpus of utterance-to-text-string mappings that contain associations between audio utterances and a respective text string transcription of each audio utterance. Respective compressed word frequencies may be obtained based on the respective word frequencies such that the distribution of the respective compressed word frequencies has a lower variance than the distribution of the respective word frequencies. Sample utterance-to-text-string mappings may be selected from the corpus of utterance-to-text-string mappings based on the compressed word frequencies. An automatic speech recognition (ASR) system may be trained with the sample utterance-to-text-string mappings.
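As a rough illustration of the idea, the sketch below compresses raw word counts with `log1p` and samples utterances using weights derived from the compressed counts. The choice of a logarithm, the per-utterance weighting rule, and all function names are assumptions. The abstract requires only that the compressed distribution have lower variance than the raw one.

```python
import math
import random
from collections import Counter


def word_frequencies(corpus):
    """Count words across the text side of (utterance_id, text) mappings."""
    counts = Counter()
    for _, text in corpus:
        counts.update(text.split())
    return counts


def compress(counts):
    # Assumed compression function: log(1 + count).
    return {word: math.log1p(count) for word, count in counts.items()}


def utterance_weight(text, counts, compressed):
    # Assumed rule: average ratio of compressed to raw frequency,
    # which boosts utterances made of rarer words.
    words = text.split()
    return sum(compressed[w] / counts[w] for w in words) / len(words)


def sample_mappings(corpus, k, seed=0):
    """Draw k mappings (with replacement) using the compressed-frequency weights."""
    counts = word_frequencies(corpus)
    compressed = compress(counts)
    weights = [utterance_weight(text, counts, compressed) for _, text in corpus]
    return random.Random(seed).choices(corpus, weights=weights, k=k)
```

The sampled mappings would then serve as the training set for the ASR system.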
Method And System For Automatic Text-Independent Grading Of Pronunciation For Language Instruction
Leonardo Neumeyer - Palo Alto CA Horacio Franco - Atherton CA Mitchel Weintraub - Fremont CA Patti Price - Menlo Park CA Vassilios Digalakis - Chania, GR
Assignee:
SRI International - Menlo Park CA
International Classification:
G10L 15/08
US Classification:
704246
Abstract:
Pronunciation quality is automatically evaluated for an utterance of speech based on one or more pronunciation scores. One type of pronunciation score is based on duration of acoustic units. Examples of acoustic units include phones and syllables. Another type of pronunciation score is based on a posterior probability that a piece of input speech corresponds to a certain model such as an HMM, given the piece of input speech. Speech may be segmented into phones and syllables for evaluation with respect to the models. The utterance of speech may be an arbitrary utterance made up of a sequence of words which had not been encountered before. Pronunciation scores are converted into grades as would be assigned by human graders. Pronunciation quality may be evaluated in a client-server language instruction environment.
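The two score types named in this abstract can be sketched as below. The Gaussian duration model, the dictionary-based inputs, and the function names are illustrative assumptions. The abstract does not say how the models are parameterized or how scores are mapped to grades, so that mapping is left out.

```python
import math


def frame_log_posterior(log_likelihoods, priors, target):
    """Log posterior of the target phone for one frame, via Bayes' rule.

    log_likelihoods: phone -> log p(frame | phone), e.g. from HMM output densities
    priors:          phone -> P(phone)
    """
    joint = {p: log_likelihoods[p] + math.log(priors[p]) for p in log_likelihoods}
    peak = max(joint.values())
    # Log-sum-exp over all phones for the normalizing term.
    log_total = peak + math.log(sum(math.exp(v - peak) for v in joint.values()))
    return joint[target] - log_total


def segment_posterior_score(frames, priors, target):
    """Average frame log posterior over a phone segment."""
    return sum(frame_log_posterior(f, priors, target) for f in frames) / len(frames)


def duration_log_score(duration, mean, std):
    """Log density of a (normalized) duration under an assumed Gaussian model."""
    z = (duration - mean) / std
    return -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * z * z
```

Because both scores depend only on the segmented speech and the models, they can be computed for arbitrary utterances rather than a fixed set of prompts.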
Method For Spectral Estimation To Improve Noise Robustness For Speech Recognition
Adoram Erell - Ramat Aviv, IL Mitchel Weintraub - Fremont CA
Assignee:
SRI International - Menlo Park CA
International Classification:
G10L 5/00
US Classification:
381 47
Abstract:
A method is disclosed for use in preprocessing noisy speech to minimize likelihood of error in estimation for use in a recognizer. The computationally feasible technique, herein called Minimum-Mean-Log-Spectral-Distance (MMLSD) estimation using mixture models and Markov models, comprises the steps of calculating, for each vector of speech in the presence of noise corresponding to a single time frame, an estimate of clean speech, where the basic assumptions of the estimator are that the probability distribution of clean speech can be modeled by a mixture of components, each representing a different speech class, assuming different frequency channels are uncorrelated within each class, and that noise at different frequency channels is uncorrelated. In a further embodiment of the invention, the method comprises the steps of calculating, for each sequence of vectors of speech in the presence of noise corresponding to a sequence of time frames, an estimate of clean speech, where the basic assumptions of the estimator are that the probability distribution of clean speech can be modeled by a Markov process, assuming different frequency channels are uncorrelated within each state of the Markov process, and that noise at different frequency channels is uncorrelated.
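The single-frame mixture estimator described above can be written out as follows. This is a sketch under stated assumptions, not text from the patent: the symbols $S_k$ (clean spectral energy in channel $k$), $S'_k$ (noisy observation), $c_n$ (mixture weight of class $n$), and $K$ (number of channels) are introduced here for illustration, and "uncorrelated" is treated as independence.

```latex
% Minimizing the mean squared log-spectral distance gives a conditional mean:
\log \hat{S}_k = E\left[\log S_k \mid S'\right]

% With the clean-speech mixture p(S) = \sum_n c_n \prod_k p(S_k \mid n)
% and channel-wise independent noise, the estimate decomposes per class:
\log \hat{S}_k = \sum_n P(n \mid S') \, E\left[\log S_k \mid S'_k, n\right]

% where the class posterior combines evidence from all K channels:
P(n \mid S') = \frac{c_n \prod_{k=1}^{K} p(S'_k \mid n)}
                    {\sum_m c_m \prod_{k=1}^{K} p(S'_k \mid m)}
```

In the Markov-model embodiment, the class posterior for a frame would instead be a state posterior computed from the whole sequence of noisy frames.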
Method And Apparatus For Automatic Text-Independent Grading Of Pronunciation For Language Instruction
Leonardo Neumeyer - Palo Alto CA Horacio Franco - Atherton CA Mitchel Weintraub - Fremont CA Patti Price - Menlo Park CA Vassilios Digalakis - Chania, GR
Assignee:
SRI International - Menlo Park CA
International Classification:
G10L 15/08
US Classification:
704246
Abstract:
Pronunciation quality is automatically evaluated for an utterance of speech based on one or more pronunciation scores. One type of pronunciation score is based on duration of acoustic units. Examples of acoustic units include phones and syllables. Another type of pronunciation score is based on a posterior probability that a piece of input speech corresponds to a certain model, such as a hidden Markov model, given the piece of input speech. Speech may be segmented into phones and syllables for evaluation with respect to the models. The utterance of speech may be an arbitrary utterance made up of a sequence of words which had not been encountered before. Pronunciation scores are converted into grades as would be assigned by human graders. Pronunciation quality may be evaluated in a client-server language instruction environment.
Method For Establishing Handset-Dependent Normalizing Models For Speaker Recognition
Larry P. Heck - Sunnyvale CA Mitchel Weintraub - Fremont CA
Assignee:
SRI International - Menlo Park CA
International Classification:
G10L 5/06
US Classification:
704234
Abstract:
Adverse effects of type mismatch between acoustic input devices used during testing and during training in machine-based recognition of the source of acoustic phenomena are minimized. A normalizing model is matched to a source model based, or dependent, upon an acoustic input device whose transfer characteristics color acoustic characteristics of a source as represented in the source model. An application of the present invention is to speaker recognition, i.e., recognition of the identity of a speaker by the speaker's voice.
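One way to picture the matching of a normalizing model to a source model is the sketch below, which scores test frames as a log-likelihood ratio against the normalizing model for the handset type used when the speaker model was trained. The one-dimensional Gaussian models, the handset labels, and the function names are stand-ins chosen for brevity. The abstract does not specify the model family or the scoring formula.

```python
import math


def gaussian_log_likelihood(x, mean, variance):
    """Log density of a scalar feature under a 1-D Gaussian (toy stand-in model)."""
    return -0.5 * math.log(2.0 * math.pi * variance) - (x - mean) ** 2 / (2.0 * variance)


def normalized_score(frames, speaker_model, normalizers):
    """Average log-likelihood ratio of the speaker model to its matched normalizer.

    speaker_model: dict with 'mean', 'variance', and the 'handset' type seen in training
    normalizers:   handset type -> dict with 'mean' and 'variance'
    """
    # Select the normalizing model by the speaker model's training handset.
    norm = normalizers[speaker_model["handset"]]
    total = 0.0
    for x in frames:
        total += gaussian_log_likelihood(x, speaker_model["mean"], speaker_model["variance"])
        total -= gaussian_log_likelihood(x, norm["mean"], norm["variance"])
    return total / len(frames)
```

Selecting the normalizer this way means both models carry the same handset coloring, so the coloring tends to cancel in the ratio.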