CMU-CS-97-148
Computer Science Department
School of Computer Science, Carnegie Mellon University
 
    
     
 
Vocal Tract Length Normalization for Large
Vocabulary Continuous Speech Recognition 
Puming Zhan, Alex Waibel 
May 1997  
Also appears as Language Technologies Institute Technical Report 
CMU-LTI-97-150. 
CMU-CS-97-148.ps

Keywords: Frequency warping, VTLN, vocal tract length normalization,
speaker normalization, adaptation, speech recognition

Generally speaking, the speaker dependence of a speech recognition
system stems from speaker-dependent speech features. Variation in vocal
tract shape is one of the major sources of inter-speaker variation. In this
paper, we address several methods of vocal tract length normalization (VTLN)
for large vocabulary continuous speech recognition:
(1) we explore bilinear warping VTLN in the frequency domain;
(2) we propose a speaker-specific Bark/Mel scale VTLN in the Bark/Mel domain;
(3) we investigate adaptation of the normalization factor.
Our experimental results show that the speaker-specific Bark/Mel scale VTLN
outperforms piecewise/bilinear warping VTLN in the frequency domain. It
reduces the word error rate by up to 12% on our Spanish and English
spontaneous speech scheduling task database. For adaptation of the
normalization factor, our experiments show that promising results can be
obtained using no more than three utterances from a speaker to estimate the
normalization factor, and that unsupervised adaptation works as well as
supervised adaptation. Therefore, the computational complexity of VTLN can be
avoided by learning the normalization factor from very few utterances of a
new speaker.
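
As an illustration of the frequency-domain warping mentioned in (1), the following is a minimal sketch of the commonly used all-pass (bilinear) warping function, which maps a normalized angular frequency in [0, pi] to a warped frequency in the same range. The abstract does not give the exact formula used in the report, so the form below, the function name bilinear_warp, and the parameter name alpha are assumptions made for illustration only.

```python
import math


def bilinear_warp(omega, alpha):
    """Warp a normalized angular frequency with the all-pass bilinear map.

    omega: frequency in radians, expected in [0, pi].
    alpha: hypothetical per-speaker warping factor, -1 < alpha < 1.
           alpha = 0 leaves the frequency axis unchanged.
    """
    if not -1.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between -1 and 1")
    # Phase response of the first-order all-pass substitution.
    # The denominator 1 - alpha*cos(omega) is always positive for |alpha| < 1.
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))


if __name__ == "__main__":
    # Print the warped axis for a few illustrative (not reported) factors.
    for alpha in (-0.1, 0.0, 0.1):
        warped = [bilinear_warp(k * math.pi / 4, alpha) for k in range(5)]
        print(alpha, [round(w, 3) for w in warped])
```

The map keeps the end points 0 and pi fixed and is monotonic, so it stretches or compresses the frequency axis without reordering it. In this sketch a positive alpha shifts interior frequencies upward and a negative alpha shifts them downward.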
 
22 pages 