An Overview of Acoustic Modeling Techniques from ICASSP 2012


Tara N. Sainath

SLTC Newsletter, May 2012

The International Conference on Acoustics, Speech and Signal Processing (ICASSP) was recently hosted in Kyoto, Japan, from March 25-30, 2012. Deep Belief Networks (DBNs) have lately become a popular topic in the speech recognition community, for example showing a relative improvement of over 30% on a Switchboard telephony task [1] compared to a baseline Gaussian Mixture Model/Hidden Markov Model (GMM-HMM) system, the most common approach to acoustic modeling. In this article, we discuss in more detail work on deep belief networks from ICASSP. Specifically, the article highlights two categories of DBN research: ideas for improving training/decoding speed, and alternative neural network architectures.


IMPROVING TRAINING/DECODING SPEED

DBN training is typically performed serially via stochastic gradient descent; it is often slow and difficult to parallelize. [2] explores learning complex functions from large data sets in a parallelizable fashion, through an architecture called the Deep Stacking Network (DSN). Figure 1 shows the typical architecture of a DSN. Each module of a DSN consists of three layers: an input layer, a hidden layer (weights plus a non-linearity), and a linear output layer. The paper describes a convex-optimization formulation to efficiently learn the weights of one DSN module. After one module finishes training, its output, concatenated with the raw input features, is given as input to the next module. The DSN achieves a phone error rate (PER) of around 24% on the TIMIT phone recognition task, compared to a PER of 21-22% for a DBN. However, given the efficiency of its training, and the large gains DBNs show over GMM/HMM systems, this method certainly seems like a promising approach.
Figure 1: DSN architecture.
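To make the module structure concrete, the following is a minimal numpy sketch of DSN training, not the authors' implementation. It assumes a simplified formulation in which the lower-layer weights W of each module are fixed (e.g., random), so that only the linear output weights U are learned; with W fixed, U has a closed-form ridge-regression solution, which is what makes each module's optimization convex. All function names are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_dsn_module(X, T, n_hidden, reg=1e-3, rng=None):
    """X: (n_inputs, n_samples) features; T: (n_targets, n_samples) targets."""
    rng = rng or np.random.default_rng(0)
    W = rng.standard_normal((X.shape[0], n_hidden)) * 0.1  # fixed lower-layer weights (simplification)
    H = sigmoid(W.T @ X)                                   # hidden-layer activations
    # Closed-form ridge-regression solution for the linear output layer:
    # U = (H H^T + reg * I)^{-1} H T^T
    U = np.linalg.solve(H @ H.T + reg * np.eye(n_hidden), H @ T.T)
    return W, U

def stack_dsn(X, T, n_modules, n_hidden):
    """Stack modules: each module sees the raw input concatenated with the
    previous module's output, as described in the article."""
    modules, inp = [], X
    for _ in range(n_modules):
        W, U = train_dsn_module(inp, T, n_hidden)
        Y = U.T @ sigmoid(W.T @ inp)   # this module's output (predictions)
        modules.append((W, U))
        inp = np.vstack([X, Y])        # raw features + previous output feed the next module
    return modules
```

Because each module's training reduces to a least-squares solve over a batch of data, modules can be trained with batch-parallel linear algebra rather than serial stochastic gradient descent, which is the source of the scalability claim.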


[3] explores enforcing random sparseness of the parameters during DBN training, formulated both as soft regularization and as a convex constraint optimization problem. The authors also propose a novel data structure that takes advantage of the sparsity. The proposed methodology was evaluated on both a Switchboard and a Voice Search task. Reducing the number of nonzero connections to one third yields a more general model, reducing the word error rate (WER) by 0.2-0.3% absolute compared to the fully connected model on both datasets. Furthermore, the error rate still matches that of the fully connected model when the nonzero connections are further reduced to only 12% and 19% on the two respective datasets. Under these conditions, the model size can be reduced to 18% and 29% of the original, and decoding speed improved by 14% and 23%, respectively, on the two datasets.
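As a rough illustration of the pruning idea (not the paper's exact schedule or data structure), the sketch below assumes sparsity is enforced by zeroing the smallest-magnitude weights and then freezing that zero pattern during subsequent gradient steps, so the model stays sparse for decoding.

```python
import numpy as np

def sparsify(W, keep_fraction):
    """Zero all but the largest-magnitude `keep_fraction` of weights;
    return the pruned matrix and a mask that freezes the zero pattern."""
    k = int(W.size * keep_fraction)
    threshold = np.partition(np.abs(W).ravel(), -k)[-k]  # k-th largest magnitude
    mask = np.abs(W) >= threshold
    return W * mask, mask

def masked_sgd_step(W, grad, mask, lr=0.01):
    """Gradient step that respects the sparsity pattern: pruned
    connections stay at exactly zero."""
    return (W - lr * grad) * mask

# Example: keep only one third of the connections, as in the first
# operating point reported above.
W = np.random.default_rng(0).standard_normal((512, 2048))
W_sparse, mask = sparsify(W, keep_fraction=1.0 / 3.0)
```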


ALTERNATIVE NEURAL NETWORK ARCHITECTURES

[4] introduces the sequential DBN (SDBN), which models long-range dependencies between consecutive frames by allowing temporal connections between the hidden layers of adjacent frames. The paper presents pre-training and fine-tuning backpropagation derivations for this model. Experiments on TIMIT show that a simple monophone SDBN system compares favorably to a more complex context-dependent GMM/HMM system. In addition, the SDBN has far fewer parameters than a regular DBN; given that DBN training is computationally expensive, reducing the number of parameters is one way to speed up training.
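The core idea, a hidden layer whose state at frame t also depends on the hidden state at frame t-1, can be sketched as below. The single-layer recurrence and all names are illustrative simplifications, not the paper's exact parameterization.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sdbn_forward(X, W, V, b):
    """X: (n_frames, n_inputs) acoustic features. In addition to the usual
    input-to-hidden weights W, a temporal weight matrix V connects the
    hidden layer at frame t-1 to the hidden layer at frame t."""
    n_frames = X.shape[0]
    H = np.zeros((n_frames, W.shape[1]))
    h_prev = np.zeros(W.shape[1])
    for t in range(n_frames):
        H[t] = sigmoid(X[t] @ W + h_prev @ V + b)  # temporal link via V
        h_prev = H[t]
    return H
```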

[5] applies convolutional neural networks (CNNs), which have mainly been explored for computer vision, to speech recognition. A CNN is a type of neural network that models local regions of the input space. This is achieved by applying a set of local filters to small patches of the input, with each filter replicated across all regions of the input space. The idea of local filters is to extract elementary features, such as edges and corners, from different parts of the image. Because several kinds of features must be extracted, multiple feature maps are used, one per feature type; all units within one feature map perform the same operation on different parts of the input. After the local filters process the input, a max-pooling layer subsamples each feature map by taking the maximum over local regions, reducing its resolution and making the output less sensitive to small shifts and distortions. A CNN typically comprises multiple layers that alternate between convolution and pooling. The authors show that the proposed CNN achieves over 10% relative error reduction on the core TIMIT test set compared with a regular DBN using the same number of hidden layers and weights.
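The two operations described above, weight-shared local filtering followed by max-pooling, are sketched below for a single frame of filterbank features along the frequency axis. Filter sizes and the one-dimensional organization are illustrative assumptions, not the exact hybrid NN-HMM configuration of [5].

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conv1d_valid(x, filters):
    """x: (n_bands,) input features; filters: (n_maps, width).
    Returns (n_maps, n_bands - width + 1) feature maps, each produced by
    replicating one local filter across all positions (shared weights)."""
    n_maps, width = filters.shape
    n_out = x.shape[0] - width + 1
    out = np.empty((n_maps, n_out))
    for i in range(n_out):
        out[:, i] = filters @ x[i:i + width]  # same filter applied at every position
    return sigmoid(out)

def max_pool(maps, pool=3):
    """Subsample each feature map by taking the maximum over non-overlapping
    windows, reducing resolution and sensitivity to small shifts."""
    n_maps, n = maps.shape
    n = (n // pool) * pool
    return maps[:, :n].reshape(n_maps, -1, pool).max(axis=2)

# Example: 40 filterbank features, 8 feature maps with width-5 filters.
rng = np.random.default_rng(0)
feats = rng.standard_normal(40)
pooled = max_pool(conv1d_valid(feats, rng.standard_normal((8, 5))))  # shape (8, 12)
```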

[6] analyzes why DBNs have been so promising for speech recognition. The authors argue that DBNs perform well for three reasons: they are neural networks that can be discriminatively fine-tuned, they have many non-linear hidden layers, and they are generatively pre-trained, which makes the fine-tuning optimization easier. The paper shows experimentally, through results on the TIMIT phone recognition task, how each of these aspects improves DBN performance. Furthermore, the authors use dimensionality-reduced visualizations of the feature vectors learned by the DBN to show that the similarity structure of the feature vectors is preserved at multiple scales, visually illustrating the benefits of DBNs.
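For readers unfamiliar with the generative pre-training referred to above, here is a minimal sketch of one contrastive-divergence (CD-1) update for a binary restricted Boltzmann machine, the building block that is stacked layer by layer to pre-train a DBN. The binary units and learning rate are illustrative simplifications.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_update(V0, W, b_h, b_v, lr=0.1, rng=None):
    """One CD-1 step for a binary RBM. V0: (n_samples, n_visible) data batch;
    W: (n_visible, n_hidden) weights; b_h, b_v: hidden/visible biases."""
    rng = rng or np.random.default_rng(0)
    # Positive phase: hidden probabilities given the data.
    Ph0 = sigmoid(V0 @ W + b_h)
    H0 = (rng.random(Ph0.shape) < Ph0).astype(float)  # sample hidden units
    # Negative phase: one Gibbs step (reconstruct visibles, re-infer hiddens).
    Pv1 = sigmoid(H0 @ W.T + b_v)
    Ph1 = sigmoid(Pv1 @ W + b_h)
    # Approximate gradient of the data log-likelihood.
    n = V0.shape[0]
    W += lr * (V0.T @ Ph0 - Pv1.T @ Ph1) / n
    b_h += lr * (Ph0 - Ph1).mean(axis=0)
    b_v += lr * (V0 - Pv1).mean(axis=0)
    return W, b_h, b_v
```

Stacking such layers gives the weights a data-driven initialization, which is the sense in which pre-training "makes the fine-tuning optimization easier."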


REFERENCES

[1] F. Seide, G. Li, X. Chen, D. Yu, "Feature Engineering In Context-Dependent Deep Neural Networks For Conversational Speech Transcription," in Proc. ASRU, December 2011.

[2] L. Deng, D. Yu and J. Platt, "Scalable Stacking and Learning for Building Deep Architectures," in Proc. ICASSP, 2012.

[3] D. Yu, F. Seide, G. Li and L. Deng, "Exploiting Sparseness in Deep Neural Networks for Large Vocabulary Speech Recognition," in Proc. ICASSP, 2012.

[4] G. Andrew and J. Bilmes, "Sequential Deep Belief Networks," in Proc. ICASSP, 2012.

[5] O. Abdel-Hamid, A. Mohamed, H. Jiang and G. Penn, "Applying Convolutional Neural Networks Concepts to Hybrid NN-HMM Model for Speech Recognition," in Proc. ICASSP, 2012.

[6] A. Mohamed, G. Hinton and G. Penn, "Understanding How Deep Belief Networks Perform Acoustic Modeling," in Proc. ICASSP, 2012.

If you have comments, corrections, or additions to this article, please contact the author: Tara Sainath, tsainath [at] us [dot] ibm [dot] com.

Tara Sainath is a Research Staff Member at IBM T.J. Watson Research Center in New York. Her research interests are mainly in acoustic modeling. Email: tsainath@us.ibm.com