An Overview of Acoustic Modeling Techniques from ICASSP 2012
Tara N. Sainath
SLTC Newsletter, May 2012

The International Conference on Acoustics, Speech and Signal Processing (ICASSP) was recently hosted in Kyoto, Japan, from March 25-30, 2012. Deep Belief Networks (DBNs) have lately become a popular topic in the speech recognition community, for example showing an improvement of over 30% relative on a Switchboard telephony task [1] compared to a baseline Gaussian Mixture Model/Hidden Markov Model (GMM-HMM) system, the common approach in acoustic modeling. In this article, we discuss work on deep belief networks from ICASSP in more detail. Specifically, the article highlights two categories of DBN research: ideas for improving training/decoding speed, and alternative neural network architectures.
IMPROVING TRAINING/DECODING SPEED
DBN training is typically performed serially via stochastic gradient descent; it is often slow and difficult to parallelize. [2] explores learning complex functions from large data sets in a parallelizable way, through an architecture called the Deep Stacking Network (DSN). Figure 1 shows the typical architecture of a DSN. Each module of a DSN consists of three layers: an input layer, a weight + non-linearity layer, and a linear output layer. The paper describes a convex-optimization formulation to efficiently learn the weights of one DSN module. After one module finishes training, its output together with the original input features is given as input to the next module. DSN performance on the TIMIT phone recognition task is a phone error rate (PER) of around 24%, compared to a PER of 21-22% for a DBN. However, given the efficiency of its training and the large gains DBN-style models show over GMM/HMM systems, this method certainly seems like a promising approach.
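To make the stacking idea concrete, here is a minimal NumPy sketch of one DSN-style module and the stacking loop. It is illustrative only: the lower-layer weights `W` are simply fixed at random here, while the upper, linear weights `U` are solved in closed form by (ridge-regularized) least squares, which is the convex part of the formulation the paper describes; the shapes and the regularizer `lam` are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_dsn_module(X, T, n_hidden):
    """One DSN-style module (illustrative sketch).

    X: (n_features, n_samples) input, T: (n_targets, n_samples) targets.
    The sigmoid hidden layer uses fixed random weights W; only the
    linear output weights U are learned, via a closed-form
    least-squares solution (the convex sub-problem)."""
    W = 0.1 * rng.standard_normal((n_hidden, X.shape[0]))
    H = 1.0 / (1.0 + np.exp(-(W @ X)))           # sigmoid hidden layer
    lam = 1e-3                                    # ridge regularizer
    # U = T H^T (H H^T + lam I)^{-1}
    U = T @ H.T @ np.linalg.inv(H @ H.T + lam * np.eye(n_hidden))
    Y = U @ H                                     # linear output layer
    return W, U, Y

# Stack modules: each new module sees the raw input features plus the
# previous module's output, as described in the article.
X = rng.standard_normal((20, 500))   # toy features
T = rng.standard_normal((5, 500))    # toy targets
for _ in range(3):
    W, U, Y = train_dsn_module(X, T, n_hidden=50)
    X = np.vstack([X, Y])            # append predictions to the input
```

Because each module's hard work reduces to a linear solve, modules can be trained quickly, and the per-module computation parallelizes naturally across data batches.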
[3] exploits sparseness in deep neural networks, zeroing out low-magnitude connections so that most weights are exactly zero. The authors demonstrate this on both a Switchboard and a Voice-Search task, decreasing the number of nonzero connections to one third while building a more general model that reduces word error rate (WER) by 0.2-0.3% absolute compared to the fully connected model on both datasets. Furthermore, the error rate can still match the fully connected model when the nonzero connections are reduced further, to only 12% and 19% on the two respective datasets. Under these conditions, the model size can be reduced to 18% and 29%, and decoding speed improved by 14% and 23%, respectively, on the two datasets.
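The core operation behind such results can be sketched as simple magnitude-based pruning: keep only the largest-magnitude fraction of weights and zero the rest. This is a hypothetical simplification for illustration (in practice the mask would be applied while training continues, so the surviving weights can adapt); the matrix size and keep fraction below are arbitrary.

```python
import numpy as np

def prune_by_magnitude(W, keep_fraction):
    """Zero out the smallest-magnitude weights, keeping only the given
    fraction of connections nonzero (illustrative of the idea in [3])."""
    k = int(W.size * keep_fraction)
    # k-th largest absolute value over the flattened matrix
    threshold = np.sort(np.abs(W), axis=None)[-k]
    mask = np.abs(W) >= threshold
    return W * mask, mask

rng = np.random.default_rng(1)
W = rng.standard_normal((512, 512))
W_sparse, mask = prune_by_magnitude(W, keep_fraction=1 / 3)
# roughly one third of the connections survive
print(mask.mean())
```

Beyond the accuracy effect, a mostly-zero weight matrix can be stored and multiplied in sparse form, which is where the model-size and decoding-speed gains quoted above come from.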
ALTERNATIVE NEURAL NETWORK ARCHITECTURES
[4] introduces a sequential DBN (SDBN) to model long-range dependencies across frames by allowing temporal dependencies between hidden layers. The paper presents pre-training and fine-tuning (backpropagation) derivations for this model. Experiments on TIMIT show that a simple monophone SDBN system compares favorably to a more complex context-dependent GMM/HMM system. In addition, the SDBN has far fewer parameters than a regular DBN; given that DBN training is computationally expensive, reducing parameters is one approach to speed up training.
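The temporal coupling can be pictured with a small forward-pass sketch in which the hidden layer at frame t receives input from both the current observation and the previous frame's hidden layer. This is a hypothetical simplification, not the SDBN itself (the actual model is a graphical model with its own pre-training and fine-tuning derivations); all names and sizes below are assumptions.

```python
import numpy as np

def temporal_hidden_forward(X, W_in, W_time):
    """Sketch of hidden layers coupled across time.

    X: (n_features, n_frames) observations.
    W_in: (n_hidden, n_features) input weights, shared across frames.
    W_time: (n_hidden, n_hidden) frame-to-frame hidden weights."""
    n_hidden, n_frames = W_in.shape[0], X.shape[1]
    H = np.zeros((n_hidden, n_frames))
    h_prev = np.zeros(n_hidden)
    for t in range(n_frames):
        # current frame's input plus the previous frame's hidden state
        pre = W_in @ X[:, t] + W_time @ h_prev
        h_prev = 1.0 / (1.0 + np.exp(-pre))   # sigmoid activation
        H[:, t] = h_prev
    return H

rng = np.random.default_rng(2)
X = rng.standard_normal((13, 100))            # e.g. 13 MFCCs per frame
W_in = 0.1 * rng.standard_normal((30, 13))
W_time = 0.1 * rng.standard_normal((30, 30))
H = temporal_hidden_forward(X, W_in, W_time)  # (30, 100)
```

Note the parameter economy: the same `W_in` and `W_time` are reused at every frame, which is one way a temporally coupled model can get by with far fewer parameters than a wide feed-forward DBN over stacked frames.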
[5] explores applying convolutional neural networks (CNNs), which have generally been explored for computer vision tasks, to speech recognition. A CNN is a type of neural network that models local regions of the input space. This is achieved by applying a set of local filters to small parts of the input; these filters are replicated across all regions of the input space. The idea of local filters is to extract elementary features, such as edges and corners, from different parts of the image. Because different types of features must be extracted, multiple feature maps are used; all units within one feature map perform the same operation on different parts of the input. After a set of local filters processes the input, a max-pooling layer subsamples each feature map, reducing its resolution and the sensitivity of the output to shifts and distortions. A CNN is typically composed of multiple layers that alternate between convolution and pooling. The authors show that the proposed CNN can achieve over 10% relative error reduction on the core TIMIT test set compared with a regular DBN using the same number of hidden layers and weights.
[6] analyzes why the performance of DBNs has been so promising in speech recognition. The authors argue that DBNs perform well for three reasons: DBNs are a type of neural network that can be fine-tuned; DBNs have many non-linear hidden layers; and DBNs are generatively pre-trained, which makes the fine-tuning optimization easier. The paper shows experimentally, through results on the TIMIT phone recognition task, why each of these aspects improves DBN performance. Furthermore, the authors show, through dimensionality-reduced visualizations of the feature vectors learned by the DBNs, that the similarity structure of the feature vectors is preserved at multiple scales, visually illustrating the benefits of DBNs.
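The generative pre-training credited above is typically done layer by layer with Restricted Boltzmann Machines (RBMs) trained by contrastive divergence. Below is a minimal CD-1 sketch for a binary RBM, offered as an illustration of the idea rather than any specific paper's recipe; the layer sizes, learning rate, and batch are arbitrary, and bias terms are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(V, W, lr=0.01):
    """One CD-1 weight update for a binary RBM (biases omitted).

    V: (n_samples, n_visible) batch; W: (n_visible, n_hidden)."""
    # positive phase: hidden probabilities given the data
    Hp = sigmoid(V @ W)
    Hs = (rng.random(Hp.shape) < Hp).astype(float)  # sample hiddens
    # negative phase: one reconstruction step
    Vr = sigmoid(Hs @ W.T)
    Hr = sigmoid(Vr @ W)
    # move toward data statistics, away from the model's reconstruction
    return W + lr * (V.T @ Hp - Vr.T @ Hr) / V.shape[0]

V = (rng.random((64, 30)) < 0.5).astype(float)  # toy binary data
W = 0.01 * rng.standard_normal((30, 20))
for _ in range(10):
    W = cd1_update(V, W)
```

After pre-training, the learned weights initialize the corresponding layer of the deep network, and the whole stack is then fine-tuned discriminatively with backpropagation; the paper's argument is that this initialization is what makes the subsequent optimization easier.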
REFERENCES
[1] F. Seide, G. Li, X. Chen and D. Yu, "Feature Engineering in Context-Dependent Deep Neural Networks for Conversational Speech Transcription," in Proc. ASRU, December 2011.
[2] L. Deng, D. Yu and J. Platt, "Scalable Stacking and Learning for Building Deep Architectures," in Proc. ICASSP, 2012.
[3] D. Yu, F. Seide, G. Li and L. Deng, "Exploiting Sparseness in Deep Neural Networks for Large Vocabulary Speech Recognition," in Proc. ICASSP, 2012.
[4] G. Andrew and J. Bilmes, "Sequential Deep Belief Networks," in Proc. ICASSP, 2012.
[5] O. Abdel-Hamid, A. Mohamed, H. Jiang and G. Penn, "Applying Convolutional Neural Networks Concepts to Hybrid NN-HMM Model for Speech Recognition," in Proc. ICASSP, 2012.
[6] A. Mohamed, G. Hinton and G. Penn, "Understanding How Deep Belief Networks Perform Acoustic Modeling," in Proc. ICASSP, 2012.
If you have comments, corrections, or additions to this article, please contact the author: Tara Sainath, tsainath [at] us [dot] ibm [dot] com.
Tara Sainath is a Research Staff Member at IBM T.J. Watson Research Center in New York. Her research interests are mainly in acoustic modeling. Email: tsainath@us.ibm.com