
Facial Expression Recognition Using Enhanced Deep 3D Convolutional Neural Networks: reading notes on video facial expression recognition

2018-02-27 10:37
Original paper: https://arxiv.org/abs/1705.07871v1
The paper's task is facial expression recognition in videos; each video is divided into sequences of ten frames for processing. From the paper: "we divided videos into sequences of ten frames to shape the input tensor for our network".
Overall architecture: 3DIR (3D Inception-ResNet) + facial landmarks + LSTM. The LSTM handles the temporal information of the sequence.
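As a quick illustration of the input shaping step above, here is a minimal sketch. The frame resolution (112x112 grayscale) and the 30-frame video are my assumptions for the example, not values from the paper; only the split into 10-frame sequences comes from the text.

```python
import numpy as np

# One video as a stack of frames (frame size is an assumed example value).
frames = np.random.rand(30, 112, 112, 1)

# Split into non-overlapping sequences of ten frames each, as the paper
# describes, dropping any trailing remainder.
usable = len(frames) // 10 * 10
seqs = frames[:usable].reshape(-1, 10, 112, 112, 1)

print(seqs.shape)  # three 10-frame sequences
```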
Figure from the paper (overall architecture diagram; not reproduced here).
1. 3DIR
From the paper: "These landmarks are multiplied with the input tensor in the residual module which is replaced with the shortcuts in the traditional residual layer."
The authors experimented with many variants of the plain ResNet structure and, after extensive experiments, settled on 3DIR. From the paper: "This network is the result of investigating several variations of Inception-ResNet module and achieves better recognition rates comparing to our other attempts in several databases."
1.1 What does the 3DIR structure look like?
The shortcut that runs directly from input to output is replaced by a connection to the landmark map, combined via a Hadamard product, i.e. element-wise multiplication of the corresponding entries of the two matrices (the Elem-Mul operation in the paper's figure, not reproduced here).
2. Landmarks
The goal is to emphasize the facial components that contribute most to expression recognition. From the paper: "the main reason we use facial landmarks in our network is to differentiate between the importance of main facial components (such as eyebrows, lip corners, eyes, etc.) and other parts of the face which are less expressive of facial expressions."
2.1 Obtaining the landmarks
OpenCV is used first to extract the face bounding box, and then a face alignment algorithm extracts 66 facial landmark points. From the paper: "OpenCV face recognition is used to obtain bounding boxes of the faces. A face alignment algorithm via regression local binary features [41, 61] was used to extract 66 facial landmark points."
2.2 Using the landmarks during training
Once the landmark map is obtained, it is combined with the feature map coming from the Inception branch via a Hadamard product. Because this is element-wise multiplication, the two maps must match in size, so the landmark map is resized to the corresponding filter size, in step with the Inception pipeline. From the paper: "the facial landmark filters are generated for each sequence automatically during training phase. Given the facial landmarks for each frame of a sequence, we initially resize all of the images in the sequence to their corresponding filter size in the network."
Precisely because the landmark map must be resized to the size of the feature map coming from the Inception branch, 3DIR replaces the plain ResNet structure only in the first two modules: the feature map arriving at the third module is too small for the landmark filter to mark the 66 facial points. From the paper: "We do not incorporate the facial landmarks with the third 3D Inception-ResNet module since the resulting feature map size at this stage becomes very small for calculating facial landmark filter".
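The landmark-modified residual unit described above can be sketched as follows. This is a simplified reading, not the paper's code: shapes, the function names, and the stand-in residual branch are all illustrative assumptions; only the replacement of the identity shortcut by an element-wise product with the landmark filter comes from the paper.

```python
import numpy as np

def landmark_residual_unit(x, landmark_filter, residual_branch):
    """x: input tensor (seq, H, W, C); landmark_filter: per-pixel weights
    (seq, H, W), broadcast over channels; residual_branch: the Inception-style
    convolutional path (stubbed out here)."""
    # Traditional residual layer:      out = x + residual_branch(x)
    # 3DIR variant (per the paper):    out = x * landmark_filter + residual_branch(x)
    return x * landmark_filter[..., np.newaxis] + residual_branch(x)

# Toy usage: a 10-frame sequence of 28x28 feature maps with 4 channels,
# a uniform stand-in landmark weight map, and a zero residual branch.
x = np.ones((10, 28, 28, 4))
lm = np.full((10, 28, 28), 0.5)
out = landmark_residual_unit(x, lm, lambda t: np.zeros_like(t))
print(out.shape)  # (10, 28, 28, 4)
```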
2.3 Assigning weights in the landmark map
The Elem-Mul operation needs concrete weights. The paper assigns a weight to each pixel in the landmark map according to its distance to the detected landmark points: "we assign weights to all of the pixels in a frame of a sequence based on their distances to the detected landmarks. The closer a pixel is to a facial landmark, the greater weight is assigned to that pixel. After investigating several distance measures, we concluded that Manhattan distance with a linear weight function results in a better recognition rate in various databases. The Manhattan distance between two items is the sum of the differences of their corresponding components (in this case two components)."
The paper therefore uses the Manhattan distance with a linear weight function:

dM(L, P) = |Lx - Px| + |Ly - Py|

where dM(L, P) is the Manhattan distance between the facial landmark L and pixel P, the two components being the x and y coordinates.

To avoid ambiguous weight assignment between pixels near closely spaced landmark points, the paper assigns weights only within a 7 x 7 pixel window around each landmark, i.e. to the 49 pixels surrounding each point. From the paper: "In order to avoid overlapping between two adjacent facial landmarks, we define a 7 × 7 window around each facial landmark and apply the weight function for these 49 pixels for each landmark separately."
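A minimal sketch of building such a weight map, under my own assumptions: the exact linear fall-off formula below (weight = 1 - d/7) and the max-combination of overlapping windows are not specified in the paper; only the Manhattan distance, the linear weight function, and the 7x7 window are.

```python
import numpy as np

def landmark_weight_map(landmarks, h, w, win=7):
    """Assign each pixel inside a win x win window around a landmark a weight
    that decreases linearly with its Manhattan distance to that landmark."""
    weights = np.zeros((h, w))
    r = win // 2  # 3 for a 7x7 window
    for (ly, lx) in landmarks:
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                y, x = ly + dy, lx + dx
                if 0 <= y < h and 0 <= x < w:
                    d = abs(dy) + abs(dx)          # Manhattan distance
                    wgt = 1.0 - d / win            # assumed linear fall-off
                    weights[y, x] = max(weights[y, x], wgt)
    return weights

# One landmark at (10, 10) in a 20x20 map; the centre pixel gets the
# largest weight and pixels outside the 7x7 window stay at zero.
wm = landmark_weight_map([(10, 10)], 20, 20)
print(wm[10, 10], wm[10, 13], wm[10, 14])
```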
3. LSTM
Using an LSTM unit makes sense because the feature maps produced by the 3DIR module already carry the temporal information of the sequence. Vectorizing the 3DIR feature map along its sequence dimension therefore provides exactly the sequential input the LSTM needs. The authors found 200 hidden units to be a reasonable size for the FER task. From the paper: "Using the LSTM unit makes perfect sense since the resulted feature map from the 3DIR unit contains the time notion of the sequences within the feature map. Therefore, vectorizing the resulting feature map of 3DIR on its sequence dimension, will provide the required sequenced input for the LSTM unit. We investigated that 200 hidden units for the LSTM unit is a reasonable amount for the task of FER."
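The vectorization step can be sketched with shapes alone. The spatial and channel sizes below are illustrative assumptions; only the idea of flattening everything except the sequence dimension into one feature vector per time step comes from the paper.

```python
import numpy as np

# A 3DIR output volume: (batch, seq, H, W, C). The sequence length of 6
# mirrors the 10 -> 8 -> 6 temporal compression visible in the architecture.
feature_map = np.random.rand(2, 6, 5, 5, 16)

batch, seq = feature_map.shape[:2]
# Flatten each time step's H x W x C block into a single feature vector,
# giving the (batch, time, features) layout an LSTM consumes.
lstm_input = feature_map.reshape(batch, seq, -1)

print(lstm_input.shape)  # (2, 6, 400)
```

An LSTM with 200 hidden units (the paper's choice) would then process this tensor one time step at a time.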
Some implementation details and hyperparameter settings from the paper:
The proposed network was implemented using a combination of TensorFlow and TFlearn toolboxes on NVIDIA Tesla K40 GPUs. In the training phase we used asynchronous stochastic gradient descent with momentum of 0.9, weight decay of 0.0001, and learning rate of 0.01. We used categorical cross entropy as our loss function and accuracy as our evaluation metric.
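A toy illustration of that optimizer configuration (this is not the paper's code; folding weight decay into the gradient as an L2 term is my assumption about the formulation, which was the common one in TensorFlow at the time):

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9, wd=1e-4):
    """One SGD-with-momentum update using the paper's reported settings:
    learning rate 0.01, momentum 0.9, weight decay 0.0001."""
    g = grad + wd * w                      # add the weight-decay (L2) term
    velocity = momentum * velocity - lr * g
    return w + velocity, velocity

# One update on a tiny parameter vector.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
w, v = sgd_momentum_step(w, np.array([0.5, 0.5]), v)
print(w)
```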
4. Face Databases
Four databases are used: MMI [40], CK+ [32], FERA [2], and DISFA [33]. From the paper: "We evaluate our proposed method on MMI [40], extended CK+ [32], GEMEP-FERA [2], and DISFA [33] which contain videos of annotated facial expressions."
5. Examples and results
Two sets of experiments validate the proposed method: subject-independent (within a single database) and cross-database. From the paper: "We evaluate the accuracy of our proposed method with two different sets of experiments: 'subject-independent' and 'cross-database' evaluations."
There is also an ablation comparing the network with and without the auxiliary landmarks: "we also provide the results of our network while the landmark multiplication unit is removed and replaced with a simple shortcut between the input and output of the residual unit."
The paper also reports FER results for the traditional 2D Inception-ResNet. From the paper: "Table 1 also provides the recognition rates of the traditional 2D Inception-ResNet from [20] which does not contain facial landmarks and the LSTM unit (DISFA is not experimented in this study). Comparing the recognition rates of the 3D and 2D Inception-ResNets in Table 1, shows that the sequential processing of facial expressions considerably enhances the recognition rate." This comparison demonstrates the advantage of 3D over 2D.
The experimental results are presented as tables and confusion matrices.
Since the amount of data is limited, the cross-database setting also serves to increase the data available for the experiments. From the paper: "Also, due to the limited number of samples in these databases, it is difficult to properly train a deep neural network and avoid the overfitting problem. For these reasons and in order to have a better understanding about our proposed method, we also experimented the cross-database task."

5.1 Results
The paper's result tables and confusion matrices are not reproduced here.
6. Conclusion
From the paper's conclusion: "we proposed the 3D Inception-ResNet (3DIR) network which extends the well-known 2D Inception-ResNet module for processing image sequences. This additional dimension will result in a volume of feature maps and will extract the spatial relations between frames in a sequence. This module is followed by an LSTM which takes these temporal relations into account and uses this information to classify the sequences. In order to differentiate between facial components and other parts of the face, we incorporated facial landmarks in our proposed method. These landmarks are multiplied with the input tensor in the residual module which is replaced with the shortcuts in the traditional residual layer."
My understanding of "This additional dimension will result in a volume of feature maps and will extract the spatial relations between frames in a sequence" is that, compared with 2D, the extra dimension compresses the frames 10-8-6 (visible in the detailed network structure) and extracts the spatial relations between frames in a sequence. (Please correct me if I have this wrong.)

The paper's code has not been open-sourced; I had hoped to try reproducing it. I will come back and add more as my understanding develops, and corrections are very welcome.