
[Paper Notes] Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition

2018-01-28 17:07


2018-01-28 15:45:13

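Background and motivation:

　　Human action recognition is studied mainly through several modalities: appearance, depth, optical flow, and body skeletons. Among these, dynamic human skeletons usually convey significant information and are complementary to the other modalities. Yet comparatively little recent work has focused on them. We analyze this modality systematically, with the goal of developing a principled and effective method for modeling dynamic skeletons and using it for action recognition.

　　The dynamic skeleton modality is naturally represented as a time series of human joint locations, so recognizing an action amounts to analyzing the motion patterns in that series. Current methods mostly just assemble the joint coordinates into feature vectors and apply temporal analysis thereon. Their capability is limited because they do not explicitly exploit the spatial relationships among the joints, which are crucial for understanding human actions. Some recent methods do take these connections into account, but they rely heavily on hand-crafted parts or rules, which makes them hard to apply to other problems.

　　To overcome these limitations, we need a new method that automatically captures the patterns embedded in the spatial configuration of the joints as well as their temporal dynamics. This is where deep neural networks are strong. However, skeleton data has a graph structure rather than a 2D or 3D grid, so current CNNs cannot process it directly. Recently, graph convolutional networks (GCNs), which generalize CNNs to graphs of arbitrary structure, have received a lot of attention and have been applied widely, for example to image classification, document classification, and semi-supervised learning. But these methods all assume a fixed graph as input. Using GCNs to model dynamic graphs over large-scale datasets, such as human skeleton sequences, has not yet been explored.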

　　

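　　This paper extends GCNs to a spatial-temporal graph model, called ST-GCN. As illustrated in the paper's figure, the model is built on a sequence of skeleton graphs, where each node corresponds to a joint of the human body. There are two kinds of edges: spatial edges and temporal edges.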

　　

　　Contributions of the paper:

　　1). 　We propose ST-GCN, a generic graph-based formulation for modeling dynamic skeletons, which is the first that applies graph-based neural networks for this task.

　　2). 　We propose several principles in designing convolution kernels in ST-GCN to meet the specific demands in skeleton modeling.

　　3). 　On two large scale datasets for skeleton-based action recognition, the proposed model achieves superior performance as compared to previous methods using hand-crafted parts or traversal rules, with considerably less effort in manual design.

　　The code and models of ST-GCN are made publicly available at https://github.com/yysijie/st-gcn.

There are two main families of approaches for generalizing CNNs to graphs:

　　1).　the spectral perspective, where the locality of graph convolution is considered in the form of spectral analysis;

　　2).　the spatial perspective, where the convolution filters are applied directly on the graph nodes and their neighbors.
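　　To make the spatial perspective concrete, here is a minimal NumPy sketch in which each node averages the features of itself and its direct neighbors and then applies a shared weight matrix. This is only an illustration of "filters applied directly on nodes and their neighbors"; the function name `spatial_graph_conv`, the mean aggregation, and the toy graph are assumptions made for this example, not the exact operator defined in the paper.

```python
import numpy as np

def spatial_graph_conv(x, adj, weight):
    """One illustrative spatial graph convolution (assumed form, not the paper's).

    x:      (N, C_in) node features
    adj:    (N, N) binary symmetric adjacency matrix without self-loops
    weight: (C_in, C_out) weight matrix shared by all nodes
    """
    a_hat = adj + np.eye(adj.shape[0])       # include the node itself in its neighborhood
    deg = a_hat.sum(axis=1, keepdims=True)   # neighborhood size of each node
    aggregated = (a_hat / deg) @ x           # mean over the node and its neighbors
    return aggregated @ weight               # shared linear map

# Toy chain graph 0 - 1 - 2 with one feature channel (hypothetical data).
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
x = np.array([[1.0], [2.0], [6.0]])
weight = np.array([[1.0]])

out = spatial_graph_conv(x, adj, weight)
print(out.ravel())
```

　　With an identity weight, each output is simply the neighborhood mean, which makes it easy to check by hand that information only flows along edges.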

Spatial Temporal Graph ConvNet

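1. Pipeline Overview:

　　Skeleton-based data can be obtained from motion-capture devices or from pose estimation algorithms applied to videos. The data is usually a sequence of frames, each holding a set of joint coordinates. Given the sequence of body joints in the form of 2D or 3D coordinates, we construct a spatial temporal graph in which the joints are the graph nodes and the natural connectivities in both the human body structure and time are the graph edges. The input to ST-GCN is therefore the joint coordinate vectors on the graph nodes. This is analogous to image-based CNNs, whose input is formed by pixel intensity vectors residing on the 2D image grid. Multiple layers of spatial-temporal graph convolution are applied to the input data and gradually generate higher-level feature maps on the graph. A standard SoftMax classifier then assigns the action category. The whole model is trained end to end with backpropagation.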
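　　The data flow of this pipeline can be sketched roughly as follows. The tensor layout `(C, T, N)` (channels, frames, joints), the global average pooling before the classifier, the sizes used, and the names `classify` and `softmax` are all assumptions made for illustration; the stacked ST-GCN layers are replaced by an identity placeholder, so this shows only the input format and the SoftMax stage.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def classify(features, w, b):
    """Map a feature map on the graph to class probabilities.

    features: (C, T, N) feature map (channels, frames, joints)
    w:        (num_classes, C) classifier weights
    b:        (num_classes,) classifier bias
    """
    pooled = features.mean(axis=(1, 2))   # assumed global average pooling over time and joints
    logits = w @ pooled + b
    return softmax(logits)

rng = np.random.default_rng(0)
C, T, N, num_classes = 3, 30, 18, 5       # hypothetical sizes
skeleton = rng.normal(size=(C, T, N))     # stand-in for joint coordinate vectors on the nodes

features = skeleton                       # identity placeholder where ST-GCN layers would go

w = rng.normal(size=(num_classes, C))
b = np.zeros(num_classes)
probs = classify(features, w, b)
print(probs)
```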

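2. Skeleton Graph Construction:

　　We construct an undirected spatial temporal graph $ G = (V, E) $ on a skeleton sequence with N joints and T frames, featuring both intra-body and inter-frame connections.

　　In this graph, the node set $ V = \{ v_{ti} \mid t = 1, \ldots, T,\ i = 1, \ldots, N \} $ includes all the joints in a skeleton sequence. As input to ST-GCN, the feature vector $ F(v_{ti}) $ on a node consists of the coordinate vector and the estimation confidence of the i-th joint on frame t. We build the spatial temporal graph on the skeleton sequence in two steps:

　　First, the joints within one frame are connected with edges according to the connectivity of the human body structure;

　　Then, each joint is connected to the same joint in the consecutive frame.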
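　　The two construction steps can be sketched in a few lines of Python. The flat node index `t * num_joints + i`, the function name `build_st_graph_edges`, and the 4-joint toy skeleton are hypothetical choices for this example; a real skeleton layout would supply its own list of bone pairs.

```python
def build_st_graph_edges(num_joints, num_frames, bone_pairs):
    """Return (spatial_edges, temporal_edges) of a spatial temporal skeleton graph.

    Nodes are indexed as t * num_joints + i for joint i on frame t (0-based).
    bone_pairs lists the (i, j) joint pairs that are physically connected.
    """
    def node_id(t, i):
        return t * num_joints + i

    # Step 1: connect joints within each frame according to the body structure.
    spatial_edges = [
        (node_id(t, i), node_id(t, j))
        for t in range(num_frames)
        for (i, j) in bone_pairs
    ]
    # Step 2: connect each joint to the same joint in the next frame.
    temporal_edges = [
        (node_id(t, i), node_id(t + 1, i))
        for t in range(num_frames - 1)
        for i in range(num_joints)
    ]
    return spatial_edges, temporal_edges

# Hypothetical 4-joint skeleton observed over 3 frames.
bones = [(0, 1), (1, 2), (1, 3)]
spatial, temporal = build_st_graph_edges(num_joints=4, num_frames=3, bone_pairs=bones)
print(len(spatial), len(temporal))
```

　　Because the graph is undirected, each pair is stored once; an adjacency matrix built from these lists would be symmetrized.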

　　

　　Formally, the edge set E consists of two subsets:

　　the first subset depicts the intra-skeleton connection at each frame;

　　the second subset contains the inter-frame edges, which connect the same joints in consecutive frames.
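　　Written out in symbols, the two subsets can be expressed as below. Here $ H $ denotes the set of naturally connected joint pairs of the human body; the notes above do not define this symbol, so treat it as notation introduced for this formula. The edge counts in the last line assume every frame contains all N joints.

```latex
E_S = \{\, v_{ti} v_{tj} \mid (i, j) \in H \,\}, \qquad
E_F = \{\, v_{ti} v_{(t+1)i} \,\}, \qquad
E = E_S \cup E_F

|E_S| = T \, |H|, \qquad |E_F| = (T - 1) \, N
```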

　　

3. Spatial Graph Convolutional Neural Network:

　　Before diving into the full-fledged ST-GCN, we first look at the graph CNN model within a single frame. In this case, we

　　
