3D Human Pose Estimation -- Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose
2017-07-19 10:08
Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose
Project and Code: https://www.seas.upenn.edu/~pavlakos/projects/volumetric/
Input: a single color image; output: the 3D human pose. The network is a CNN trained end-to-end. Technical contributions: 1) discretizing 3D space into a voxel grid, 2) coarse-to-fine progressive refinement.
Pipeline overview:
![](http://img.blog.csdn.net/20170719093827826?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvemhhbmdqdW5oaXQ=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/SouthEast)
3.1. Volumetric representation for 3D human pose
For 3D human pose estimation, the problem is commonly defined as follows: the human body has N joints, and each joint has a 3D coordinate (x, y, z).
![](http://img.blog.csdn.net/20170719094320203?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvemhhbmdqdW5oaXQ=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/SouthEast)
The formula above computes the Euclidean distance between the predicted and ground-truth coordinates. Although this formulation is simple and clear, it is a highly non-linear problem that is hard to learn.
Instead, we discretize the 3D space into a grid. For each joint we create a volume of size w×h×d, divided into w×h×d voxels, and let p(i,j,k) denote the predicted likelihood that the joint falls into voxel (i,j,k) of the volume.
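As a concrete illustration of the discretization, the sketch below maps a joint's continuous 3D coordinate into a voxel index of a w×h×d grid. The grid size and the extent of the bounding cube are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical sketch: map a joint's continuous 3D coordinate into a voxel
# index (i, j, k) of a w x h x d grid spanning a cube around the subject.
def coord_to_voxel(xyz, bounds_min, bounds_max, grid=(64, 64, 64)):
    """xyz: (3,) coordinate; bounds_*: (3,) arrays delimiting the volume."""
    xyz = np.asarray(xyz, dtype=float)
    rel = (xyz - bounds_min) / (bounds_max - bounds_min)  # normalize to [0, 1]
    idx = np.floor(rel * np.array(grid)).astype(int)
    return np.clip(idx, 0, np.array(grid) - 1)            # stay inside the grid

# Example: a joint at the cube center falls into the middle voxel.
ijk = coord_to_voxel([0.0, 0.0, 0.0],
                     bounds_min=np.array([-1.0, -1.0, -1.0]),
                     bounds_max=np.array([1.0, 1.0, 1.0]))
print(ijk)  # [32 32 32]
```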
The probability that a joint's ground-truth coordinate (x,y,z) falls into voxel (i,j,k) of the volume is defined as:
![](http://img.blog.csdn.net/20170719095604310?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvemhhbmdqdW5oaXQ=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/SouthEast)
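The per-joint supervision target can be sketched as a 3D Gaussian centered at the ground-truth voxel, in the spirit of 2D heatmap targets extended to a volume. The sigma value below is an illustrative assumption.

```python
import numpy as np

# Hedged sketch: build a per-joint target volume with a 3D Gaussian peaked
# at the ground-truth voxel (sigma is an illustrative choice).
def gaussian_target(center, grid=(64, 64, 64), sigma=2.0):
    i, j, k = np.meshgrid(np.arange(grid[0]), np.arange(grid[1]),
                          np.arange(grid[2]), indexing="ij")
    ci, cj, ck = center
    d2 = (i - ci) ** 2 + (j - cj) ** 2 + (k - ck) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

vol = gaussian_target((32, 32, 32))
print(vol.shape)  # (64, 64, 64), with the peak at the ground-truth voxel
```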
The loss function is defined as:
![](http://img.blog.csdn.net/20170719095654556?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvemhhbmdqdW5oaXQ=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/SouthEast)
Formulating the problem this way makes it easier to solve, and it also lays a good foundation for the coarse-to-fine scheme described next.
A major advantage of the volumetric representation is that it casts the highly non-linear problem of direct 3D coordinate regression to a more manageable form of prediction in a discretized space
3.2. Coarse-to-fine prediction
![](http://img.blog.csdn.net/20170719093935800?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvemhhbmdqdW5oaXQ=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/SouthEast)
Note that the coarse-to-fine scheme here mainly targets the third dimension, depth z: depth is the hardest part, whereas 2D estimation is already fairly mature.
In particular, the first steps are supervised with lower-resolution targets for the (most challenging and technically unobserved) z-dimension. Precisely, we use targets of size 64 × 64 × d per joint, where d typically takes values from the set {1, 2, 4, 8, 16, 32, 64}.
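The depth schedule can be sketched as follows: each supervision step uses a 64 × 64 × d target where only the z-resolution d grows. Downsampling the finest target by max-pooling along z (so the peak location is preserved) is an assumption for illustration, not the paper's exact procedure.

```python
import numpy as np

# Illustrative sketch of coarse-to-fine depth supervision: derive a
# 64 x 64 x d target from the finest 64 x 64 x 64 volume for each d.
def depth_schedule(full_target, depths=(1, 2, 4, 8, 16, 32, 64)):
    """full_target: (64, 64, 64) volume at the finest z-resolution."""
    targets = []
    for d in depths:
        step = full_target.shape[2] // d
        # collapse groups of z-slices; max keeps the peak voxel visible
        coarse = full_target.reshape(64, 64, d, step).max(axis=3)
        targets.append(coarse)
    return targets

full = np.zeros((64, 64, 64))
full[10, 20, 33] = 1.0
for t in depth_schedule(full):
    print(t.shape)  # (64, 64, 1) up through (64, 64, 64)
```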
3.3. Decoupled architecture with volumetric target
In some cases end-to-end training is impossible because 3D ground truth for the joints is unavailable, e.g., when using in-the-wild images. Here, following the 3D Interpreter Network [35], we train in two steps.
predicting 2D keypoint heatmaps, followed by an inference step of the 3D joint positions with our volumetric representation
First predict the 2D keypoint heatmaps, then infer the 3D joint coordinates in the discretized 3D grid.
The first step can be trained with 2D labeled in-the-wild imagery, while the second step requires only 3D data (e.g., MoCap)
Independently, each of these sources is abundantly available.
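At inference time, reading a joint's 3D location back out of a predicted volume reduces to locating the peak voxel and mapping it to metric coordinates. A minimal sketch, assuming a known bounding cube (the extent below is illustrative):

```python
import numpy as np

# Hedged sketch: decode a 3D coordinate from a predicted per-joint volume
# by taking the argmax voxel and mapping its center back to metric space.
def decode_volume(vol, bounds_min, bounds_max):
    grid = np.array(vol.shape, dtype=float)
    ijk = np.array(np.unravel_index(vol.argmax(), vol.shape), dtype=float)
    # voxel center -> normalized [0, 1] -> metric coordinates
    return bounds_min + (ijk + 0.5) / grid * (bounds_max - bounds_min)

vol = np.zeros((64, 64, 64))
vol[32, 16, 48] = 1.0
xyz = decode_volume(vol, np.array([-1.0, -1.0, -1.0]), np.array([1.0, 1.0, 1.0]))
print(xyz)
```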
![](http://img.blog.csdn.net/20170719100445405?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvemhhbmdqdW5oaXQ=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/SouthEast)
Empirical evaluation
![](http://img.blog.csdn.net/20170719100733588?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvemhhbmdqdW5oaXQ=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/SouthEast)
![](http://img.blog.csdn.net/20170719100740222?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvemhhbmdqdW5oaXQ=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/SouthEast)