Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding (EMNLP 2016)
2019-04-17 14:39
Paper link:
https://arxiv.org/pdf/1606.01847v3.pdf
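Overview: The paper's main contribution concerns multimodal fusion. Fusing two features bilinearly (outer product, i.e. Kronecker product) represents their interactions more fully than simpler schemes, but the outer product squares the dimensionality. Compact bilinear pooling compresses the bilinear result into a much lower-dimensional vector, and the paper builds an MCB attention network on top of this pooling method.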
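To make the dimension growth concrete, here is a minimal NumPy sketch comparing the output size of the common fusion schemes. The feature size `n = 64` and the random vectors `x` and `q` are illustrative stand-ins, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64  # arbitrary illustrative feature size (an assumption, not the paper's)
x = rng.standard_normal(n)  # stand-in for a visual feature
q = rng.standard_normal(n)  # stand-in for a question feature

# Each fusion scheme maps the pair (x, q) to a single vector.
fused = {
    "element-wise sum": x + q,
    "element-wise product": x * q,
    "concatenation": np.concatenate([x, q]),
    "bilinear (outer product)": np.outer(x, q).ravel(),
}

# Output dimensionality of each scheme: n, n, 2n, and n**2 respectively.
dims = {name: vec.size for name, vec in fused.items()}
for name, size in dims.items():
    print(f"{name}: {size}")
```

Only the bilinear form grows quadratically with the input size, which is the cost that compact bilinear pooling is meant to avoid.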
Method:
[Figure: network architecture diagram]
Takeaways:
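1. Ways to fuse two modalities include element-wise product, element-wise sum, concatenation, and bilinear pooling.
2. A projection method, Count Sketch: the input vector a has n dimensions and the projected vector b has d dimensions. Two parameter arrays are involved: s gives the weight (a sign, +1 or -1) applied to the i-th element, and h gives the position in the projected vector to which the i-th element is added, so b[h[i]] += a[i]*s[i].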
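3. The sketch of the outer product of two vectors equals the convolution of the two vectors' individual sketches.
4. The convolution of two vectors can be computed with the fast Fourier transform (FFT) instead, as an element-wise product in the frequency domain.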
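5. Combining items 3 and 4: $\Psi(x \otimes q, h, s) = \Psi(x, h, s) * \Psi(q, h, s) = \mathrm{FFT}^{-1}\big(\mathrm{FFT}(\Psi(x, h, s)) \odot \mathrm{FFT}(\Psi(q, h, s))\big)$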
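6. Evaluation follows the VQA evaluation protocol, mainly on the real-image portion.
7. Visual Genome has a more balanced distribution than VQA, and its answers are longer; Visual7W is multiple choice (MC).
8. One concern is that MCB's gain might come merely from the added parameters, so the paper runs a comparison against fully connected (FC) layers with an equal number of parameters.
9. The model works best with two rounds of attention; the idea is close to stacked, residual, CoR, and similar designs.
10. So far, the two inputs to bilinear fusion are always features from different modalities; features of the same modality (e.g. different regions) are not fused bilinearly.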
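The Count Sketch and FFT points in the list above can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the function names (`make_sketch_params`, `count_sketch`, `mcb`, `sketch_of_outer`) and the sizes `n = 32`, `d = 16` are hypothetical choices for illustration. It assumes the standard construction in which the outer product is sketched with position `(h_x[i] + h_q[j]) mod d` and sign `s_x[i] * s_q[j]`, so the convolution involved is circular.

```python
import numpy as np

def make_sketch_params(n, d, rng):
    """Sample the two fixed arrays: h (target position) and s (sign)."""
    h = rng.integers(0, d, size=n)
    s = rng.choice(np.array([-1.0, 1.0]), size=n)
    return h, s

def count_sketch(a, h, s, d):
    """Project a (n-dim) to b (d-dim) via b[h[i]] += s[i] * a[i]."""
    b = np.zeros(d)
    np.add.at(b, h, s * a)  # add.at accumulates when positions repeat
    return b

def mcb(x, q, params_x, params_q, d):
    """Sketch each input, then convolve the sketches through the FFT."""
    fx = np.fft.rfft(count_sketch(x, *params_x, d))
    fq = np.fft.rfft(count_sketch(q, *params_q, d))
    return np.fft.irfft(fx * fq, n=d)  # circular convolution of length d

def sketch_of_outer(x, q, params_x, params_q, d):
    """Reference path: build the full outer product, then sketch it."""
    hx, sx = params_x
    hq, sq = params_q
    h = (hx[:, None] + hq[None, :]) % d  # combined position for entry (i, j)
    s = sx[:, None] * sq[None, :]        # combined sign for entry (i, j)
    out = np.zeros(d)
    np.add.at(out, h.ravel(), (s * np.outer(x, q)).ravel())
    return out

rng = np.random.default_rng(0)
n, d = 32, 16  # illustrative sizes only
x, q = rng.standard_normal(n), rng.standard_normal(n)
px, pq = make_sketch_params(n, d, rng), make_sketch_params(n, d, rng)

fast = mcb(x, q, px, pq, d)              # never forms the n*n outer product
slow = sketch_of_outer(x, q, px, pq, d)  # forms it explicitly
print(np.allclose(fast, slow))
```

The fast path only touches vectors of length n and d, while the reference path materializes all n*n products; the two agree up to floating-point error, which is the identity the list describes.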