VD-BERT: A Unified Vision and Dialog Transformer with BERT

Yue Wang, Shafiq Joty, Michael Lyu, Irwin King, Caiming Xiong, Steven C.H. Hoi

Language Grounding to Vision, Robotics and Beyond (Long Paper)

Gather-2G: Nov 17 (10:00-12:00 UTC)


Abstract: Visual dialog is a challenging vision-language task, where a dialog agent needs to answer a series of questions through reasoning on the image content and dialog history. Prior work has mostly focused on various attention mechanisms to model such intricate interactions. By contrast, in this work, we propose VD-BERT, a simple yet effective framework of unified vision-dialog Transformer that leverages pretrained BERT language models for visual dialog tasks. The model is unified in that (1) it captures all the interactions between the image and the multi-turn dialog using a single-stream Transformer encoder, and (2) it supports both answer ranking and answer generation seamlessly through the same architecture. More crucially, we adapt BERT for the effective fusion of vision and dialog contents via visually grounded training. Without the need for pretraining on external vision-language data, our model yields a new state of the art, achieving the top position in both single-model and ensemble settings (74.54 and 75.35 NDCG scores) on the visual dialog leaderboard. Our code and pretrained models are released at https://github.com/salesforce/VD-BERT.
NOTE: The video may display the authors in a random order. The correct author list is at the top of this page.
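
The abstract describes a single-stream design in which image features and the multi-turn dialog are encoded jointly by one BERT encoder. The sketch below is only an illustration of that idea, not the authors' released implementation (see the GitHub link in the abstract): it projects assumed 2048-dimensional detector region features into BERT's embedding space with a hypothetical linear layer, concatenates them with dialog token embeddings, and runs a standard Hugging Face BertModel over the combined sequence. The feature dimensions, region count, and random placeholder features are all assumptions.

```python
# Illustrative sketch of single-stream vision-dialog fusion (not the VD-BERT code).
# Assumptions: 36 image regions with 2048-d features, Hugging Face `transformers`,
# and random placeholder region features in place of a real object detector.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

# Dialog history + current question as a single token sequence.
dialog = "[CLS] a man riding a horse [SEP] is he wearing a hat ? [SEP]"
tokens = tokenizer(dialog, return_tensors="pt", add_special_tokens=False)
text_embeds = bert.embeddings.word_embeddings(tokens["input_ids"])  # (1, T, 768)

# Placeholder detector output: 36 regions x 2048-d features (assumed shapes).
num_regions, feat_dim = 36, 2048
region_feats = torch.randn(1, num_regions, feat_dim)
img_proj = nn.Linear(feat_dim, bert.config.hidden_size)  # hypothetical projection
img_embeds = img_proj(region_feats)  # (1, 36, 768)

# Single stream: concatenate image and text embeddings, then one joint encoder,
# so self-attention covers all image-dialog interactions.
inputs_embeds = torch.cat([img_embeds, text_embeds], dim=1)
attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)
outputs = bert(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
fused = outputs.last_hidden_state  # cross-modal contextualized representations
```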

Similar Papers

Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding
Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, Jason Baldridge
Where Are You? Localization from Embodied Dialog
Meera Hahn, Jacob Krantz, Dhruv Batra, Devi Parikh, James Rehg, Stefan Lee, Peter Anderson
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, Jingjing Liu
oLMpics - On What Language Model Pre-training Captures
Alon Talmor, Yanai Elazar, Yoav Goldberg, Jonathan Berant