From VQA to VLN: Recent Advances in Vision-and-Language Research

In conjunction with CVPR 2021

June 20th 2021 (9:00 AM - 5:00 PM PDT)

Location: Virtual


CVPR 2021 Tutorial on "From VQA to VLN: Recent Advances in Vision-and-Language Research"

A long-term goal of AI research is to build intelligent agents that can see the rich visual environment around us, communicate this understanding in natural language to humans and other agents, and act in a physical or embodied environment. To this end, research at the nexus of Computer Vision and Natural Language Processing has made tremendous progress: from generating natural language descriptions of images and videos, to answering questions about them, to holding free-form conversations about visual content.

Most recently, Embodied AI, in which embodied agents are trained to perform various tasks from egocentric perception, has attracted a surge of interest within the computer vision, natural language processing, and robotics communities. Vision-and-Language Navigation (VLN) is a fundamental topic in Embodied AI that was proposed by Anderson and Wu et al.

In this tutorial, we will not only cover the latest approaches and principles at the frontier of vision-and-language research, but also present a comprehensive overview of the field of VLN. The tutorial will be a full-day event (9:00 AM to 5:00 PM) with several breaks.

Program (PDT, UTC-7)

Our program is divided into two sub-sessions: (1) Vision-and-Language Pre-training (VLP) and (2) Vision-and-Language Navigation (VLN). A recording of the panel discussion will be available after the tutorial.

Prerecorded Sessions
4min Opening Remarks [Video] Jingjing Liu and Xiaodong He
50min Representations and Training Strategies for VLP [Video] [Slides] Zhe Gan
40min Robustness, Efficiency and Extensions for VLP [Video] [Slides] Linjie Li
40min Video-and-Language Pre-training [Video] [Slides] Luowei Zhou
42min Introduction to VLN [Video] [Slides] Qi Wu
55min Generalizable VLN Methods [Video] [Slides] Xin Eric Wang
58min Forward to Realistic VLN [Video] [Slides] Yoav Artzi and Peter Anderson
15min VLN Summary [Video] [Slides] Qi Wu
Live Session
16:00-17:00 Panel Discussion LIVE on Zoom   [Video] All speakers


Peter Anderson

Google Research

Yoav Artzi

Cornell University

Zhe Gan


Xiaodong He


Linjie Li


Jingjing Liu

Tsinghua University

Xin (Eric) Wang

UC Santa Cruz

Qi Wu

University of Adelaide

Luowei Zhou



Contact the Organizing Committee: