Summer School Series: Lecture 6 by Arsha Nagrani

This final lecture was delivered by Arsha Nagrani, a recent Ph.D. graduate from Oxford University’s VGG group, and an incoming research scientist at Google Research. Her talk was called Multimodality for Video Understanding.

Video Understanding

Videos provide us with far more information than images. "Multimodal" refers to learning from multiple modalities; for video these include time, sound, and speech. Videos are all around us (around 30k new videos are uploaded to YouTube every hour), but they are high-dimensional and difficult to process and annotate.

Complementarity among signals

  • Vision (scene)
  • Sound (content of speech)

Redundancy between signals

Redundant signals help recognize a person: the face and the voice point to the same identity, so redundancy can serve as a useful form of weak supervision. The redundant information comes from background sounds, foreground audio, speaker cues in the speech signal, and the content of speech.

Thus, the best way to exploit the multimodal nature of videos is to work with both their complementarity and their redundancy.

Suitable tasks

Suitable tasks for video understanding are:

  1. Video classification
  • single label per video
  • a virtually infinite number of possible classes
  • ambiguity in the label space
  2. Action recognition: more fine-grained, motion is important, human-centric

It is important to note that labelling actions in videos is extremely expensive and existing models do not generalize well to new domains.

In this context, can we use speech as a form of supervision, for example in narrated video clips and lifestyle vlogs?

Movies

In the general domain of movies, people often speak about their actions; however, the speech is sometimes completely unrelated, which introduces noise. We need to learn when speech matches the action. An example of work in this field is End-to-End Learning of Visual Representations from Uncurated Instructional Videos, which reduces the impact of this noise using the MIL-NCE loss.
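
To make this concrete, below is a minimal sketch of the MIL-NCE idea, assuming dot-product similarities between clip and narration embeddings and that the K narrations temporally closest to each clip form its positive bag; the tensor shapes and names are illustrative, not the paper's exact implementation.

```python
import torch


def mil_nce_loss(video_emb, text_emb):
    """Minimal MIL-NCE sketch.

    video_emb: (B, D) clip embeddings.
    text_emb:  (B, K, D) K candidate narrations per clip (the positive "bag").
    All other clip-narration pairs in the batch act as negatives.
    """
    B, K, D = text_emb.shape
    # Similarity between every clip and every narration in the batch: (B, B*K)
    sim = video_emb @ text_emb.reshape(B * K, D).t()
    sim = sim.reshape(B, B, K)                        # (clip, narration owner, candidate)
    # Positives: the K narrations that belong to the same clip.
    pos = torch.diagonal(sim, dim1=0, dim2=1).t()     # (B, K)
    # MIL-NCE: log-sum-exp over the positive bag vs. over all pairs.
    numerator = torch.logsumexp(pos, dim=1)
    denominator = torch.logsumexp(sim.reshape(B, -1), dim=1)
    return (denominator - numerator).mean()
```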

Can we first train a model to predict actions from speech, and then decide when it should be used for supervision? An interesting discovery Arsha made was to use movie screenplays, which contain both speech segments and scene directions describing actions. Using these:

  • We can obtain speech-action pairs
  • Retrieve speech segments associated with action verbs
  • Train the Speech2Action model to predict the action from speech, with a BERT backbone (movie scripts scraped from IMSDB)
  • Apply it to the closed captions of unlabelled videos
  • Apply it to a large movie corpus

Speech2Action model

The Speech2Action model can recognize even rare actions from speech. Its predictions are then used as weak labels to train a visual classifier on the weakly labelled clips (an S3D-G model trained with a cross-entropy loss).
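
As a rough illustration of the first stage, here is a minimal sketch of a Speech2Action-style text classifier, assuming a HuggingFace BERT backbone; the action vocabulary and the predict_action helper are hypothetical stand-ins for the verb classes mined from IMSDB screenplays, and the predictions are only meaningful after fine-tuning on the mined speech-action pairs.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Hypothetical action vocabulary; the real classes are mined from screenplay scene directions.
ACTIONS = ["open", "run", "drive", "phone", "dance", "shoot", "kiss", "push"]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(ACTIONS)
)

def predict_action(speech_segment: str) -> str:
    """Predict the most likely action label for a single speech segment."""
    inputs = tokenizer(speech_segment, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return ACTIONS[logits.argmax(dim=-1).item()]

# e.g. predict_action("Get in, I'll drive.") should map to "drive" once fine-tuned.
```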

Evaluation is done on the AVA and HMDB-51 (transfer learning) datasets. The model even picks up abstract actions such as "count" and "follow".

Multimodal Complementarity

This refers to fusing information from multiple modalities for video-text retrieval:

  • Finding the video that corresponds to a text query
  • There is more to videos than just actions: objects, scenes, etc.

Supervision:

  • It's not easy to get captions that cover the complete content of a video; captioning is a very subjective task
  • Extremely large datasets are needed

Instead, Arsha relies on expert models trained for different tasks such as object detection, face detection, action recognition, and OCR. These are all applied to the video and their features are extracted. The framework is a joint video-text embedding: a video encoder and a text-query encoder map into a shared embedding space, where the similarity should be high if the video and the query are related. The video encoder needs to be discriminative and retain specific information.
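
A minimal sketch of such a dual-encoder setup, assuming pre-extracted expert features and a single text-query feature per caption; the projection heads and the mean pooling over experts are illustrative simplifications, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Project pre-extracted expert features (object, face, action, OCR, ...)
    and a text-query feature into one shared embedding space."""

    def __init__(self, expert_dims, text_dim, joint_dim=256):
        super().__init__()
        # One projection head per expert, plus one for the text query.
        self.video_heads = nn.ModuleList([nn.Linear(d, joint_dim) for d in expert_dims])
        self.text_head = nn.Linear(text_dim, joint_dim)

    def forward(self, expert_feats, text_feat):
        # expert_feats: list of (B, d_i) tensors; text_feat: (B, text_dim)
        v = torch.stack([h(f) for h, f in zip(self.video_heads, expert_feats)]).mean(0)
        v = F.normalize(v, dim=-1)
        t = F.normalize(self.text_head(text_feat), dim=-1)
        return v @ t.t()   # (B, B) similarity matrix; diagonal holds matching pairs
```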

Collaborative Gating

For each expert, an attention mask is generated by looking at the other experts (Use What You Have: Video Retrieval Using Representations From Collaborative Experts, BMVC 2019); a sketch of the gating and of the ranking loss follows the list below.

  • Trained using bi-directional max margin ranking loss
  • Adding in more experts massively increases performance
  • Main boost is from the object embeddings
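
Below is a minimal sketch of the gating idea together with a bidirectional max-margin ranking loss, assuming all experts have already been projected to a common dimension and that the similarity matrix has matching video-text pairs on its diagonal; module names and shapes are illustrative.

```python
import torch
import torch.nn as nn

class CollaborativeGating(nn.Module):
    """Each expert's features are modulated by a gate (attention mask) predicted
    from a summary of the *other* experts."""

    def __init__(self, dim):
        super().__init__()
        self.project = nn.Linear(dim, dim)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, experts):
        # experts: (B, num_experts, dim), already projected to a common dimension
        gated = []
        for i in range(experts.size(1)):
            others = torch.cat([experts[:, :i], experts[:, i + 1:]], dim=1)
            context = self.project(others).sum(dim=1)          # summarize the other experts
            gated.append(experts[:, i] * self.gate(context))   # element-wise attention mask
        return torch.stack(gated, dim=1)


def max_margin_ranking_loss(sim, margin=0.2):
    """Bidirectional max-margin ranking loss on a (B, B) similarity matrix
    where sim[i, j] = similarity(video_i, text_j) and the diagonal is positive."""
    pos = sim.diag().unsqueeze(1)                      # (B, 1) matching-pair scores
    cost_v2t = (margin + sim - pos).clamp(min=0)       # rank texts for each video (rows)
    cost_t2v = (margin + sim - pos.t()).clamp(min=0)   # rank videos for each text (columns)
    off_diag = 1 - torch.eye(sim.size(0), device=sim.device)
    return ((cost_v2t + cost_t2v) * off_diag).sum() / sim.size(0)
```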

Multi-modal Transformer

Another paper that Arsha discussed was Multi-modal Transformer for Video Retrieval (ECCV 2020). It takes features extracted at different timestamps for each expert and aggregates them into the video embedding: learned expert embeddings and temporal embeddings are added to the features before a transformer attends over them.
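
A rough sketch of that aggregation step, assuming the expert features have already been projected to a common dimension; the transformer size, the learned embeddings, and the mean pooling are illustrative choices, not the exact MMT architecture.

```python
import torch
import torch.nn as nn

class MultiModalTransformerSketch(nn.Module):
    """Add a learned expert embedding and a temporal embedding to each feature,
    then let a transformer encoder attend across experts and time."""

    def __init__(self, num_experts, num_timestamps, dim=256, heads=4, layers=2):
        super().__init__()
        self.expert_emb = nn.Embedding(num_experts, dim)
        self.time_emb = nn.Embedding(num_timestamps, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, feats):
        # feats: (B, num_experts, num_timestamps, dim) pre-projected expert features
        B, E, T, D = feats.shape
        expert_ids = torch.arange(E, device=feats.device)
        time_ids = torch.arange(T, device=feats.device)
        x = (feats
             + self.expert_emb(expert_ids)[None, :, None, :]
             + self.time_emb(time_ids)[None, None, :, :])
        x = self.encoder(x.reshape(B, E * T, D))   # attend across experts and time
        return x.mean(dim=1)                       # pooled video embedding
```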

Conclusion

  • More modalities are better (because of more complementarity)
  • Time: modelling time along with modalities is interesting; some modalities train faster than others
  • Mid fusion is better than late fusion (Attention truly is what you need)
  • Our world is multimodal; it doesn't make sense to work with modalities in isolation
  • Use the redundant and complementary information from vision, audio and speech to massively reduce annotation effort

Open Research Questions:

  1. Extended Temporal Sequences (beyond 10s):
  • Backprop + memory restricts current video architectures to around 64 frames
  • For longer videos we rely on pre-extracted features
  • New datasets are needed to drive innovation
  2. Moving away from supervision: is an upper bound on self-supervision being approached?
  3. The world is multimodal: how do we design good fusion architectures?

Arsha thus concluded a fantastic talk that described the cutting-edge research that her team at Oxford and Google is conducting. It was tremendously insightful and inspirational.
