What can a cook in Italy teach a mechanic in India?
Action Recognition Generalisation Over Scenarios and Locations

Chiara Plizzari Politecnico di Torino
Toby Perrett University of Bristol
Barbara Caputo Politecnico di Torino, IIT
Dima Damen University of Bristol
ICCV 2023
[Paper]
[Project]
[Dataset]
[Video]

Motivation: the same action, for example "cut", can be performed very differently depending on the scenario and the geographical location in which it takes place.






Abstract: We propose and address a new generalisation problem: can a model trained for action recognition successfully classify actions when they are performed within a previously unseen scenario and in a previously unseen location? To answer this question, we introduce the Action Recognition Generalisation Over scenarios and locations dataset (ARGO1M), which contains 1.1M video clips from the large-scale Ego4D dataset, across 10 scenarios and 13 locations. We demonstrate that recognition models struggle to generalise over 10 proposed test splits, each of an unseen scenario in an unseen location. We thus propose CIR, a method to represent each video as a Cross-Instance Reconstruction of videos from other domains. Reconstructions are paired with text narrations to guide the learning of a domain generalisable representation. We provide extensive analysis and ablations on ARGO1M that show CIR outperforms prior domain generalisation works on all test splits.



Dataset: ARGO1M

We introduce the Action Recognition Generalisation Over scenarios and locations dataset (ARGO1M), which contains 1.1M video clips from the large-scale Ego4D dataset, across 10 scenarios and 13 locations. To download ARGO1M, follow three main steps:
1. Sign the EGO4D License Agreement and obtain AWS access credentials;
2. Download the AWS CLI;
3. Download the dataset from the AWS servers (see the sketch after this list).
Full instructions are in ARGO1M.
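As a rough illustration of step 3, the sketch below lists and fetches the dataset objects with boto3 rather than the AWS CLI. The bucket name and prefix are placeholders, not the real S3 location, which is given in the ARGO1M instructions.

```python
# Rough sketch of step 3 using boto3 in place of the AWS CLI, assuming the
# credentials from step 1 are configured (e.g. via `aws configure`).
# Bucket and prefix are placeholders; the real S3 URI is in the ARGO1M docs.
import os
import boto3

s3 = boto3.client("s3")
bucket, prefix = "<argo1m-bucket>", "<argo1m-prefix>/"  # placeholders

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):            # skip folder marker objects
            continue
        dest = os.path.join("ARGO1M", os.path.basename(key))
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        s3.download_file(bucket, key, dest)  # fetch each object locally
```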
Fig: Frequency (log scale) of the 60 classes in ARGO1M across scenarios (top) and locations (bottom); percentages are shown in the legend. Scenarios and locations are linearly scaled within each bar.

Method: CIR

We propose Cross-Instance Reconstruction (CIR) to represent an action as a weighted combination of actions from other scenarios and locations. We propose two reconstructions, each guided by a different objective: the video-text association reconstruction uses text narrations, so that cross-instance reconstructions are associated with the video clip's semantic description; the classification reconstruction is trained to recognise the clip's action class. A minimal sketch of the reconstruction step is shown below.
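The sketch below assumes pre-extracted clip features from the frozen backbone plus trained encoder; the tensor shapes, projection width, and the scaled dot-product form are our assumptions, not necessarily the paper's exact configuration.

```python
# Minimal sketch of cross-instance reconstruction, assuming clip features
# `feats` of shape (B, D) for a batch of B clips. Projection width and the
# scaled dot-product similarity are assumptions for illustration.
import torch


def cross_instance_reconstruction(feats: torch.Tensor,
                                  q_proj: torch.nn.Linear,
                                  k_proj: torch.nn.Linear) -> torch.Tensor:
    """Reconstruct each clip as a weighted sum of the other clips in the batch."""
    q = q_proj(feats)                        # queries Q, shape (B, D')
    k = k_proj(feats)                        # keys K,    shape (B, D')
    sim = q @ k.t() / k.shape[-1] ** 0.5     # pairwise similarities, (B, B)
    # Self-masking: a clip cannot attend to itself, so its reconstruction
    # must be assembled from other instances (scenarios/locations) in the batch.
    sim.fill_diagonal_(float("-inf"))
    weights = sim.softmax(dim=-1)            # reconstruction weights, rows sum to 1
    return weights @ feats                   # reconstructed clips, (B, D)
```

The self-masking step is what forces each clip's representation to be expressible in terms of clips from other domains, which is the core of the generalisation argument.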
In the following, we illustrate our pipeline.
Fig: One clip and its corresponding narration are shown along with the support set of other clips in the batch. Video f(v) and text g(t) embeddings are extracted using trained encoders on top of a frozen model. Cross-entropy Lc and two CIR objectives, Lrt and Lrc, are minimised. For Lrt, query Q and key K projections are learnt for clips in the batch, followed by self-masking. The resulting weights are multiplied by f(v), and the reconstructed video ⊕v is paired with the corresponding narration. For Lrc, ⊕v is classified using the classifier h. At inference, only the video classifier h is used.
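The three objectives named in the figure can likewise be sketched. The symmetric InfoNCE form and the temperature value below are our assumptions for the video-text pairing Lrt, not necessarily the paper's exact formulation.

```python
# Hedged sketch of the three losses in the figure. `v` are original clip
# features f(v), `v_hat` their cross-instance reconstructions, `t` narration
# embeddings g(t), `y` action labels, `h` the classifier. The symmetric
# InfoNCE pairing and the temperature are assumptions.
import torch
import torch.nn.functional as F


def cir_losses(v, v_hat, t, y, h, tau=0.07):
    l_c = F.cross_entropy(h(v), y)        # Lc: classify the original clip
    l_rc = F.cross_entropy(h(v_hat), y)   # Lrc: classify the reconstruction
    # Lrt: pair each reconstruction with its own narration embedding,
    # contrasting against the other narrations in the batch.
    logits = F.normalize(v_hat, dim=-1) @ F.normalize(t, dim=-1).t() / tau
    targets = torch.arange(v.shape[0], device=v.device)
    l_rt = 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
    return l_c + l_rt + l_rc
```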

CIR visualisation

We visualise training-set instances reconstructed by CIR together with their top-5 support sets. On the top left, we show the query video clip, along with its narration (top of the video), scenario (icon on the top right of the video), and location (pin on the top-right map). On the bottom row, we show the j-th support video clip, along with its narration (top of the video), scenario (icon below), and location (pin on the map).

Full video

YouTube

Bibtex

@inproceedings{Plizzari2023,
  title={What can a cook in Italy teach a mechanic in India? Action Recognition Generalisation Over Scenarios and Locations},
  author={Plizzari, Chiara and Perrett, Toby and Caputo, Barbara and Damen, Dima},
  booktitle={International Conference on Computer Vision (ICCV)},
  year={2023}
}



Acknowledgements
Research at Bristol is supported by EPSRC Fellowship UMPIRE (EP/T004991/1) and EPSRC Program Grant Visual AI (EP/T028572/1). This project acknowledges the use of the University of Bristol's Blue Crystal 4 (BC4) HPC facilities. We also acknowledge travel support from ELISE (GA no 951847).

Original Website Template