Dimitri Zhukov

Reconstructing and grounding narrated instructional videos in 3D

Sep 10, 2021

Dimitri Zhukov, Ignacio Rocco, Ivan Laptev, Josef Sivic, Johannes L. Schönberger, Bugra Tekin, Marc Pollefeys

Abstract: Narrated instructional videos often show and describe manipulations of similar objects, e.g., repairing a particular model of a car or laptop. In this work we aim to reconstruct such objects and to localize associated narrations in 3D. Contrary to the standard scenario of instance-level 3D reconstruction, where identical objects or scenes are present in all views, objects in different instructional videos may have large appearance variations given varying conditions and versions of the same product. Narrations may also have large variation in natural language expressions. We address these challenges with three contributions. First, we propose an approach for correspondence estimation combining learnt local features and dense flow. Second, we design a two-step divide-and-conquer reconstruction approach where the initial 3D reconstructions of individual videos are combined into a 3D alignment graph. Finally, we propose an unsupervised approach to ground natural language in the obtained 3D reconstructions. We demonstrate the effectiveness of our approach for the domain of car maintenance. Given raw instructional videos and no manual supervision, our method successfully reconstructs engines of different car models and associates textual descriptions with corresponding objects in 3D.

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Jul 31, 2019

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic

Abstract: Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time-consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models will be publicly available at: www.di.ens.fr/willow/research/howto100m/.

* Accepted at ICCV 2019

Cross-task weakly supervised learning from instructional videos

Mar 19, 2019

Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, Josef Sivic

Abstract: In this paper we investigate learning visual models for the steps of ordinary tasks using weak supervision via instructional narrations and an ordered list of steps instead of strong supervision via temporal annotations. At the heart of our approach is the observation that weakly supervised learning may be easier if a model shares components while learning different steps: 'pour egg' should be trained jointly with other tasks involving 'pour' and 'egg'. We formalize this in a component model for recognizing steps and a weakly supervised learning framework that can learn this model under temporal constraints from narration and the list of steps. Past data does not permit a systematic study of sharing, so we also gather a new dataset, CrossTask, aimed at assessing cross-task sharing. Our experiments demonstrate that sharing across tasks improves performance, especially when done at the component level, and that our component model can parse previously unseen tasks by virtue of its compositionality.

* 10 pages, 7 figures, to be published in the proceedings of CVPR 2019
