Guillaume Le Moing

Scaling 4D Representations

Dec 19, 2024

João Carreira, Dilara Gokay, Michael King, Chuhan Zhang, Ignacio Rocco, Aravindh Mahendran, Thomas Albert Keck, Joseph Heyward, Skanda Koppula, Etienne Pot(+25 more)

Abstract: Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused evaluations on semantic-related tasks – action classification, ImageNet classification, etc. In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that by learning from very large video datasets, masked auto-encoding (MAE) with transformer video models actually scales, consistently improving performance on these 4D tasks, as model size increases from 20M all the way to by far the largest reported self-supervised video model – 22B parameters. Rigorous apples-to-apples comparison with many recent image and video models demonstrates the benefits of scaling 4D representations.
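As a rough illustration of the masking step in masked auto-encoding, the sketch below splits token indices into a visible set (fed to the encoder) and a masked set (to be reconstructed). The function name, the 75% masking ratio in the usage note, and the token count are assumptions for illustration; the abstract does not state the values used in the paper.

```python
import random


def random_token_mask(num_tokens, mask_ratio, seed=0):
    """Split token indices into (visible, masked) lists, MAE-style.

    Hypothetical helper: the ratio and seed are illustrative, not taken
    from the paper.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_tokens))
    rng.shuffle(indices)
    num_masked = int(num_tokens * mask_ratio)
    # The first num_masked shuffled indices are hidden from the encoder.
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked
```

Calling `random_token_mask(16, 0.75)` hides 12 of 16 tokens and leaves 4 visible; a video model would apply the same idea to space-time patches.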

Dense Optical Tracking: Connecting the Dots

Dec 07, 2023

Guillaume Le Moing, Jean Ponce, Cordelia Schmid

Abstract: Recent approaches to point tracking are able to recover the trajectory of any scene point through a large portion of a video despite the presence of occlusions. They are, however, too slow in practice to track every point observed in a single frame in a reasonable amount of time. This paper introduces DOT, a novel, simple and efficient method for solving this problem. It first extracts a small set of tracks from key regions at motion boundaries using an off-the-shelf point tracking algorithm. Given source and target frames, DOT then computes rough initial estimates of a dense flow field and visibility mask through nearest-neighbor interpolation, before refining them using a learnable optical flow estimator that explicitly handles occlusions and can be trained on synthetic data with ground-truth correspondences. We show that DOT is significantly more accurate than current optical flow techniques, outperforms sophisticated "universal" trackers like OmniMotion, and is on par with, or better than, the best point tracking algorithms like CoTracker while being at least two orders of magnitude faster. Quantitative and qualitative experiments with synthetic and real videos validate the promise of the proposed approach. Code, data, and videos showcasing the capabilities of our approach are available on the project webpage: https://16lemoing.github.io/dot.
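The nearest-neighbor initialization described in the abstract can be sketched as follows: every pixel copies the displacement and visibility flag of the closest sparse track. This is a brute-force toy version under assumed data layouts (the tuple format and the function name are hypothetical), not the authors' implementation, and it omits the learned refinement stage entirely.

```python
def densify_tracks(height, width, tracks):
    """Nearest-neighbor interpolation of sparse tracks into dense maps.

    tracks: list of (x_src, y_src, x_tgt, y_tgt, visible) tuples
            (an assumed format for this sketch).
    Returns (flow, visibility), both indexed as [y][x].
    """
    if not tracks:
        raise ValueError("at least one track is required")
    flow = [[(0.0, 0.0)] * width for _ in range(height)]
    visibility = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Closest track in the source frame (squared Euclidean distance).
            nearest = min(
                tracks, key=lambda t: (t[0] - x) ** 2 + (t[1] - y) ** 2
            )
            flow[y][x] = (nearest[2] - nearest[0], nearest[3] - nearest[1])
            visibility[y][x] = nearest[4]
    return flow, visibility
```

The loop is O(pixels × tracks), which is fine for a demonstration; a practical version would use a spatial index or a vectorized distance computation.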

WALDO: Future Video Synthesis using Object Layer Decomposition and Parametric Flow Prediction

Nov 25, 2022

Guillaume Le Moing, Jean Ponce, Cordelia Schmid

Abstract: This paper presents WALDO (WArping Layer-Decomposed Objects), a novel approach to the prediction of future video frames from past ones. Individual images are decomposed into multiple layers combining object masks and a small set of control points. The layer structure is shared across all frames in each video to build dense inter-frame connections. Complex scene motions are modeled by combining parametric geometric transformations associated with individual layers, and video synthesis is broken down into discovering the layers associated with past frames, predicting the corresponding transformations for upcoming ones and warping the associated object regions accordingly, and filling in the remaining image parts. Extensive experiments on the Cityscapes (resp. KITTI) dataset show that WALDO significantly outperforms prior works with, e.g., 3, 27, and 51% (resp. 5, 20, and 11%) relative improvement in SSIM, LPIPS, and FVD metrics. Code, pretrained models, and video samples synthesized by our approach can be found on the project webpage: https://16lemoing.github.io/waldo.
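To make the "relative improvement" figures concrete, here is a minimal sketch of one common way to compute them. It assumes the usual convention that SSIM is higher-is-better while LPIPS and FVD are lower-is-better; the abstract does not spell out the exact formula, and the numbers in the usage note are made up for illustration.

```python
def relative_improvement(baseline, ours, higher_is_better):
    """Relative gain over a baseline, as a fraction of the baseline score.

    Assumed convention for this sketch: gain is positive when `ours`
    is better than `baseline` in the metric's preferred direction.
    """
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    delta = (ours - baseline) if higher_is_better else (baseline - ours)
    return delta / baseline
```

For example, a hypothetical lower-is-better score dropping from 100.0 to 49.0 gives `relative_improvement(100.0, 49.0, False)`, i.e. a 51% relative improvement.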

CCVS: Context-aware Controllable Video Synthesis

Jul 16, 2021

Guillaume Le Moing, Jean Ponce, Cordelia Schmid

Abstract: This presentation introduces a self-supervised learning approach to the synthesis of new video clips from old ones, with several new key elements for improved spatial resolution and realism: It conditions the synthesis process on contextual information for temporal continuity and ancillary information for fine control. The prediction model is doubly autoregressive, in the latent space of an autoencoder for forecasting, and in image space for updating contextual information, which is also used to enforce spatio-temporal consistency through a learnable optical flow module. Adversarial training of the autoencoder in the appearance and temporal domains is used to further improve the realism of its output. A quantizer inserted between the encoder and the transformer in charge of forecasting future frames in latent space (and its inverse inserted between the transformer and the decoder) adds even more flexibility by affording simple mechanisms for handling multimodal ancillary information for controlling the synthesis process (e.g., a few sample frames, an audio track, a trajectory in image space) and taking into account the intrinsically uncertain nature of the future by allowing multiple predictions. Experiments with an implementation of the proposed approach give very good qualitative and quantitative results on multiple tasks and standard benchmarks.
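The quantizer and its inverse can be pictured as a codebook lookup: a latent vector is mapped to the index of its nearest codebook entry, and the inverse maps the index back to that entry. The sketch below is a generic vector-quantization toy with hypothetical names and a hand-written codebook; the abstract does not describe how the actual quantizer is built or trained.

```python
def quantize(vector, codebook):
    """Return the index of the codebook entry closest to `vector`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def dequantize(index, codebook):
    """Inverse step: map a discrete token back to its latent vector."""
    return codebook[index]
```

With `codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]`, the vector `(0.9, 1.2)` maps to index 1, and a sequence model would then operate on such discrete indices rather than on continuous latents.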

Semantic Palette: Guiding Scene Generation with Class Proportions

Jun 03, 2021

Guillaume Le Moing, Tuan-Hung Vu, Himalaya Jain, Patrick Pérez, Matthieu Cord

Abstract: Despite the recent progress of generative adversarial networks (GANs) at synthesizing photo-realistic images, producing complex urban scenes remains a challenging problem. Previous works break down scene generation into two consecutive phases: unconditional semantic layout synthesis and image synthesis conditioned on layouts. In this work, we propose to condition layout generation as well for higher semantic control: given a vector of class proportions, we generate layouts with matching composition. To this end, we introduce a conditional framework with novel architecture designs and learning objectives, which effectively accommodates class proportions to guide the scene generation process. The proposed architecture also allows partial layout editing with interesting applications. Thanks to the semantic control, we can produce layouts close to the real distribution, helping enhance the whole scene generation process. On different metrics and urban scene benchmarks, our models outperform existing baselines. Moreover, we demonstrate the merit of our approach for data augmentation: semantic segmenters trained on real layout-image pairs along with additional ones generated by our approach outperform models only trained on real pairs.
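The conditioning signal here is a vector of class proportions. As a small sketch, such a vector can be read off an existing semantic layout by counting pixels per class. The layout representation (a nested list of integer labels) and the function name are assumptions for illustration.

```python
from collections import Counter


def class_proportions(layout, num_classes):
    """Fraction of pixels belonging to each class in a label map.

    layout: 2D list of integer class labels in [0, num_classes)
            (an assumed representation for this sketch).
    """
    counts = Counter(label for row in layout for label in row)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("layout must contain at least one pixel")
    return [counts.get(c, 0) / total for c in range(num_classes)]
```

For the toy layout `[[0, 0, 1], [2, 2, 2]]` with four classes, the proportions are one third, one sixth, one half, and zero; a generated layout "matches" a target vector when its own proportions are close to it.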

* Accepted to IEEE CVPR 2021

Data-Efficient Framework for Real-world Multiple Sound Source 2D Localization

Dec 10, 2020

Guillaume Le Moing, Phongtharin Vinayavekhin, Don Joven Agravante, Tadanobu Inoue, Jayakorn Vongkulbhisal, Asim Munawar, Ryuki Tachibana

Abstract: Deep neural networks have recently led to promising results for the task of multiple sound source localization. Yet, they require a lot of training data to cover a variety of acoustic conditions and microphone array layouts. One can leverage acoustic simulators to inexpensively generate labeled training data. However, models trained on synthetic data tend to perform poorly with real-world recordings due to the domain mismatch. Moreover, learning for different microphone array layouts makes the task more complicated due to the infinite number of possible layouts. We propose to use adversarial learning methods to close the gap between synthetic and real domains. Our novel ensemble-discrimination method significantly improves the localization performance without requiring any label from the real data. Furthermore, we propose a novel explicit transformation layer to be embedded in the localization architecture. It enables the model to be trained with data from specific microphone array layouts while generalizing well to unseen layouts during inference.

* Submitted to IEEE ICASSP 2021

Ensemble of Discriminators for Domain Adaptation in Multiple Sound Source 2D Localization

Dec 10, 2020

Guillaume Le Moing, Don Joven Agravante, Tadanobu Inoue, Jayakorn Vongkulbhisal, Asim Munawar, Ryuki Tachibana, Phongtharin Vinayavekhin

Abstract: This paper introduces an ensemble of discriminators that improves the accuracy of a domain adaptation technique for the localization of multiple sound sources. Recently, deep neural networks have led to promising results for this task, yet they require a large amount of labeled data for training. Recording and labeling such datasets is very costly, especially because data needs to be diverse enough to cover different acoustic conditions. In this paper, we leverage acoustic simulators to inexpensively generate labeled training samples. However, models trained on synthetic data tend to perform poorly with real-world recordings due to the domain mismatch. To address this, we explore two domain adaptation methods using adversarial learning for sound source localization, which use labeled synthetic data and unlabeled real data. We propose a novel ensemble approach that combines discriminators applied at different feature levels of the localization model. Experiments show that our ensemble discrimination method significantly improves the localization performance without requiring any label from the real data.

* arXiv admin note: substantial text overlap with arXiv:2012.05533

Learning Multiple Sound Source 2D Localization

Dec 10, 2020

Guillaume Le Moing, Phongtharin Vinayavekhin, Tadanobu Inoue, Jayakorn Vongkulbhisal, Asim Munawar, Ryuki Tachibana, Don Joven Agravante

Abstract: In this paper, we propose novel deep-learning-based algorithms for multiple sound source localization. Specifically, we aim to find the 2D Cartesian coordinates of multiple sound sources in an enclosed environment by using multiple microphone arrays. To this end, we use an encoding-decoding architecture and propose two improvements on it to accomplish the task. In addition, we also propose two novel localization representations which increase the accuracy. Lastly, new metrics are developed relying on resolution-based multiple source association, which enables us to evaluate and compare different localization approaches. We tested our method on both synthetic and real-world data. The results show that our method improves upon the previous baseline approach for this problem.
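One plausible reading of "resolution-based multiple source association" is that a predicted 2D position counts as a hit only if it lies within a chosen resolution (distance threshold) of a ground-truth source, with each prediction used at most once. The greedy matcher below is a sketch under that assumption; the function name, the greedy strategy, and the threshold are hypothetical, and the paper's actual metric may differ.

```python
import math


def associate_sources(predicted, ground_truth, resolution):
    """Greedily pair each ground-truth source with its nearest unused
    prediction, accepting only pairs within `resolution` distance.

    predicted, ground_truth: lists of (x, y) coordinates.
    Returns a list of (ground_truth_index, predicted_index) pairs.
    """
    unmatched = list(range(len(predicted)))
    pairs = []
    for g, (gx, gy) in enumerate(ground_truth):
        best, best_dist = None, resolution
        for p in unmatched:
            d = math.hypot(predicted[p][0] - gx, predicted[p][1] - gy)
            if d <= best_dist:
                best, best_dist = p, d
        if best is not None:
            unmatched.remove(best)
            pairs.append((g, best))
    return pairs
```

From the returned pairs, precision is `len(pairs) / len(predicted)` and recall is `len(pairs) / len(ground_truth)`, so methods can be compared at a fixed resolution.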

* Published in: 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP)
