Recent endeavors in video temporal grounding enforce strong cross-modal interactions through attention mechanisms to overcome the modality gap between video and text query. However, previous works treat all video clips equally in their attention modules regardless of semantic relevance to the text query. In this paper, our goal is to provide clues for query-associated video clips within the cross-modal encoding process. With our Correlation-Guided Detection Transformer~(CG-DETR), we explore the appropriate clip-wise degree of cross-modal interaction and how to exploit such degrees for prediction. First, we design an adaptive cross-attention layer with dummy tokens. Dummy tokens conditioned on the text query take a portion of the attention weights, preventing irrelevant video clips from being represented by the text query. Yet, not all word tokens equally inherit the text query's correlation to video clips. Thus, we further guide the cross-attention map by inferring the fine-grained correlation between video clips and words. We enable this by learning a joint embedding space for high-level concepts, i.e., at the moment and sentence levels, and inferring the clip-word correlation from it. Lastly, we use a moment-adaptive saliency detector to exploit each video clip's degree of text engagement. We validate the superiority of CG-DETR with state-of-the-art results on various benchmarks for both moment retrieval and highlight detection. Code is available at https://github.com/wjun0830/CGDETR.
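The dummy-token idea can be illustrated with a minimal cross-attention sketch. The snippet below is an assumption-laden illustration rather than the released implementation: the class name `DummyTokenCrossAttention`, the number of dummy tokens, the sentence-pooled conditioning, and the choice to let dummies act only as extra keys are all hypothetical; please refer to the repository above for the actual code.

```python
import torch
import torch.nn as nn


class DummyTokenCrossAttention(nn.Module):
    """Cross-attention from video clips (queries) to text words (keys/values),
    augmented with learnable dummy tokens conditioned on the text query.
    The dummy columns can absorb a share of the attention weight, so clips
    with little relevance to the text are not forced to be expressed purely
    by word tokens. Minimal single-head sketch, not the authors' exact code.
    """

    def __init__(self, dim: int, num_dummies: int = 3):
        super().__init__()
        self.dim = dim
        # Learnable dummy embeddings, later shifted by the pooled text feature.
        self.dummy_tokens = nn.Parameter(torch.randn(num_dummies, dim))
        self.condition = nn.Linear(dim, dim)
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, clips: torch.Tensor, words: torch.Tensor):
        # clips: (B, L_v, D) video clip features; words: (B, L_t, D) word features.
        # Condition the dummy tokens on a sentence-level (mean-pooled) text feature.
        sent = words.mean(dim=1, keepdim=True)                            # (B, 1, D)
        dummies = self.dummy_tokens.unsqueeze(0) + self.condition(sent)   # (B, N_d, D)

        q = self.q_proj(clips)                                            # (B, L_v, D)
        k = self.k_proj(torch.cat([words, dummies], dim=1))               # (B, L_t+N_d, D)
        v = self.v_proj(words)                                            # (B, L_t, D)

        # Softmax over words AND dummies: dummy columns soak up attention mass
        # from text-irrelevant clips, lowering the weight placed on real words.
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.dim ** 0.5, dim=-1)

        # Only the word columns contribute text semantics to the clip outputs.
        out = attn[..., : words.size(1)] @ v                              # (B, L_v, D)
        return out, attn
```

In this sketch the output aggregates only word values, so a clip that routes most of its attention to the dummy columns receives little text information; how the dummy values themselves are used is a design choice left to the actual implementation.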