Title: i-SRT: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective Judgment

Jun 17, 2024

Daechul Ahn, Yura Choi, San Kim, Youngjae Yu, Dongyeop Kang, Jonghyun Choi

Abstract: Aligning Video Large Multimodal Models (VLMMs) faces challenges such as modality misalignment and verbose responses. Although iterative approaches such as self-rewarding or iterative direct preference optimization (DPO) have recently shown significant improvements in language model alignment, particularly on reasoning tasks, self-aligned models applied to large video-language models often produce lengthy and irrelevant responses. To address these challenges, we propose a novel method that employs self-retrospection to enhance both response generation and preference modeling, which we call iterative self-retrospective judgment (i-SRT). By revisiting and evaluating already generated content and preferences in a loop, i-SRT improves the alignment between textual and visual modalities, reduces verbosity, and enhances content relevance. Our empirical evaluations across diverse video question answering benchmarks demonstrate that i-SRT significantly outperforms prior art. We are committed to open-sourcing our code, models, and datasets to encourage further investigation.

* Technical report
