Yunzhuo Sun

Mamba-VMR: Multimodal Query Augmentation via Generated Videos for Precise Temporal Grounding

Mar 23, 2026

Yunzhuo Sun, Xinyue Liu, Yanyang Li, Nanding Wu, Yifang Xu, Linlin Zong, Xianchao Zhang, Wenxin Liang

Abstract: Text-driven video moment retrieval (VMR) remains challenging due to limited capture of hidden temporal dynamics in untrimmed videos, leading to imprecise grounding in long sequences. Traditional methods rely on natural language queries (NLQs) or static image augmentations, overlooking motion sequences and suffering from high computational costs in Transformer-based architectures. Existing approaches fail to integrate subtitle contexts and generated temporal priors effectively. We therefore propose a novel two-stage framework for enhanced temporal grounding. In the first stage, LLM-guided subtitle matching identifies relevant textual cues from video subtitles; these cues are fused with the query to generate auxiliary short videos via text-to-video models, capturing implicit motion information as temporal priors. In the second stage, the augmented queries are processed through a multi-modal controlled Mamba network, which extends text-controlled selection with video-guided gating to fuse generated priors and long sequences efficiently while filtering noise. Our framework is agnostic to base retrieval models and widely applicable for multimodal VMR. Experimental evaluations on the TVR benchmark demonstrate significant improvements over state-of-the-art methods, including reduced computational overhead and higher recall in long-sequence grounding.

* Accepted by CVPR 2026

HiFi-Portrait: Zero-shot Identity-preserved Portrait Generation with High-fidelity Multi-face Fusion

Dec 16, 2025

Yifang Xu, Benxiang Zhai, Yunzhuo Sun, Ming Li, Yang Li, Sidan Du

Abstract: Diffusion-based techniques have made significant strides, particularly in identity-preserved portrait generation (IPG). However, when using multiple reference images of the same ID, existing methods typically produce lower-fidelity portraits and struggle to customize face attributes precisely. To address these issues, this paper presents HiFi-Portrait, a high-fidelity method for zero-shot portrait generation. Specifically, we first introduce the face refiner and landmark generator to obtain fine-grained multi-face features and 3D-aware face landmarks. The landmarks include the reference ID and the target attributes. Then, we design HiFi-Net to fuse multi-face features and align them with landmarks, which improves ID fidelity and face control. In addition, we devise an automated pipeline to construct an ID-based dataset for training HiFi-Portrait. Extensive experimental results demonstrate that our method surpasses the SOTA approaches in face similarity and controllability. Furthermore, our method is compatible with previous SDXL-based works.

* Accepted by CVPR 2025

Multi-modal Fusion and Query Refinement Network for Video Moment Retrieval and Highlight Detection

Jan 18, 2025

Yifang Xu, Yunzhuo Sun, Benxiang Zhai, Zien Xie, Youyao Jia, Sidan Du

Abstract: Given a video and a linguistic query, video moment retrieval and highlight detection (MR&HD) aim to locate all the relevant spans while simultaneously predicting saliency scores. Most existing methods utilize RGB images as input, overlooking inherent multi-modal visual signals such as optical flow and depth. In this paper, we propose a Multi-modal Fusion and Query Refinement Network (MRNet) to learn complementary information from multi-modal cues. Specifically, we design a multi-modal fusion module to dynamically combine RGB, optical flow, and depth maps. Furthermore, to simulate human understanding of sentences, we introduce a query refinement module that merges text at different granularities, covering the word, phrase, and sentence levels. Comprehensive experiments on QVHighlights and Charades datasets indicate that MRNet outperforms current state-of-the-art methods, achieving notable improvements in MR-mAP@Avg (+3.41) and HD-HIT@1 (+3.46) on QVHighlights.

* Accepted by ICME 2024
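
The abstract does not say how the fusion module weights RGB, optical flow, and depth. As a rough illustration only, one common way to "dynamically combine" modalities is a softmax-weighted sum, where each modality gets a relevance score and the scores are normalized into mixing weights. Everything in the sketch below (the function names, the `relevance` scores, the toy feature vectors) is a hypothetical stand-in and not MRNet's actual design.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(
    features: Dict[str, List[float]],
    relevance: Dict[str, float],
) -> List[float]:
    """Weighted sum of per-modality feature vectors.

    `features` maps a modality name to its feature vector (all the same length).
    `relevance` maps the same names to a scalar score; in a trained model this
    would be predicted from the input, here it is supplied by hand (assumption).
    """
    names = sorted(features)
    weights = softmax([relevance[name] for name in names])
    dim = len(features[names[0]])
    fused = [0.0] * dim
    for name, weight in zip(names, weights):
        for j, value in enumerate(features[name]):
            fused[j] += weight * value
    return fused


# Toy example: three modalities with equal relevance reduce to a plain average.
toy_features = {"rgb": [1.0, 0.0], "flow": [0.0, 1.0], "depth": [1.0, 1.0]}
toy_relevance = {"rgb": 0.0, "flow": 0.0, "depth": 0.0}
fused_equal = fuse_modalities(toy_features, toy_relevance)
```

Raising one modality's relevance score shifts the fused vector toward that modality, which is the sense in which such a combination is input-dependent rather than fixed.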

Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models

Jan 14, 2025

Yifang Xu, Yunzhuo Sun, Benxiang Zhai, Ming Li, Wenxin Liang, Yang Li, Sidan Du

Abstract: The goal of video moment retrieval (VMR) is to predict temporal spans within a video that semantically match a given linguistic query. Existing VMR methods based on multimodal large language models (MLLMs) rely heavily on expensive high-quality datasets and time-consuming fine-tuning. Although some recent studies introduce a zero-shot setting to avoid fine-tuning, they overlook inherent language bias in the query, leading to erroneous localization. To tackle these challenges, this paper proposes Moment-GPT, a tuning-free pipeline for zero-shot VMR utilizing frozen MLLMs. Specifically, we first employ LLaMA-3 to correct and rephrase the query to mitigate language bias. Subsequently, we design a span generator combined with MiniGPT-v2 to produce candidate spans adaptively. Finally, to leverage the video comprehension capabilities of MLLMs, we apply VideoChatGPT and a span scorer to select the most appropriate spans. Our proposed method substantially outperforms the state-of-the-art MLLM-based and zero-shot models on several public datasets, including QVHighlights, ActivityNet-Captions, and Charades-STA.

* Accepted by AAAI 2025

GPTSee: Enhancing Moment Retrieval and Highlight Detection via Description-Based Similarity Features

Mar 10, 2024

Yunzhuo Sun, Yifang Xu, Zien Xie, Yukun Shu, Sidan Du

Abstract: Moment retrieval (MR) and highlight detection (HD) aim to identify relevant moments and highlights in a video from a corresponding natural language query. Large language models (LLMs) have demonstrated proficiency in various computer vision tasks. However, existing methods for MR&HD have not yet been integrated with LLMs. In this letter, we propose a novel two-stage model that takes the output of LLMs as the input to the second-stage transformer encoder-decoder. First, MiniGPT-4 is employed to generate a detailed description of each video frame and to rewrite the query statement; both are fed into the encoder as new features. Then, semantic similarity is computed between the generated descriptions and the rewritten queries. Finally, continuous high-similarity video frames are converted into span anchors, serving as prior position information for the decoder. Experiments demonstrate that our approach achieves a state-of-the-art result and that, using only span anchors and similarity scores as outputs, its positioning accuracy outperforms traditional methods such as Moment-DETR.

* 5 pages, 3 figures
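
The step that turns continuous high-similarity frames into span anchors can be pictured as grouping consecutive above-threshold scores into intervals. The sketch below is a minimal illustration under stated assumptions, not the GPTSee implementation: the threshold, the minimum run length, the seconds-per-frame value, and the function name are all hypothetical choices that the abstract does not specify.

```python
from typing import List, Tuple


def similarity_to_span_anchors(
    scores: List[float],
    threshold: float = 0.5,        # assumed cutoff for "high similarity"
    min_len: int = 2,              # assumed minimum run length, in frames
    seconds_per_frame: float = 2.0,  # assumed sampling interval
) -> List[Tuple[float, float]]:
    """Group consecutive frames whose score is >= threshold into (start, end) spans.

    Spans are returned in seconds, with the end time exclusive of the next frame.
    """
    anchors: List[Tuple[float, float]] = []
    run_start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if run_start is None:
                run_start = i  # a new high-similarity run begins here
        elif run_start is not None:
            if i - run_start >= min_len:
                anchors.append((run_start * seconds_per_frame, i * seconds_per_frame))
            run_start = None
    # Close a run that extends to the final frame.
    if run_start is not None and len(scores) - run_start >= min_len:
        anchors.append((run_start * seconds_per_frame, len(scores) * seconds_per_frame))
    return anchors


# Toy per-frame similarity scores between frame descriptions and the query.
toy_scores = [0.1, 0.7, 0.8, 0.9, 0.2, 0.6, 0.1, 0.8, 0.9]
toy_anchors = similarity_to_span_anchors(toy_scores)
# toy_anchors == [(2.0, 8.0), (14.0, 18.0)]; the isolated frame at index 5 is dropped.
```

In a pipeline like the one the abstract describes, anchors of this kind would be handed to the decoder as prior position information rather than used as final predictions.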

VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT

Mar 04, 2024

Yifang Xu, Yunzhuo Sun, Zien Xie, Benxiang Zhai, Sidan Du

Abstract: Video temporal grounding (VTG) aims to locate specific temporal segments from an untrimmed video based on a linguistic query. Most existing VTG models are trained on extensive annotated video-text pairs, a process that not only introduces human biases from the queries but also incurs significant computational costs. To tackle these challenges, we propose VTG-GPT, a GPT-based method for zero-shot VTG without training or fine-tuning. To reduce bias in the original query, we employ Baichuan2 to generate debiased queries. To reduce redundant information in videos, we apply MiniGPT-v2 to transform visual content into more precise captions. Finally, we devise the proposal generator and post-processing to produce accurate segments from debiased queries and image captions. Extensive experiments demonstrate that VTG-GPT significantly outperforms SOTA methods in zero-shot settings and surpasses unsupervised approaches. More notably, it achieves performance competitive with supervised methods. The code is available at https://github.com/YoucanBaby/VTG-GPT

* 15 pages, 7 figures

MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer

Apr 29, 2023

Yifang Xu, Yunzhuo Sun, Yang Li, Yilei Shi, Xiaoxiang Zhu, Sidan Du

Abstract: With the increasing demand for video understanding, video moment and highlight detection (MHD) has emerged as a critical research topic. MHD aims to localize all moments and predict clip-wise saliency scores simultaneously. Despite progress made by existing DETR-based methods, we observe that these methods coarsely fuse features from different modalities, which weakens the temporal intra-modal context and results in insufficient cross-modal interaction. To address this issue, we propose MH-DETR (Moment and Highlight Detection Transformer) tailored for MHD. Specifically, we introduce a simple yet efficient pooling operator within the uni-modal encoder to capture global intra-modal context. Moreover, to obtain temporally aligned cross-modal features, we design a plug-and-play cross-modal interaction module between the encoder and decoder, seamlessly integrating visual and textual features. Comprehensive experiments on QVHighlights, Charades-STA, Activity-Net, and TVSum datasets show that MH-DETR outperforms existing state-of-the-art methods, demonstrating its effectiveness and superiority. Our code is available at https://github.com/YoucanBaby/MH-DETR.
