Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yifang Xu

Multi-modal Fusion and Query Refinement Network for Video Moment Retrieval and Highlight Detection

Jan 18, 2025

Yifang Xu, Yunzhuo Sun, Benxiang Zhai, Zien Xie, Youyao Jia, Sidan Du

Figure 1 for Multi-modal Fusion and Query Refinement Network for Video Moment Retrieval and Highlight Detection

Figure 2 for Multi-modal Fusion and Query Refinement Network for Video Moment Retrieval and Highlight Detection

Figure 3 for Multi-modal Fusion and Query Refinement Network for Video Moment Retrieval and Highlight Detection

Figure 4 for Multi-modal Fusion and Query Refinement Network for Video Moment Retrieval and Highlight Detection

Abstract:Given a video and a linguistic query, video moment retrieval and highlight detection (MR&HD) aim to locate all the relevant spans while simultaneously predicting saliency scores. Most existing methods utilize RGB images as input, overlooking the inherent multi-modal visual signals like optical flow and depth. In this paper, we propose a Multi-modal Fusion and Query Refinement Network (MRNet) to learn complementary information from multi-modal cues. Specifically, we design a multi-modal fusion module to dynamically combine RGB, optical flow, and depth map. Furthermore, to simulate human understanding of sentences, we introduce a query refinement module that merges text at different granularities, containing word-, phrase-, and sentence-wise levels. Comprehensive experiments on QVHighlights and Charades datasets indicate that MRNet outperforms current state-of-the-art methods, achieving notable improvements in MR-mAP@Avg (+3.41) and HD-HIT@1 (+3.46) on QVHighlights.

* Accepted by ICME 2024

Via

Access Paper or Ask Questions

Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models

Jan 14, 2025

Yifang Xu, Yunzhuo Sun, Benxiang Zhai, Ming Li, Wenxin Liang, Yang Li, Sidan Du

Figure 1 for Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models

Figure 2 for Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models

Figure 3 for Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models

Figure 4 for Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models

Abstract:The target of video moment retrieval (VMR) is predicting temporal spans within a video that semantically match a given linguistic query. Existing VMR methods based on multimodal large language models (MLLMs) overly rely on expensive high-quality datasets and time-consuming fine-tuning. Although some recent studies introduce a zero-shot setting to avoid fine-tuning, they overlook inherent language bias in the query, leading to erroneous localization. To tackle the aforementioned challenges, this paper proposes Moment-GPT, a tuning-free pipeline for zero-shot VMR utilizing frozen MLLMs. Specifically, we first employ LLaMA-3 to correct and rephrase the query to mitigate language bias. Subsequently, we design a span generator combined with MiniGPT-v2 to produce candidate spans adaptively. Finally, to leverage the video comprehension capabilities of MLLMs, we apply VideoChatGPT and span scorer to select the most appropriate spans. Our proposed method substantially outperforms the state-ofthe-art MLLM-based and zero-shot models on several public datasets, including QVHighlights, ActivityNet-Captions, and Charades-STA.

* Accepted by AAAI 2025

Via

Access Paper or Ask Questions

NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results

Apr 17, 2024

Xin Li, Kun Yuan, Yajing Pei, Yiting Lu, Ming Sun, Chao Zhou, Zhibo Chen, Radu Timofte, Wei Sun, Haoning Wu(+58 more)

Figure 1 for NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results

Figure 2 for NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results

Figure 3 for NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results

Figure 4 for NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results

Abstract:This paper reviews the NTIRE 2024 Challenge on Shortform UGC Video Quality Assessment (S-UGC VQA), where various excellent solutions are submitted and evaluated on the collected dataset KVQ from popular short-form video platform, i.e., Kuaishou/Kwai Platform. The KVQ database is divided into three parts, including 2926 videos for training, 420 videos for validation, and 854 videos for testing. The purpose is to build new benchmarks and advance the development of S-UGC VQA. The competition had 200 participants and 13 teams submitted valid solutions for the final testing phase. The proposed solutions achieved state-of-the-art performances for S-UGC VQA. The project can be found at https://github.com/lixinustc/KVQChallenge-CVPR-NTIRE2024.

* Accepted by CVPR2024 Workshop. The challenge report for CVPR NTIRE2024 Short-form UGC Video Quality Assessment Challenge

Via

Access Paper or Ask Questions

GPTSee: Enhancing Moment Retrieval and Highlight Detection via Description-Based Similarity Features

Mar 10, 2024

Yunzhuo Sun, Yifang Xu, Zien Xie, Yukun Shu, Sidan Du

Abstract:Moment retrieval (MR) and highlight detection (HD) aim to identify relevant moments and highlights in video from corresponding natural language query. Large language models (LLMs) have demonstrated proficiency in various computer vision tasks. However, existing methods for MR\&HD have not yet been integrated with LLMs. In this letter, we propose a novel two-stage model that takes the output of LLMs as the input to the second-stage transformer encoder-decoder. First, MiniGPT-4 is employed to generate the detailed description of the video frame and rewrite the query statement, fed into the encoder as new features. Then, semantic similarity is computed between the generated description and the rewritten queries. Finally, continuous high-similarity video frames are converted into span anchors, serving as prior position information for the decoder. Experiments demonstrate that our approach achieves a state-of-the-art result, and by using only span anchors and similarity scores as outputs, positioning accuracy outperforms traditional methods, like Moment-DETR.

* 5 pages, 3 figures

Via

Access Paper or Ask Questions

VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT

Mar 04, 2024

Yifang Xu, Yunzhuo Sun, Zien Xie, Benxiang Zhai, Sidan Du

Abstract:Video temporal grounding (VTG) aims to locate specific temporal segments from an untrimmed video based on a linguistic query. Most existing VTG models are trained on extensive annotated video-text pairs, a process that not only introduces human biases from the queries but also incurs significant computational costs. To tackle these challenges, we propose VTG-GPT, a GPT-based method for zero-shot VTG without training or fine-tuning. To reduce prejudice in the original query, we employ Baichuan2 to generate debiased queries. To lessen redundant information in videos, we apply MiniGPT-v2 to transform visual content into more precise captions. Finally, we devise the proposal generator and post-processing to produce accurate segments from debiased queries and image captions. Extensive experiments demonstrate that VTG-GPT significantly outperforms SOTA methods in zero-shot settings and surpasses unsupervised approaches. More notably, it achieves competitive performance comparable to supervised methods. The code is available on https://github.com/YoucanBaby/VTG-GPT

* 15 pages, 7 figures

Via

Access Paper or Ask Questions

Pyramid Feature Attention Network for Monocular Depth Prediction

Mar 03, 2024

Yifang Xu, Chenglei Peng, Ming Li, Yang Li, Sidan Du

Abstract:Deep convolutional neural networks (DCNNs) have achieved great success in monocular depth estimation (MDE). However, few existing works take the contributions for MDE of different levels feature maps into account, leading to inaccurate spatial layout, ambiguous boundaries and discontinuous object surface in the prediction. To better tackle these problems, we propose a Pyramid Feature Attention Network (PFANet) to improve the high-level context features and low-level spatial features. In the proposed PFANet, we design a Dual-scale Channel Attention Module (DCAM) to employ channel attention in different scales, which aggregate global context and local information from the high-level feature maps. To exploit the spatial relationship of visual features, we design a Spatial Pyramid Attention Module (SPAM) which can guide the network attention to multi-scale detailed information in the low-level feature maps. Finally, we introduce scale-invariant gradient loss to increase the penalty on errors in depth-wise discontinuous regions. Experimental results show that our method outperforms state-of-the-art methods on the KITTI dataset.

* 6 pages, 5 figures

Via

Access Paper or Ask Questions

MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer

Apr 29, 2023

Yifang Xu, Yunzhuo Sun, Yang Li, Yilei Shi, Xiaoxiang Zhu, Sidan Du

Figure 1 for MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer

Figure 2 for MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer

Figure 3 for MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer

Figure 4 for MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer

Abstract:With the increasing demand for video understanding, video moment and highlight detection (MHD) has emerged as a critical research topic. MHD aims to localize all moments and predict clip-wise saliency scores simultaneously. Despite progress made by existing DETR-based methods, we observe that these methods coarsely fuse features from different modalities, which weakens the temporal intra-modal context and results in insufficient cross-modal interaction. To address this issue, we propose MH-DETR (Moment and Highlight Detection Transformer) tailored for MHD. Specifically, we introduce a simple yet efficient pooling operator within the uni-modal encoder to capture global intra-modal context. Moreover, to obtain temporally aligned cross-modal features, we design a plug-and-play cross-modal interaction module between the encoder and decoder, seamlessly integrating visual and textual features. Comprehensive experiments on QVHighlights, Charades-STA, Activity-Net, and TVSum datasets show that MH-DETR outperforms existing state-of-the-art methods, demonstrating its effectiveness and superiority. Our code is available at https://github.com/YoucanBaby/MH-DETR.

Via

Access Paper or Ask Questions

Learning-based Natural Geometric Matching with Homography Prior

Jul 13, 2018

Yifang Xu, Tianli Liao, Jing Chen

Figure 1 for Learning-based Natural Geometric Matching with Homography Prior

Figure 2 for Learning-based Natural Geometric Matching with Homography Prior

Figure 3 for Learning-based Natural Geometric Matching with Homography Prior

Figure 4 for Learning-based Natural Geometric Matching with Homography Prior

Abstract:Geometric matching is a key step in computer vision tasks. Previous learning-based methods for geometric matching concentrate more on improving alignment quality, while we argue the importance of naturalness issue simultaneously. To deal with this, firstly, Pearson correlation is applied to handle large intra-class variations of features in feature matching stage. Then, we parametrize homography transformation with 9 parameters in full connected layer of our network, to better characterize large viewpoint variations compared with affine transformation. Furthermore, a novel loss function with Gaussian weights guarantees the model accuracy and efficiency in training procedure. Finally, we provide two choices for different purposes in geometric matching. When compositing homography with affine transformation, the alignment accuracy improves and all lines are preserved, which results in a more natural transformed image. When compositing homography with non-rigid thin-plate-spline transformation, the alignment accuracy further improves. Experimental results on Proposal Flow dataset show that our method outperforms state-of-the-art methods, both in terms of alignment accuracy and naturalness.

* 13 pages,4 figures

Via

Access Paper or Ask Questions

Coarse-to-fine Seam Estimation for Image Stitching

May 24, 2018

Tianli Liao, Jing Chen, Yifang Xu

Figure 1 for Coarse-to-fine Seam Estimation for Image Stitching

Figure 2 for Coarse-to-fine Seam Estimation for Image Stitching

Figure 3 for Coarse-to-fine Seam Estimation for Image Stitching

Figure 4 for Coarse-to-fine Seam Estimation for Image Stitching

Abstract:Seam-cutting and seam-driven techniques have been proven effective for handling imperfect image series in image stitching. Generally, seam-driven is to utilize seam-cutting to find a best seam from one or finite alignment hypotheses based on a predefined seam quality metric. However, the quality metrics in most methods are defined to measure the average performance of the pixels on the seam without considering the relevance and variance among them. This may cause that the seam with the minimal measure is not optimal (perception-inconsistent) in human perception. In this paper, we propose a novel coarse-to-fine seam estimation method which applies the evaluation in a different way. For pixels on the seam, we develop a patch-point evaluation algorithm concentrating more on the correlation and variation of them. The evaluations are then used to recalculate the difference map of the overlapping region and reestimate a stitching seam. This evaluation-reestimation procedure iterates until the current seam changes negligibly comparing with the previous seams. Experiments show that our proposed method can finally find a nearly perception-consistent seam after several iterations, which outperforms the conventional seam-cutting and other seam-driven methods.

* 5 pages, 4 figures

Via

Access Paper or Ask Questions

Quasi-homography warps in image stitching

Mar 18, 2018

Nan Li, Yifang Xu, Chao Wang

Figure 1 for Quasi-homography warps in image stitching

Figure 2 for Quasi-homography warps in image stitching

Figure 3 for Quasi-homography warps in image stitching

Figure 4 for Quasi-homography warps in image stitching

Abstract:The naturalness of warps is gaining extensive attentions in image stitching. Recent warps such as SPHP and AANAP, use global similarity warps to mitigate projective distortion (which enlarges regions), however, they necessarily bring in perspective distortion (which generates inconsistencies). In this paper, we propose a novel quasi-homography warp, which effectively balances the perspective distortion against the projective distortion in the non-overlapping region to create a more natural-looking panorama. Our approach formulates the warp as the solution of a bivariate system, where perspective distortion and projective distortion are characterized as slope preservation and scale linearization respectively. Because our proposed warp only relies on a global homography, thus it is totally parameter-free. A comprehensive experiment shows that a quasi-homography warp outperforms some state-of-the-art warps in urban scenes, including homography, AutoStitch and SPHP. A user study demonstrates that it wins most users' favor, comparing to homography and SPHP.

* 10 pages, 9 figures

Via

Access Paper or Ask Questions