LIGM
Abstract: We introduce DEIM, an innovative and efficient training framework designed to accelerate convergence in real-time object detection with Transformer-based architectures (DETR). To mitigate the sparse supervision inherent in one-to-one (O2O) matching in DETR models, DEIM employs a Dense O2O matching strategy. This approach increases the number of positive samples per image by incorporating additional targets, using standard data augmentation techniques. While Dense O2O matching speeds up convergence, it also introduces numerous low-quality matches that could affect performance. To address this, we propose the Matchability-Aware Loss (MAL), a novel loss function that optimizes matches across various quality levels, enhancing the effectiveness of Dense O2O. Extensive experiments on the COCO dataset validate the efficacy of DEIM. When integrated with RT-DETR and D-FINE, it consistently boosts performance while reducing training time by 50%. Notably, paired with RT-DETRv2, DEIM achieves 53.2% AP in a single day of training on an NVIDIA 4090 GPU. Additionally, DEIM-trained real-time models outperform leading real-time object detectors, with DEIM-D-FINE-L and DEIM-D-FINE-X achieving 54.7% and 56.5% AP at 124 and 78 FPS on an NVIDIA T4 GPU, respectively, without the need for additional data. We believe DEIM sets a new baseline for advancements in real-time object detection. Our code and pre-trained models are available at https://github.com/ShihuaHuang95/DEIM.
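The abstract does not spell out the exact form of MAL, so the following is only a hedged sketch of a quality-aware classification loss in that spirit: each matched query is weighted by its IoU with the assigned target (its "matchability"), so the extra low-quality matches produced by Dense O2O contribute proportionally less. The function name, the focal-style exponent, and the weighting scheme are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a generic quality-aware classification loss where
# each positive match is weighted by its IoU, so low-quality matches from
# Dense O2O contribute less; not the paper's exact MAL formulation.
import torch
import torch.nn.functional as F


def matchability_aware_loss(pred_logits, pos_mask, match_iou, gamma=2.0):
    """pred_logits: (N,) raw scores for N queries of one class.
    pos_mask:    (N,) bool, True where the query is matched to a ground truth.
    match_iou:   (N,) IoU of each matched query with its target (0 elsewhere).
    """
    prob = pred_logits.sigmoid()
    # Positives regress towards their match quality instead of a hard 1.0.
    target = torch.where(pos_mask, match_iou, torch.zeros_like(match_iou))
    # Down-weight easy negatives (focal-style); scale positives by their IoU.
    weight = torch.where(pos_mask, match_iou, prob.detach() ** gamma)
    loss = F.binary_cross_entropy_with_logits(pred_logits, target, reduction="none")
    return (weight * loss).sum() / pos_mask.sum().clamp(min=1)


# Toy usage: 6 queries, 2 matched with IoUs 0.9 and 0.4.
logits = torch.randn(6, requires_grad=True)
pos = torch.tensor([True, True, False, False, False, False])
iou = torch.tensor([0.9, 0.4, 0.0, 0.0, 0.0, 0.0])
print(matchability_aware_loss(logits, pos, iou))
```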
Abstract: Social media is increasingly plagued by realistic fake images, making it hard to trust content. Previous algorithms for detecting these fakes often fail in new, real-world scenarios because they are trained on specific datasets. To address the problem, we introduce ForgeryTTT, the first method leveraging test-time training (TTT) to identify manipulated regions in images. The proposed approach fine-tunes the model for each individual test sample, improving its performance. ForgeryTTT first employs vision transformers as a shared image encoder to learn both classification and localization tasks simultaneously during training-time training on a large synthetic dataset. Specifically, the localization head predicts a mask to highlight manipulated areas. Given such a mask, the input tokens can be divided into manipulated and genuine groups, which are then fed into the classification head to distinguish between manipulated and genuine parts. During test-time training, the predicted mask from the localization head is used by the classification head to update the image encoder for better adaptation. Additionally, applying the classical dropout strategy within each token group significantly improves performance and efficiency. We test ForgeryTTT on five standard benchmarks. Despite its simplicity, ForgeryTTT achieves a 20.1% improvement in localization accuracy compared to other zero-shot methods and a 4.3% improvement over non-zero-shot techniques. Our code and data will be released upon publication.
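A minimal sketch of the test-time update loop described above, under several assumptions: the encoder, heads, and pooling are toy placeholders, and the self-supervised objective simply asks the classification head to separate the two token groups implied by the predicted mask. It illustrates the mechanism, not the paper's architecture.

```python
# Sketch: localization head predicts a mask, the mask splits tokens into
# "manipulated" vs. "genuine" groups, and the classification head is asked to
# label each group accordingly, back-propagating into the image encoder.
import copy
import torch
import torch.nn as nn

D = 256
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(D, 8, batch_first=True), 2)
loc_head = nn.Linear(D, 1)          # per-token manipulation score
cls_head = nn.Linear(D, 2)          # genuine (0) vs. manipulated (1)


def test_time_adapt(image_tokens, steps=3, lr=1e-4):
    """image_tokens: (1, N, D) tokens of a single test image."""
    enc = copy.deepcopy(encoder)     # adapt a copy so each sample starts fresh
    opt = torch.optim.AdamW(enc.parameters(), lr=lr)
    for _ in range(steps):
        feats = enc(image_tokens)                          # (1, N, D)
        mask = (loc_head(feats).sigmoid() > 0.5)[..., 0]   # (1, N) pseudo mask
        # Pool tokens of each group and ask the classifier to tell them apart.
        groups, labels = [], []
        for flag, label in ((~mask, 0), (mask, 1)):
            if flag.any():
                groups.append(feats[flag].mean(0, keepdim=True))
                labels.append(label)
        logits = cls_head(torch.cat(groups))
        loss = nn.functional.cross_entropy(logits, torch.tensor(labels))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return enc


adapted = test_time_adapt(torch.randn(1, 196, D))
```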
Abstract: We explore the application of the Vision Transformer (ViT) to handwritten text recognition. The limited availability of labeled data in this domain makes it challenging to achieve high performance relying solely on ViT. Previous transformer-based models required external data or extensive pre-training on large datasets to excel. To address this limitation, we introduce a data-efficient ViT method that uses only the encoder of the standard transformer. We find that incorporating a Convolutional Neural Network (CNN) for feature extraction in place of the original patch embedding, together with the Sharpness-Aware Minimization (SAM) optimizer to ensure that the model converges towards flatter minima, yields notable improvements. Furthermore, our span mask technique, which masks interconnected features in the feature map, acts as an effective regularizer. Empirically, our approach competes favorably with traditional CNN-based models on small datasets such as IAM and READ2016. Additionally, it establishes a new benchmark on the LAM dataset, currently the largest such dataset, with 19,830 training text lines. The code is publicly available at: https://github.com/YutingLi0606/HTR-VT.
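A hedged sketch of what a span mask over a CNN feature map could look like: contiguous horizontal spans of feature columns are zeroed before the transformer encoder, rather than isolated positions. The span length and masking ratio are assumptions for illustration, not the paper's exact policy.

```python
# Sketch of a span mask regularizer: zero contiguous column spans of the CNN
# feature map of a text line instead of dropping isolated positions.
import torch


def span_mask(features, mask_ratio=0.4, span=4):
    """features: (B, C, H, W) CNN feature map for a text-line image.
    Masks random horizontal spans of `span` consecutive columns until roughly
    `mask_ratio` of the width is hidden."""
    B, C, H, W = features.shape
    masked = features.clone()
    n_spans = max(1, int(mask_ratio * W / span))
    for b in range(B):
        starts = torch.randint(0, max(1, W - span), (n_spans,))
        for s in starts.tolist():
            masked[b, :, :, s : s + span] = 0.0
    return masked


x = torch.randn(2, 256, 8, 128)      # e.g. CNN features of a text line
print(span_mask(x).shape)            # torch.Size([2, 256, 8, 128])
```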
Abstract: Reducing false positives is essential for enhancing object detector performance, as reflected in the mean Average Precision (mAP) metric. Although object detectors have achieved notable improvements and high mAP scores on the COCO dataset, our analysis reveals limited progress in addressing false positives caused by non-target visual clutter, i.e., background objects not included in the annotated categories. This issue is particularly critical in real-world applications, such as fire and smoke detection, where minimizing false alarms is crucial. In this study, we introduce COCO-FP, a new evaluation dataset derived from the ImageNet-1K dataset, designed to address this issue. By extending the original COCO validation dataset, COCO-FP specifically assesses object detectors' performance in mitigating background false positives. Our evaluation of both standard and advanced object detectors shows a significant number of false positives in both closed-set and open-set scenarios. For example, the AP50 metric for YOLOv9-E decreases from 72.8 to 65.7 when shifting from COCO to COCO-FP. The dataset is available at https://github.com/COCO-FP/COCO-FP.
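For concreteness, this is how such an evaluation is typically run with pycocotools: the validation annotation file is extended with background images that carry no ground-truth boxes, so every detection fired on them counts against precision. The file names below are placeholders, not the actual COCO-FP release layout.

```python
# Standard COCO-style evaluation on an extended validation set: added
# background images have no ground truth, so detections on them are false
# positives that pull AP down. File names are hypothetical placeholders.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val_cocofp.json")  # COCO val + background images
coco_dt = coco_gt.loadRes("detections_yolo.json")        # detector outputs (COCO format)

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # AP drops vs. plain COCO if the model fires on clutter
```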
Abstract: The intersection of physics-based vision and deep learning presents an exciting frontier for advancing computer vision technologies. By leveraging the principles of physics to inform and enhance deep learning models, we can develop more robust and accurate vision systems. Physics-based vision aims to invert image formation processes to recover scene properties such as shape, reflectance, light distribution, and medium properties from images. In recent years, deep learning has shown promising improvements on various vision tasks, and when combined with physics-based vision, these approaches can enhance the robustness and accuracy of vision systems. This technical report summarizes the outcomes of the Physics-Based Vision Meets Deep Learning (PBDL) 2024 challenge, held at the CVPR 2024 workshop. The challenge consisted of eight tracks, focusing on Low-Light Enhancement and Detection as well as High Dynamic Range (HDR) Imaging. This report details the objectives, methodologies, and results of each track, highlighting the top-performing solutions and their innovative approaches.
Abstract: Zero-shot Video Object Segmentation (ZSVOS) aims at segmenting the primary moving object without any human annotations. Mainstream solutions mainly focus on learning a single model on large-scale video datasets, which struggles to generalize to unseen videos. In this work, we introduce a test-time training (TTT) strategy to address the problem. Our key insight is to enforce the model to predict consistent depth during the TTT process. In detail, we first train a single network to perform both segmentation and depth prediction tasks. This can be effectively learned with our specifically designed depth modulation layer. Then, during the TTT process, the model is updated by predicting consistent depth maps for the same frame under different data augmentations. In addition, we explore different TTT weight-updating strategies. Our empirical results suggest that the momentum-based weight initialization and looping-based training scheme lead to more stable improvements. Experiments show that the proposed method achieves clear improvements on ZSVOS, and our video TTT strategy clearly outperforms state-of-the-art TTT methods. Our code is available at: https://nifangbaage.github.io/DATTT.
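A minimal sketch of the depth-consistency test-time update, under simplifying assumptions: a toy depth predictor stands in for the real segmentation-plus-depth network, only photometric augmentations are used so the two depth maps stay pixel-aligned, and the momentum blend is one plausible reading of the momentum-based weight initialization mentioned above, not the paper's exact scheme.

```python
# Sketch: augment the same frame twice, predict a depth map for each view, and
# update the model so the two depth maps agree.
import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 1, 3, padding=1))   # toy depth predictor


def photometric_aug(frame):
    # brightness / contrast jitter only, so pixels stay spatially aligned
    return frame * (0.8 + 0.4 * torch.rand(1)) + 0.1 * torch.randn_like(frame)


def depth_consistency_ttt(frame, steps=5, lr=1e-4, momentum=0.99):
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):
        d1 = adapted(photometric_aug(frame))
        d2 = adapted(photometric_aug(frame))
        loss = (d1 - d2).abs().mean()          # consistent depth for the same frame
        opt.zero_grad()
        loss.backward()
        opt.step()
        # momentum-style pull back towards the source weights, in the spirit of
        # the "momentum-based weight initialization" mentioned in the abstract
        with torch.no_grad():
            for p_a, p_s in zip(adapted.parameters(), model.parameters()):
                p_a.mul_(momentum).add_(p_s, alpha=1 - momentum)
    return adapted


adapted = depth_consistency_ttt(torch.rand(1, 3, 64, 64))
```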
Abstract: In this paper, we revisit techniques for uncertainty estimation within deep neural networks and consolidate a suite of techniques to enhance their reliability. Our investigation reveals that an integrated application of diverse techniques, spanning model regularization, classifier design, and optimization, substantially improves the accuracy of uncertainty predictions in image classification tasks. The synergistic effect of these techniques culminates in our novel SURE approach. We rigorously evaluate SURE on the benchmark of failure prediction, a critical testbed for uncertainty estimation efficacy. Our results show consistently better performance than models that individually deploy each technique, across various datasets and model architectures. When applied to real-world challenges, such as data corruption, label noise, and long-tailed class distributions, SURE exhibits remarkable robustness, delivering results that are superior or on par with current state-of-the-art specialized methods. In particular, on Animal-10N and Food-101N for learning with noisy labels, SURE achieves state-of-the-art performance without any task-specific adjustments. This work not only sets a new benchmark for robust uncertainty estimation but also paves the way for its application in diverse, real-world scenarios where reliability is paramount. Our code is available at https://yutingli0606.github.io/SURE/.
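The abstract does not detail SURE's individual components, so the sketch below only illustrates the evaluation protocol it is tested on, failure prediction: maximum softmax probability serves as the confidence score, and the area under the risk-coverage curve (AURC) measures how well that confidence separates correct from incorrect predictions. All names here are generic and not part of SURE itself.

```python
# Generic failure-prediction evaluation: does the model's confidence rank its
# correct predictions above its mistakes? Lower AURC is better.
import numpy as np


def aurc(confidence, correct):
    """confidence: (N,) scores; correct: (N,) 0/1 correctness of predictions."""
    order = np.argsort(-confidence)              # most confident first
    errors = 1.0 - correct[order]
    coverage = np.arange(1, len(errors) + 1)
    risk = np.cumsum(errors) / coverage          # error rate at each coverage
    return risk.mean()                           # area under risk-coverage curve


rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 10))
probs = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
conf = probs.max(1)                              # maximum softmax probability
correct = (probs.argmax(1) == rng.integers(0, 10, 1000)).astype(float)
print("AURC:", aurc(conf, correct))
```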
Abstract: Space situational awareness demands efficient monitoring of terrestrial sites and celestial bodies, necessitating advanced target recognition systems. Current target recognition systems exhibit limited operational speed due to challenges in handling substantial image data. While machine learning has improved this situation, high-resolution images remain a concern. Optical correlators, relying on analog processes, provide a potential alternative but are hindered by material limitations. Recent advancements in hybrid opto-electronic correlators (HOC) have addressed such limitations, additionally achieving shift, scale, and rotation invariant (SSRI) target recognition through the use of the polar Mellin transform (PMT). However, there are currently no techniques for obtaining the PMT at speeds fast enough to take advantage of the inherent speed of the HOC. To that end, we demonstrate an opto-electronic PMT pre-processor that operates at record-breaking millisecond frame rates using commercially available components, for use in an automated SSRI HOC image recognition system for space situational awareness.
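As a digital reference for what the opto-electronic pre-processor computes, the polar Mellin transform can be sketched in a few lines: take the Fourier magnitude (removing sensitivity to shifts), then resample it on a log-polar grid so that scale and rotation of the input become translations. Grid sizes and the minimum radius below are illustrative choices.

```python
# Digital sketch of the polar Mellin transform: Fourier magnitude followed by
# log-polar resampling, giving shift, scale, and rotation invariance up to
# translations along the output axes.
import numpy as np
from scipy.ndimage import map_coordinates


def polar_mellin_transform(image, n_r=128, n_theta=128, r_min=1.0):
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(image)))
    cy, cx = np.array(spectrum.shape) / 2.0
    r_max = min(cy, cx)
    # log-spaced radii (Mellin axis) and uniformly spaced angles
    log_r = np.linspace(np.log(r_min), np.log(r_max), n_r)
    theta = np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False)
    rr, tt = np.meshgrid(np.exp(log_r), theta, indexing="ij")
    rows = cy + rr * np.sin(tt)
    cols = cx + rr * np.cos(tt)
    return map_coordinates(spectrum, [rows, cols], order=1, mode="constant")


img = np.zeros((256, 256))
img[100:150, 80:180] = 1.0            # toy target
pmt = polar_mellin_transform(img)
print(pmt.shape)                      # (128, 128): rows = log-radius, cols = angle
```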
Abstract: Gestures are non-verbal but important behaviors accompanying people's speech. While previous methods are able to generate gestures synchronized with the speech rhythm, the semantic context of the speech is generally lacking in the gesticulations. Although semantic gestures do not occur very regularly in human speech, they are key for the audience to understand the speech context in a more immersive environment. Hence, we introduce LivelySpeaker, a framework that realizes semantics-aware co-speech gesture generation and offers several control handles. In particular, our method decouples the task into two stages: script-based gesture generation and audio-guided rhythm refinement. Specifically, the script-based gesture generation leverages pre-trained CLIP text embeddings as guidance for generating gestures that are highly semantically aligned with the script. Then, we devise a simple but effective diffusion-based gesture generation backbone built purely from MLPs, which is conditioned only on audio signals and learns to gesticulate with realistic motions. We utilize this powerful prior to align the script-guided gestures with the audio signals, notably in a zero-shot setting. Our novel two-stage generation framework also enables several applications, such as changing the gesticulation style, editing the co-speech gestures via textual prompting, and controlling the semantic awareness and rhythm alignment with guided diffusion. Extensive experiments demonstrate the advantages of the proposed framework over competing methods. In addition, our core diffusion-based generative model achieves state-of-the-art performance on two benchmarks. The code and model will be released to facilitate future research.
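A hedged sketch of an audio-conditioned, MLP-only diffusion denoiser in the spirit of the second stage: the pose and audio dimensionalities, the timestep handling, and the noise-prediction objective are assumptions for illustration, not the paper's exact design.

```python
# Toy MLP denoiser for gesture diffusion: given noisy pose frames, a diffusion
# timestep, and an audio feature vector, predict the injected noise.
import torch
import torch.nn as nn


class MLPGestureDenoiser(nn.Module):
    def __init__(self, pose_dim=135, audio_dim=128, hidden=512, frames=60):
        super().__init__()
        in_dim = frames * pose_dim + audio_dim + 1   # poses + audio + timestep
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, frames * pose_dim),
        )
        self.frames, self.pose_dim = frames, pose_dim

    def forward(self, noisy_poses, t, audio_feat):
        """noisy_poses: (B, frames, pose_dim), t: (B,), audio_feat: (B, audio_dim)"""
        x = torch.cat([noisy_poses.flatten(1), audio_feat, t[:, None].float()], dim=1)
        return self.net(x).view(-1, self.frames, self.pose_dim)   # predicted noise


# One training step of the epsilon-prediction objective (toy schedule value).
model = MLPGestureDenoiser()
poses, audio = torch.randn(4, 60, 135), torch.randn(4, 128)
t, noise = torch.randint(0, 1000, (4,)), torch.randn(4, 60, 135)
alpha_bar = 0.5                                    # stands in for the schedule at step t
noisy = alpha_bar**0.5 * poses + (1 - alpha_bar)**0.5 * noise
loss = ((model(noisy, t, audio) - noise) ** 2).mean()
```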
Abstract: Foreground segmentation is a fundamental problem in computer vision that includes salient object detection, forgery detection, defocus blur detection, shadow detection, and camouflaged object detection. Previous works have typically relied on domain-specific solutions to address accuracy and robustness issues in those applications. In this paper, we present a unified framework for a number of foreground segmentation tasks without any task-specific designs. We take inspiration from the widely used pre-training and then prompt-tuning protocols in NLP and propose a new visual prompting model, named Explicit Visual Prompting (EVP). Unlike previous visual prompting, which typically learns a dataset-level implicit embedding, our key insight is to make the tunable parameters focus on the explicit visual content of each individual image, i.e., the features from frozen patch embeddings and the high-frequency components. Our method freezes a pre-trained model and then learns task-specific knowledge using a small number of extra parameters. Despite introducing only a few tunable parameters, EVP achieves superior performance to full fine-tuning and other parameter-efficient fine-tuning methods. Experiments on fourteen datasets across five tasks show that the proposed method outperforms other task-specific methods while being considerably simpler. The proposed method also demonstrates scalability across different architectures, pre-trained weights, and tasks. The code is available at: https://github.com/NiFangBaAGe/Explicit-Visual-Prompt.
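A hedged sketch of the core idea: freeze the backbone and tune only a small module that is fed explicit content from each image, here the high-frequency components obtained by zeroing the low band of the Fourier spectrum. The adapter architecture, the frequency cut-off, and the way the prompt would be injected into the backbone are placeholders, not the paper's exact design.

```python
# Frozen backbone + tiny tunable adapter driven by the image's own
# high-frequency components (explicit visual content).
import torch
import torch.nn as nn


def high_frequency(image, cutoff=0.25):
    """Keep only frequencies above `cutoff` of the spectrum's half-width."""
    fft = torch.fft.fftshift(torch.fft.fft2(image), dim=(-2, -1))
    _, _, h, w = image.shape
    cy, cx = h // 2, w // 2
    ry, rx = int(cutoff * cy), int(cutoff * cx)
    fft[..., cy - ry:cy + ry, cx - rx:cx + rx] = 0          # zero the low band
    return torch.fft.ifft2(torch.fft.ifftshift(fft, dim=(-2, -1))).real


class PromptedModel(nn.Module):
    def __init__(self, frozen_backbone, embed_dim=768):
        super().__init__()
        self.backbone = frozen_backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)                          # backbone stays frozen
        # the only tunable parameters: a tiny adapter on the high-freq content
        self.prompt_adapter = nn.Sequential(
            nn.Conv2d(3, embed_dim, kernel_size=16, stride=16), nn.GELU())

    def forward(self, image):
        prompt = self.prompt_adapter(high_frequency(image))  # (B, D, H/16, W/16)
        features = self.backbone(image)
        return features, prompt   # how the prompt is injected is model-specific


backbone = nn.Conv2d(3, 768, 16, 16)          # stands in for a pre-trained ViT stem
model = PromptedModel(backbone)
feats, prompt = model(torch.randn(1, 3, 224, 224))
```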