Yujia Sun

AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection

Apr 06, 2025

Peng Wu, Wanshun Su, Guansong Pang, Yujia Sun, Qingsen Yan, Peng Wang, Yanning Zhang

Abstract: With the increasing adoption of video anomaly detection in intelligent surveillance domains, conventional visual-based detection approaches often struggle with information insufficiency and high false-positive rates in complex environments. To address these limitations, we present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Capitalizing on the exceptional cross-modal representation learning capabilities of Contrastive Language-Image Pretraining (CLIP) across visual, audio, and textual domains, our framework introduces two major innovations: an efficient audio-visual fusion that enables adaptive cross-modal integration through lightweight parametric adaptation while maintaining the frozen CLIP backbone, and a novel audio-visual prompt that dynamically enhances text embeddings with key multimodal information based on the semantic correlation between audio-visual features and textual labels, significantly improving CLIP's generalization for the video anomaly detection task. Moreover, to enhance robustness against modality deficiency during inference, we further develop an uncertainty-driven feature distillation module that synthesizes audio-visual representations from visual-only inputs. This module employs uncertainty modeling based on the diversity of audio-visual features to dynamically emphasize challenging features during the distillation process. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy in various scenarios. Notably, with unimodal data enhanced by uncertainty-driven distillation, our approach consistently outperforms current unimodal VAD methods.

* 11 pages, 4 figures, 6 tables

Revisiting Acoustic Similarity in Emotional Speech and Music via Self-Supervised Representations

Sep 26, 2024

Yujia Sun, Zeyu Zhao, Korin Richmond, Yuanchao Li

Abstract: Emotion recognition from speech and music shares similarities due to their acoustic overlap, which has led to interest in transferring knowledge between these domains. However, the shared acoustic cues between speech and music, particularly those encoded by Self-Supervised Learning (SSL) models, remain largely unexplored, since SSL models for speech and music have rarely been applied in cross-domain research. In this work, we revisit the acoustic similarity between emotional speech and music, starting with an analysis of the layerwise behavior of SSL models for Speech Emotion Recognition (SER) and Music Emotion Recognition (MER). Furthermore, we perform cross-domain adaptation by comparing several approaches in a two-stage fine-tuning process, examining effective ways to utilize music for SER and speech for MER. Lastly, we explore the acoustic similarities between emotional speech and music using Fréchet audio distance for individual emotions, uncovering the issue of emotion bias in both speech and music SSL models. Our findings reveal that while speech and music SSL models do capture shared acoustic features, their behaviors can vary depending on the emotion due to their training strategies and domain specificities. Additionally, parameter-efficient fine-tuning can enhance SER and MER performance by leveraging knowledge from each other. This study provides new insights into the acoustic similarity between emotional speech and music, and highlights the potential for cross-domain generalization to improve SER and MER systems.
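
The abstract above mentions comparing emotional speech and music with the Fréchet audio distance. As a rough illustration only, the sketch below computes the standard Fréchet distance between two Gaussians fitted to embedding sets, which is the quantity this metric is usually built on. The function names `embedding_stats` and `frechet_distance` are hypothetical, and the embedding model, layers, and data used by the authors are not described on this page, so none of that is reproduced here.

```python
import numpy as np


def embedding_stats(embeddings):
    """Return the mean vector and covariance matrix of a (n_samples, dim) array."""
    x = np.asarray(embeddings, dtype=float)
    return x.mean(axis=0), np.cov(x, rowvar=False)


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2).

    d^2 = ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * (sigma1 @ sigma2)^(1/2))
    """
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    sigma1 = np.asarray(sigma1, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)

    diff = mu1 - mu2
    # The trace of the matrix square root equals the sum of the square roots
    # of the eigenvalues of sigma1 @ sigma2. For positive semi-definite inputs
    # these are real and non-negative up to numerical noise, so we drop tiny
    # imaginary parts and clip small negative values.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)
```

In use, one would pass the statistics of two embedding sets (for example, embeddings of "happy" speech clips and "happy" music clips, an assumed pairing) to `frechet_distance`; a smaller value indicates that the two sets are closer in the embedding space.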

Open-Vocabulary Video Anomaly Detection

Nov 15, 2023

Peng Wu, Xuerong Zhou, Guansong Pang, Yujia Sun, Jing Liu, Peng Wang, Yanning Zhang

Abstract: Video anomaly detection (VAD) with weak supervision has achieved remarkable performance in utilizing video-level labels to discriminate whether a video frame is normal or abnormal. However, current approaches are inherently limited to a closed-set setting and may struggle in open-world applications where the test data can contain anomaly categories unseen during training. A few recent studies attempt to tackle a more realistic setting, open-set VAD, which aims to detect unseen anomalies given seen anomalies and normal videos. However, such a setting focuses on predicting frame anomaly scores and cannot recognize the specific categories of anomalies, even though this ability is essential for building more informed video surveillance systems. This paper takes a step further and explores open-vocabulary video anomaly detection (OVVAD), in which we aim to leverage pre-trained large models to detect and categorize seen and unseen anomalies. To this end, we propose a model that decouples OVVAD into two mutually complementary tasks -- class-agnostic detection and class-specific classification -- and jointly optimizes both tasks. In particular, we devise a semantic knowledge injection module to introduce semantic knowledge from large language models for the detection task, and design a novel anomaly synthesis module to generate pseudo unseen anomaly videos with the help of large vision generation models for the classification task. This semantic knowledge and these synthesized anomalies substantially extend our model's capability in detecting and categorizing a variety of seen and unseen anomalies. Extensive experiments on three widely-used benchmarks demonstrate that our model achieves state-of-the-art performance on the OVVAD task.

* Submitted

Boundary-Guided Camouflaged Object Detection

Jul 02, 2022

Yujia Sun, Shuo Wang, Chenglizhao Chen, Tian-Zhu Xiang

Abstract: Camouflaged object detection (COD), segmenting objects that are elegantly blended into their surroundings, is a valuable yet challenging task. Existing deep-learning methods often struggle to accurately identify the camouflaged object with complete and fine object structure. To this end, in this paper, we propose a novel boundary-guided network (BGNet) for camouflaged object detection. Our method explores valuable and extra object-related edge semantics to guide representation learning of COD, which forces the model to generate features that highlight object structure, thereby promoting camouflaged object detection with accurate boundary localization. Extensive experiments on three challenging benchmark datasets demonstrate that our BGNet significantly outperforms 18 existing state-of-the-art methods under four widely-used evaluation metrics. Our code is publicly available at: https://github.com/thograce/BGNet.

* Accepted by IJCAI2022

Context-aware Cross-level Fusion Network for Camouflaged Object Detection

May 26, 2021

Yujia Sun, Geng Chen, Tao Zhou, Yi Zhang, Nian Liu

Abstract: Camouflaged object detection (COD) is a challenging task due to the low boundary contrast between the object and its surroundings. In addition, the appearance of camouflaged objects varies significantly, e.g., in object size and shape, aggravating the difficulties of accurate COD. In this paper, we propose a novel Context-aware Cross-level Fusion Network (C2F-Net) to address the challenging COD task. Specifically, we propose an Attention-induced Cross-level Fusion Module (ACFM) to integrate the multi-level features with informative attention coefficients. The fused features are then fed to the proposed Dual-branch Global Context Module (DGCM), which yields multi-scale feature representations for exploiting rich global context information. In C2F-Net, the two modules are applied to high-level features in a cascaded manner. Extensive experiments on three widely used benchmark datasets demonstrate that our C2F-Net is an effective COD model and outperforms state-of-the-art models remarkably. Our code is publicly available at: https://github.com/thograce/C2FNet.

* 7 pages, 4 figures. Accepted by IJCAI-2021

Learning Synergistic Attention for Light Field Salient Object Detection

May 16, 2021

Yi Zhang, Geng Chen, Qian Chen, Yujia Sun, Olivier Deforges, Wassim Hamidouche, Lu Zhang

Abstract: We propose a novel Synergistic Attention Network (SA-Net) to address light field salient object detection by establishing a synergistic effect between multi-modal features with advanced attention mechanisms. Our SA-Net exploits the rich information of focal stacks via 3D convolutional neural networks, decodes the high-level features of multi-modal light field data with two cascaded synergistic attention modules, and predicts the saliency map using an effective feature fusion module in a progressive manner. Extensive experiments on three widely-used benchmark datasets show that our SA-Net outperforms 28 state-of-the-art models, demonstrating its effectiveness and superiority. Our code will be made publicly available.

* 14 pages, 12 figures

Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision

Jul 13, 2020

Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, Zhiwei Yang

Abstract: Violence detection has been studied in computer vision for years. However, previous work is either superficial, e.g., classifying short clips or covering a single scenario, or undersupplied, e.g., using a single modality or multimodality based on hand-crafted features. To address this problem, in this work we first release a large-scale and multi-scene dataset named XD-Violence with a total duration of 217 hours, containing 4754 untrimmed videos with audio signals and weak labels. Then we propose a neural network containing three parallel branches to capture different relations among video snippets and integrate features, where the holistic branch captures long-range dependencies using a similarity prior, the localized branch captures local positional relations using a proximity prior, and the score branch dynamically captures the closeness of predicted scores. In addition, our method includes an approximator to meet the needs of online detection. Our method outperforms other state-of-the-art methods on our released dataset and another existing benchmark. Moreover, extensive experimental results also show the positive effect of multimodal (audio-visual) input and of modeling relationships. The code and dataset will be released at https://roc-ng.github.io/XD-Violence/.

* To appear in ECCV 2020
