Lisen Dai

SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering

Nov 07, 2024

Tianyu Yang, Yiyang Nan, Lisen Dai, Zhenwen Liang, Yapeng Tian, Xiangliang Zhang

Abstract: Audio-Visual Question Answering (AVQA) is a challenging task that involves answering questions based on both auditory and visual information in videos. A significant challenge is interpreting complex multi-modal scenes, which include both visual objects and sound sources, and connecting them to the given question. In this paper, we introduce the Source-aware Semantic Representation Network (SaSR-Net), a novel model designed for AVQA. SaSR-Net utilizes source-wise learnable tokens to efficiently capture and align audio-visual elements with the corresponding question. It streamlines the fusion of audio and visual information using spatial and temporal attention mechanisms to identify answers in multi-modal scenes. Extensive experiments on the Music-AVQA and AVQA-Yang datasets show that SaSR-Net outperforms state-of-the-art AVQA methods.

* EMNLP 2024

CLIPErase: Efficient Unlearning of Visual-Textual Associations in CLIP

Oct 30, 2024

Tianyu Yang, Lisen Dai, Zheyuan Liu, Xiangqi Wang, Meng Jiang, Yapeng Tian, Xiangliang Zhang

Abstract: Machine unlearning (MU) has gained significant attention as a means to remove specific data from trained models without requiring a full retraining process. While progress has been made in unimodal domains like text and image classification, unlearning in multimodal models remains relatively underexplored. In this work, we address the unique challenges of unlearning in CLIP, a prominent multimodal model that aligns visual and textual representations. We introduce CLIPErase, a novel approach that disentangles and selectively forgets both visual and textual associations, ensuring that unlearning does not compromise model performance. CLIPErase consists of three key modules: a Forgetting Module that disrupts the associations in the forget set, a Retention Module that preserves performance on the retain set, and a Consistency Module that maintains consistency with the original model. Extensive experiments on the CIFAR-100 and Flickr30K datasets across four CLIP downstream tasks demonstrate that CLIPErase effectively forgets designated associations in zero-shot tasks for multimodal samples, while preserving the model's performance on the retain set after unlearning.
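
The abstract names three modules but does not give their loss functions, so the following is only an illustrative sketch of how a three-term unlearning objective of this general shape could be combined. The function `unlearning_objective`, the cosine-based terms, and the weights `w_forget`, `w_retain`, and `w_consist` are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def unlearning_objective(forget_pairs, retain_pairs, retain_pairs_orig,
                         w_forget=1.0, w_retain=1.0, w_consist=1.0):
    """Hypothetical three-term objective (lower is better); NOT the paper's loss.

    forget_pairs, retain_pairs: lists of (image_emb, text_emb) produced by
        the model being updated.
    retain_pairs_orig: the same retain samples embedded by the frozen
        original model, in the same order.
    """
    # Forgetting term (assumed form): penalise image-text similarity that
    # still remains on the forget set.
    forget = mean([cosine(img, txt) for img, txt in forget_pairs])

    # Retention term (assumed form): penalise lost image-text alignment
    # on the retain set.
    retain = mean([1.0 - cosine(img, txt) for img, txt in retain_pairs])

    # Consistency term (assumed form): penalise drift of retain-set
    # embeddings away from the original model's embeddings.
    consist = mean([
        (1.0 - cosine(img, img_o)) + (1.0 - cosine(txt, txt_o))
        for (img, txt), (img_o, txt_o) in zip(retain_pairs, retain_pairs_orig)
    ])

    return w_forget * forget + w_retain * retain + w_consist * consist
```

In this sketch, a forget pair whose image and text embeddings are still aligned raises the objective, while retain pairs that stay aligned and unchanged relative to the original model contribute nothing. The weights only illustrate that the three goals may need to be traded off against each other.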
