Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yajian Wang

A study on joint modeling and data augmentation of multi-modalities for audio-visual scene classification

Mar 31, 2022

Qing Wang, Jun Du, Siyuan Zheng, Yunqing Li, Yajian Wang, Yuzhong Wu, Hu Hu, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Yannan Wang(+1 more)

Figure 1 for A study on joint modeling and data augmentation of multi-modalities for audio-visual scene classification

Figure 2 for A study on joint modeling and data augmentation of multi-modalities for audio-visual scene classification

Figure 3 for A study on joint modeling and data augmentation of multi-modalities for audio-visual scene classification

Figure 4 for A study on joint modeling and data augmentation of multi-modalities for audio-visual scene classification

Abstract:In this paper, we propose two techniques, namely joint modeling and data augmentation, to improve system performances for audio-visual scene classification (AVSC). We employ pre-trained networks trained only on image data sets to extract video embedding; whereas for audio embedding models, we decide to train them from scratch. We explore different neural network architectures for joint modeling to effectively combine the video and audio modalities. Moreover, data augmentation strategies are investigated to increase audio-visual training set size. For the video modality the effectiveness of several operations in RandAugment is verified. An audio-video joint mixup scheme is proposed to further improve AVSC performances. Evaluated on the development set of TAU Urban Audio Visual Scenes 2021, our final system can achieve the best accuracy of 94.2% among all single AVSC systems submitted to DCASE 2021 Task 1b.

* 5 pages, 1 figure, submitted to INTERSPEECH 2022

Via

Access Paper or Ask Questions

A Two-Stage Approach to Device-Robust Acoustic Scene Classification

Nov 03, 2020

Hu Hu, Chao-Han Huck Yang, Xianjun Xia, Xue Bai, Xin Tang, Yajian Wang, Shutong Niu, Li Chai, Juanjuan Li, Hongning Zhu(+6 more)

Figure 1 for A Two-Stage Approach to Device-Robust Acoustic Scene Classification

Figure 2 for A Two-Stage Approach to Device-Robust Acoustic Scene Classification

Figure 3 for A Two-Stage Approach to Device-Robust Acoustic Scene Classification

Figure 4 for A Two-Stage Approach to Device-Robust Acoustic Scene Classification

Abstract:To improve device robustness, a highly desirable key feature of a competitive data-driven acoustic scene classification (ASC) system, a novel two-stage system based on fully convolutional neural networks (CNNs) is proposed. Our two-stage system leverages on an ad-hoc score combination based on two CNN classifiers: (i) the first CNN classifies acoustic inputs into one of three broad classes, and (ii) the second CNN classifies the same inputs into one of ten finer-grained classes. Three different CNN architectures are explored to implement the two-stage classifiers, and a frequency sub-sampling scheme is investigated. Moreover, novel data augmentation schemes for ASC are also investigated. Evaluated on DCASE 2020 Task 1a, our results show that the proposed ASC system attains a state-of-the-art accuracy on the development set, where our best system, a two-stage fusion of CNN ensembles, delivers a 81.9% average accuracy among multi-device test data, and it obtains a significant improvement on unseen devices. Finally, neural saliency analysis with class activation mapping (CAM) gives new insights on the patterns learnt by our models.

* Submitted to ICASSP 2021. Code available: https://github.com/MihawkHu/DCASE2020_task1

Via

Access Paper or Ask Questions

Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation

Aug 27, 2020

Hu Hu, Chao-Han Huck Yang, Xianjun Xia, Xue Bai, Xin Tang, Yajian Wang, Shutong Niu, Li Chai, Juanjuan Li, Hongning Zhu(+6 more)

Figure 1 for Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation

Figure 2 for Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation

Figure 3 for Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation

Abstract:In this technical report, we present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge. Task 1 comprises two different sub-tasks: (i) Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes, and (ii) Task 1b concerns with classification of data into three higher-level classes using low-complexity solutions. For Task 1a, we propose a novel two-stage ASC system leveraging upon ad-hoc score combination of two convolutional neural networks (CNNs), classifying the acoustic input according to three classes, and then ten classes, respectively. Four different CNN-based architectures are explored to implement the two-stage classifiers, and several data augmentation techniques are also investigated. For Task 1b, we leverage upon a quantization method to reduce the complexity of two of our top-accuracy three-classes CNN-based architectures. On Task 1a development data set, an ASC accuracy of 76.9\% is attained using our best single classifier and data augmentation. An accuracy of 81.9\% is then attained by a final model fusion of our two-stage ASC classifiers. On Task 1b development data set, we achieve an accuracy of 96.7\% with a model size smaller than 500KB. Code is available: https://github.com/MihawkHu/DCASE2020_task1.

* Revised Technical Report. Proposed systems attain 2nds in both Task-1a and Task-1b in the official DCASE challenge 2020

Via

Access Paper or Ask Questions