Eric Sommerlade

ATS: Adaptive Token Sampling For Efficient Vision Transformers

Nov 30, 2021

Mohsen Fayyaz, Soroush Abbasi Kouhpayegani, Farnoush Rezaei Jafari, Eric Sommerlade, Hamid Reza Vaezi Joze, Hamed Pirsiavash, Juergen Gall

Abstract: While state-of-the-art vision transformer models achieve promising results for image classification, they are computationally expensive and require many GFLOPs. Although the GFLOPs of a vision transformer can be decreased by reducing the number of tokens in the network, no single setting is optimal for all input images. In this work, we therefore introduce a differentiable, parameter-free Adaptive Token Sampling (ATS) module, which can be plugged into any existing vision transformer architecture. ATS scores tokens and adaptively samples the significant ones. As a result, the number of tokens is no longer static but varies for each input image. By integrating ATS as an additional layer within current transformer blocks, we can convert them into much more efficient vision transformers with an adaptive number of tokens. Since ATS is a parameter-free module, it can be added to off-the-shelf pretrained vision transformers as a plug-and-play module, thus reducing their GFLOPs without any additional training. However, due to its differentiable design, one can also train a vision transformer equipped with ATS. We evaluate our module on the ImageNet dataset by adding it to multiple state-of-the-art vision transformers. Our evaluations show that the proposed module improves the state of the art by reducing the computational cost (GFLOPs) by 37% while preserving the accuracy.
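
The idea of scoring tokens and keeping an input-dependent number of them can be illustrated with a minimal NumPy sketch. The abstract does not spell out the scoring rule or the sampling scheme, so everything below is an assumption made for illustration: the score (class-token attention weighted by value norm), the fixed-quantile inverse-transform sampling, and the names `adaptive_token_sample` and `max_tokens` are all hypothetical and are not claimed to match the authors' implementation.

```python
import numpy as np

def adaptive_token_sample(attn, values, max_tokens):
    """Hypothetical sketch of score-based adaptive token sampling.

    attn:   (N+1, N+1) attention matrix; row 0 belongs to the class token.
    values: (N+1, d) value vectors; row 0 belongs to the class token.
    Returns the indices of the tokens to keep. Index 0 (class token)
    is always kept; at most `max_tokens` other tokens are kept.
    """
    # Assumed significance score: how strongly the class token attends to
    # each token, weighted by the norm of that token's value vector.
    scores = attn[0, 1:] * np.linalg.norm(values[1:], axis=1)
    scores = scores / scores.sum()

    # Inverse-transform sampling at fixed, evenly spaced quantiles.
    cdf = np.cumsum(scores)
    quantiles = (np.arange(max_tokens) + 0.5) / max_tokens
    picked = np.searchsorted(cdf, quantiles)
    picked = np.minimum(picked, len(scores) - 1)  # guard against rounding

    # Several quantiles can land on the same dominant token, so the number
    # of distinct tokens depends on the input. No parameters are learned.
    kept = np.unique(picked) + 1  # shift past the class token
    return np.concatenate(([0], kept))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(9, 9))
    attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    values = rng.normal(size=(9, 4))
    print(adaptive_token_sample(attn, values, max_tokens=4))
```

In this sketch, an image whose class-token attention concentrates on one token keeps very few tokens, while one with evenly spread attention keeps up to `max_tokens`, which is the adaptive behaviour the abstract describes.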

Cross-modal Spectrum Transformation Network For Acoustic Scene Classification

Aug 13, 2021

Yang Liu, Alexandros Neophytou, Sunando Sengupta, Eric Sommerlade

Abstract: Convolutional neural networks (CNNs) with log-mel spectrum features have shown promising results for acoustic scene classification tasks. However, the performance of these CNN-based classifiers is still lacking, as they do not generalise well to unknown environments. To address this issue, we introduce an acoustic spectrum transformation network in which traditional log-mel spectra are transformed into imagined visual features (IVF). The imagined visual features are learned by exploiting the relationship between audio and visual features present in video recordings. An auto-encoder is used to encode images as visual features, and a transformation network learns how to generate imagined visual features from log-mel spectra. Our model is trained on a large dataset of YouTube videos. We test our proposed method on the scene classification tasks of DCASE and ESC-50, where our method outperforms other spectrum features, especially for unseen environments.

* ICASSP 2021

Relighting Images in the Wild with a Self-Supervised Siamese Auto-Encoder

Dec 11, 2020

Yang Liu, Alexandros Neophytou, Sunando Sengupta, Eric Sommerlade

Abstract: We propose a self-supervised method for relighting single-view images in the wild. The method is based on an auto-encoder which deconstructs an image into two separate encodings, relating to the scene illumination and content, respectively. In order to disentangle this embedding information without supervision, we exploit the assumption that some augmentation operations do not affect the image content and only affect the direction of the light. A novel loss function, called spherical harmonic loss, is introduced that forces the illumination embedding to take the form of a spherical harmonic vector. We train our model on large-scale datasets such as YouTube-8M and CelebA. Our experiments show that our method can correctly estimate scene illumination and realistically relight input images, without any supervision or a prior shape model. Compared to supervised methods, our approach has similar performance and avoids common lighting artifacts.
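
To make the "augmentation changes only the light direction" assumption concrete, here is a small NumPy sketch of how a spherical harmonic lighting vector behaves under a horizontal image flip. The abstract names neither the augmentations nor the spherical harmonic order, so the choice of a horizontal flip, the first-order basis `[1, x, y, z]` with normalisation constants folded into the coefficients, and all function names are assumptions made purely for illustration; this is not the authors' loss.

```python
import numpy as np

def sh_basis_order1(normal):
    """Unnormalised first-order SH basis [1, x, y, z] for a unit normal.

    Assumption: the usual normalisation constants are folded into the
    lighting coefficients to keep the sketch short.
    """
    x, y, z = normal
    return np.array([1.0, x, y, z])

def shade(light, normal):
    """Shading of a surface point as the dot product of light and basis."""
    return float(light @ sh_basis_order1(normal))

def flip_light_horizontally(light):
    """Lighting vector of the horizontally mirrored scene.

    Mirroring the image maps x to -x, so only the coefficient of the
    x basis function changes sign; the content is otherwise untouched.
    """
    flipped = np.array(light, dtype=float)
    flipped[1] = -flipped[1]
    return flipped

if __name__ == "__main__":
    light = np.array([0.8, 0.3, -0.2, 0.5])  # hypothetical lighting vector
    normal = np.array([0.6, 0.0, 0.8])       # unit surface normal
    mirrored_normal = normal * np.array([-1.0, 1.0, 1.0])
    # The mirrored point under the mirrored light has the same shading.
    print(shade(light, normal))
    print(shade(flip_light_horizontally(light), mirrored_normal))
```

A constraint of this kind gives a self-supervised signal: if the illumination embedding really is a spherical harmonic vector, the embedding of a flipped image should equal the flipped embedding of the original.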
