Ali Gholami

AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders

Nov 16, 2022

Wele Gedara Chaminda Bandara, Naman Patel, Ali Gholami, Mehdi Nikkhah, Motilal Agrawal, Vishal M. Patel

Abstract: Masked Autoencoders (MAEs) learn generalizable representations for image, text, audio, video, etc., by reconstructing masked input data from tokens of the visible data. Current MAE approaches for videos rely on random patch, tube, or frame-based masking strategies to select these tokens. This paper proposes AdaMAE, an adaptive masking strategy for MAEs that is end-to-end trainable. Our adaptive masking strategy samples visible tokens based on the semantic context using an auxiliary sampling network. This network estimates a categorical distribution over spacetime-patch tokens. The tokens that increase the expected reconstruction error are rewarded and selected as visible tokens, motivated by the policy gradient algorithm in reinforcement learning. We show that AdaMAE samples more tokens from the high spatiotemporal information regions, thereby allowing us to mask 95% of tokens, resulting in lower memory requirements and faster pre-training. We conduct ablation studies on the Something-Something v2 (SSv2) dataset to demonstrate the efficacy of our adaptive sampling approach and report state-of-the-art results of 70.0% and 81.7% in top-1 accuracy on SSv2 and Kinetics-400 action classification datasets with a ViT-Base backbone and 800 pre-training epochs.

* Code available at: https://github.com/wgcban/adamae
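
The adaptive sampling idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the logits stand in for the output of a hypothetical sampling network, the per-token reconstruction errors are made-up numbers, and the REINFORCE-style surrogate loss is an assumed form of the policy-gradient objective. All function names are illustrative.

```python
import math
import random

def softmax(logits):
    """Turn per-token scores into a categorical distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_visible(probs, num_visible, rng):
    """Sample distinct token indices without replacement, weighted by probs."""
    remaining = list(range(len(probs)))
    weights = list(probs)
    chosen = []
    for _ in range(num_visible):
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pick))
        weights.pop(pick)
    return sorted(chosen)

def sampling_loss(probs, masked, recon_error):
    """Assumed REINFORCE-style surrogate: minimizing it raises the
    probability of tokens whose reconstruction error is high."""
    return -sum(math.log(probs[i]) * recon_error[i] for i in masked) / len(masked)

rng = random.Random(0)
num_tokens = 20
mask_ratio = 0.95  # masking ratio quoted in the abstract

# Hypothetical sampling-network scores, one per spacetime-patch token.
logits = [rng.uniform(-1.0, 1.0) for _ in range(num_tokens)]
probs = softmax(logits)

num_visible = max(1, round(num_tokens * (1 - mask_ratio)))
visible = sample_visible(probs, num_visible, rng)
masked = [i for i in range(num_tokens) if i not in visible]

# Made-up per-token reconstruction errors for illustration only.
recon_error = [rng.uniform(0.1, 1.0) for _ in range(num_tokens)]
loss = sampling_loss(probs, masked, recon_error)
```

With 20 tokens and a 95% masking ratio, a single token stays visible and the remaining 19 contribute to the surrogate loss. In a real training loop the reconstruction error would come from the decoder and would normally be treated as a constant when differentiating this loss.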

MaskTune: Mitigating Spurious Correlations by Forcing to Explore

Oct 08, 2022

Saeid Asgari Taghanaki, Aliasghar Khani, Fereshte Khani, Ali Gholami, Linh Tran, Ali Mahdavi-Amiri, Ghassan Hamarneh

Abstract: A fundamental challenge of over-parameterized deep learning models is learning meaningful data representations that yield good performance on a downstream task without over-fitting spurious input features. This work proposes MaskTune, a masking strategy that prevents over-reliance on spurious (or a limited number of) features. MaskTune forces the trained model to explore new features during a single-epoch fine-tuning by masking previously discovered features. MaskTune, unlike earlier approaches for mitigating shortcut learning, does not require any supervision, such as annotating spurious features or labels for subgroup samples in a dataset. Our empirical results on biased MNIST, CelebA, Waterbirds, and ImageNet-9L datasets show that MaskTune is effective on tasks that often suffer from the existence of spurious correlations. Finally, we show that MaskTune outperforms or achieves similar performance to the competing methods when applied to the selective classification (classification with rejection option) task. Code for MaskTune is available at https://github.com/aliasgharkhani/Masktune.

* Accepted to NeurIPS 2022
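
The masking step described in the MaskTune abstract can be sketched as follows. This is an illustrative assumption rather than the released code: the saliency map is taken as given (how it is computed is not stated in the abstract), and the threshold of mean plus two standard deviations is a hypothetical choice made for this example.

```python
from statistics import mean, pstdev

def mask_salient(image, saliency, num_std=2.0):
    """Zero out pixels whose saliency exceeds mean + num_std * std.

    `image` and `saliency` are equally sized 2D lists; the threshold
    rule is an assumption made for this sketch.
    """
    flat = [s for row in saliency for s in row]
    threshold = mean(flat) + num_std * pstdev(flat)
    return [
        [0.0 if s > threshold else px for px, s in zip(img_row, sal_row)]
        for img_row, sal_row in zip(image, saliency)
    ]

# Toy 3x3 input where the model previously relied on the center pixel.
image = [[1.0, 1.0, 1.0],
         [1.0, 1.0, 1.0],
         [1.0, 1.0, 1.0]]
saliency = [[0.1, 0.1, 0.1],
            [0.1, 0.9, 0.1],
            [0.1, 0.1, 0.1]]

masked = mask_salient(image, saliency)
```

Fine-tuning for one epoch on inputs masked this way would push the model to find features other than the one it already depends on, which is the behavior the abstract describes.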

Counterbalancing Teacher: Regularizing Batch Normalized Models for Robustness

Jul 04, 2022

Saeid Asgari Taghanaki, Ali Gholami, Fereshte Khani, Kristy Choi, Linh Tran, Ran Zhang, Aliasghar Khani

Abstract: Batch normalization (BN) is a ubiquitous technique for training deep neural networks that accelerates their convergence to reach higher accuracy. However, we demonstrate that BN comes with a fundamental drawback: it incentivizes the model to rely on low-variance features that are highly specific to the training (in-domain) data, hurting generalization performance on out-of-domain examples. In this work, we investigate this phenomenon by first showing that removing BN layers across a wide range of architectures leads to lower out-of-domain and corruption errors at the cost of higher in-domain errors. We then propose Counterbalancing Teacher (CT), a method which leverages a frozen copy of the same model without BN as a teacher to enforce the student network's learning of robust representations by substantially adapting its weights through a consistency loss function. This regularization signal helps CT perform well in unforeseen data shifts, even without information from the target domain as in prior works. We theoretically show in an overparameterized linear regression setting why normalization leads to a model's reliance on such in-domain features, and empirically demonstrate the efficacy of CT by outperforming several baselines on robustness benchmarks such as CIFAR-10-C, CIFAR-100-C, and VLCS.
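
A teacher-student consistency objective of the kind the abstract mentions might look like the sketch below. The abstract does not give the form of the consistency loss, so the mean squared error between outputs, the additive weighting, and the names `ct_style_loss` and `weight` are all assumptions made for illustration.

```python
def mse(a, b):
    """Mean squared error between two equally long output vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def ct_style_loss(student_out, teacher_out, task_loss, weight=1.0):
    """Task loss plus a weighted consistency term (assumed form).

    `teacher_out` stands for the output of the frozen BN-free teacher,
    `student_out` for the output of the BN student being trained.
    """
    return task_loss + weight * mse(student_out, teacher_out)

# Toy outputs: the student agrees with the teacher on the first unit only.
total = ct_style_loss([1.0, 2.0], [1.0, 4.0], task_loss=0.5, weight=0.5)
```

Only the student would receive gradients from this term, since the teacher is described as frozen.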

Scan2Cap: Context-aware Dense Captioning in RGB-D Scans

Dec 03, 2020

Dave Zhenyu Chen, Ali Gholami, Matthias Nießner, Angel X. Chang

Abstract: We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors. As input, we assume a point cloud of a 3D scene; the expected output is the bounding boxes along with the descriptions for the underlying objects. To address the 3D object detection and description problems, we propose Scan2Cap, an end-to-end trained method, to detect objects in the input scene and describe them in natural language. We use an attention mechanism that generates descriptive tokens while referring to the related components in the local context. To reflect object relations (i.e., relative spatial relations) in the generated captions, we use a message passing graph module to facilitate learning object relation features. Our method can effectively localize and describe 3D objects in scenes from the ScanRefer dataset, outperforming 2D baseline methods by a significant margin (27.61% CIDEr@0.5IoU improvement).

* Video: https://youtu.be/AgmIpDbwTCY
