Abstract: Deep neural networks (DNNs) provide state-of-the-art results for a multitude of applications, but the use of DNNs for multimodal audiovisual applications remains an open problem. Current approaches that combine audiovisual information do not account for the inherent uncertainty of each modality or leverage its true classification confidence in the final decision. Our contribution in this work is to apply Bayesian variational inference to DNNs for audiovisual activity recognition and to quantify model uncertainty along with a principled measure of confidence. We propose a novel approach that combines deterministic and variational layers to estimate model uncertainty and principled confidence. Our experiments with in- and out-of-distribution samples drawn from a subset of the Moments-in-Time (MiT) dataset show a more reliable confidence measure than the non-Bayesian baseline. We also demonstrate that the uncertainty estimates obtained from this framework can identify out-of-distribution data on the UCF101 and MiT datasets. In the multimodal setting, the proposed framework improves precision-recall AUC by 14.4% on the MiT subset compared to the non-Bayesian baseline.
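To make the core idea concrete, the following is a minimal, hypothetical sketch (not the paper's implementation) of a variational layer whose weights have a Gaussian posterior, with predictive confidence and uncertainty obtained by Monte Carlo sampling of the weights. All class and function names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


class VariationalLinear:
    """Illustrative Bayesian linear layer: each weight has a learned
    Gaussian posterior q(w) = N(mu, sigma^2). Every forward pass draws
    a fresh weight sample via the reparameterization trick, so repeated
    passes yield different outputs whose spread reflects model uncertainty."""

    def __init__(self, n_in, n_out):
        self.mu = rng.normal(0.0, 0.1, (n_in, n_out))   # posterior means
        self.rho = np.full((n_in, n_out), -3.0)          # sigma = softplus(rho)

    def forward(self, x):
        sigma = np.log1p(np.exp(self.rho))               # softplus keeps sigma > 0
        w = self.mu + sigma * rng.normal(size=self.mu.shape)
        return x @ w


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def predict_with_uncertainty(layer, x, n_samples=50):
    """Average the softmax over n_samples weight draws: the mean gives a
    confidence estimate, and the predictive entropy serves as an
    uncertainty score usable for out-of-distribution detection."""
    probs = np.stack([softmax(layer.forward(x)) for _ in range(n_samples)])
    mean_p = probs.mean(axis=0)
    entropy = -(mean_p * np.log(mean_p + 1e-12)).sum(axis=-1)
    return mean_p, entropy
```

In a hybrid architecture of the kind the abstract describes, deterministic layers would extract features per modality and a variational head such as this would produce the final distribution, with high predictive entropy flagging inputs far from the training distribution.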