Wei-Hong Chuang

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

Apr 22, 2021
Via arXiv