Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Noel Vouitsis

TR0N: Translator Networks for 0-Shot Plug-and-Play Conditional Generation

Apr 26, 2023

Zhaoyan Liu, Noel Vouitsis, Satya Krishna Gorti, Jimmy Ba, Gabriel Loaiza-Ganem

Figure 1 for TR0N: Translator Networks for 0-Shot Plug-and-Play Conditional Generation

Figure 2 for TR0N: Translator Networks for 0-Shot Plug-and-Play Conditional Generation

Figure 3 for TR0N: Translator Networks for 0-Shot Plug-and-Play Conditional Generation

Figure 4 for TR0N: Translator Networks for 0-Shot Plug-and-Play Conditional Generation

Abstract:We propose TR0N, a highly general framework to turn pre-trained unconditional generative models, such as GANs and VAEs, into conditional models. The conditioning can be highly arbitrary, and requires only a pre-trained auxiliary model. For example, we show how to turn unconditional models into class-conditional ones with the help of a classifier, and also into text-to-image models by leveraging CLIP. TR0N learns a lightweight stochastic mapping which "translates" between the space of conditions and the latent space of the generative model, in such a way that the generated latent corresponds to a data sample satisfying the desired condition. The translated latent samples are then further improved upon through Langevin dynamics, enabling us to obtain higher-quality data samples. TR0N requires no training data nor fine-tuning, yet can achieve a zero-shot FID of 10.9 on MS-COCO, outperforming competing alternatives not only on this metric, but also in sampling speed -- all while retaining a much higher level of generality. Our code is available at https://github.com/layer6ai-labs/tr0n.

* Accepted at ICML 2023

Via

Access Paper or Ask Questions

X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval

Mar 28, 2022

Satya Krishna Gorti, Noel Vouitsis, Junwei Ma, Keyvan Golestan, Maksims Volkovs, Animesh Garg, Guangwei Yu

Figure 1 for X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval

Figure 2 for X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval

Figure 3 for X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval

Figure 4 for X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval

Abstract:In text-video retrieval, the objective is to learn a cross-modal similarity function between a text and a video that ranks relevant text-video pairs higher than irrelevant pairs. However, videos inherently express a much wider gamut of information than texts. Instead, texts often capture sub-regions of entire videos and are most semantically similar to certain frames within videos. Therefore, for a given text, a retrieval model should focus on the text's most semantically similar video sub-regions to make a more relevant comparison. Yet, most existing works aggregate entire videos without directly considering text. Common text-agnostic aggregations schemes include mean-pooling or self-attention over the frames, but these are likely to encode misleading visual information not described in the given text. To address this, we propose a cross-modal attention model called X-Pool that reasons between a text and the frames of a video. Our core mechanism is a scaled dot product attention for a text to attend to its most semantically similar frames. We then generate an aggregated video representation conditioned on the text's attention weights over the frames. We evaluate our method on three benchmark datasets of MSR-VTT, MSVD and LSMDC, achieving new state-of-the-art results by up to 12% in relative improvement in Recall@1. Our findings thereby highlight the importance of joint text-video reasoning to extract important visual cues according to text. Full code and demo can be found at: https://layer6ai-labs.github.io/xpool/

* CVPR 2022

Via

Access Paper or Ask Questions