The key challenge in cross-modal retrieval is to find similarities between objects represented in different modalities, such as image and text. However, the embeddings of each modality stem from unrelated feature spaces, which causes the notorious 'heterogeneity gap'. Currently, many cross-modal systems try to bridge this gap with self-attention. However, self-attention has been widely criticized for its quadratic complexity, which makes many real-life applications impractical. In response, we propose T-EMDE - a neural density estimator inspired by the recently introduced Efficient Manifold Density Estimator (EMDE) from the area of recommender systems. EMDE operates on sketches - representations especially suitable for multimodal operations. However, EMDE is non-differentiable and ingests precomputed, static embeddings. With T-EMDE we introduce a trainable version of EMDE which allows full end-to-end training. In contrast to self-attention, the complexity of our solution is linear in the number of tokens/segments. As such, T-EMDE is a drop-in replacement for the self-attention module, improving both speed and metric performance in cross-modal settings. It facilitates communication between modalities, as each global text/image representation is expressed with a standardized sketch histogram which encodes the same manifold structures irrespective of the underlying modality. We evaluate T-EMDE by incorporating it into two recent cross-modal SOTA models, achieving new state-of-the-art results on multiple datasets and decreasing model latency by up to 20%.
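To make the core idea concrete, the following is a minimal, hypothetical PyTorch sketch of a trainable sketch-pooling module of the kind described above: it pools a variable number of token/segment embeddings into a fixed-size sketch histogram at cost linear in sequence length. All names and hyperparameters (SketchPool, num_sketches, num_buckets) are illustrative assumptions, not the authors' exact T-EMDE implementation.

```python
from typing import Optional

import torch
import torch.nn as nn


class SketchPool(nn.Module):
    """Illustrative trainable sketch pooling: tokens -> fixed-size histogram."""

    def __init__(self, dim: int, num_sketches: int = 8, num_buckets: int = 128):
        super().__init__()
        self.num_sketches = num_sketches
        self.num_buckets = num_buckets
        # One learnable projection per independent sketch; each projection scores
        # how strongly a token belongs to each bucket.
        self.proj = nn.Linear(dim, num_sketches * num_buckets)

    def forward(self, tokens: torch.Tensor,
                mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # tokens: (batch, seq_len, dim); mask: (batch, seq_len), 1 for valid tokens.
        b, t, _ = tokens.shape
        logits = self.proj(tokens).view(b, t, self.num_sketches, self.num_buckets)
        # Soft bucket assignment per token -- a differentiable stand-in for the
        # hard assignment of the original (non-trainable) EMDE.
        assign = logits.softmax(dim=-1)
        if mask is not None:
            assign = assign * mask[:, :, None, None]
        # Summing assignments over tokens yields one histogram per sketch; each
        # token is touched once, hence linear complexity in seq_len.
        sketch = assign.sum(dim=1)  # (batch, num_sketches, num_buckets)
        sketch = sketch / sketch.sum(dim=-1, keepdim=True).clamp_min(1e-6)
        return sketch.flatten(1)    # fixed-size global representation


# Usage: the same kind of module can pool either image-region or word-token
# embeddings, so both modalities end up in a comparable histogram space.
pool = SketchPool(dim=512)
image_regions = torch.randn(2, 36, 512)   # e.g. 36 detected regions per image
text_tokens = torch.randn(2, 20, 512)     # e.g. 20 word tokens per caption
img_sketch, txt_sketch = pool(image_regions), pool(text_tokens)
print(img_sketch.shape, txt_sketch.shape)  # torch.Size([2, 1024]) torch.Size([2, 1024])
```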