Geonmo Gu

Language-only Efficient Training of Zero-shot Composed Image Retrieval

Dec 04, 2023

Geonmo Gu, Sanghyuk Chun, Wonjae Kim, Yoohoon Kang, Sangdoo Yun

Abstract: The composed image retrieval (CIR) task takes a composed query of image and text, aiming to retrieve images relevant to both conditions. Conventional CIR approaches need a training dataset composed of triplets of query image, query text, and target image, which is very expensive to collect. Several recent works have adopted the zero-shot (ZS) CIR paradigm to tackle the issue without using pre-collected triplets. However, the existing ZS-CIR methods show limited backbone scalability and generalizability due to the lack of diversity of the input texts during training. We propose a novel CIR framework that uses only language for its training. Our LinCIR (Language-only training for CIR) can be trained only with text datasets by a novel self-supervision named self-masking projection (SMP). We project the text latent embedding to the token embedding space and construct a new text by replacing the keyword tokens of the original text. Then, we let the new and original texts have the same latent embedding vector. With this simple strategy, LinCIR is surprisingly efficient and highly effective; LinCIR with a CLIP ViT-G backbone is trained in 48 minutes and shows the best ZS-CIR performance on four different CIR benchmarks, CIRCO, GeneCIS, FashionIQ, and CIRR, even outperforming a supervised method on FashionIQ. Code is available at https://github.com/navervision/lincir
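
The self-masking projection idea in the abstract above can be illustrated with a toy sketch. This is not the LinCIR implementation: the averaging encoder, the linear projection matrix `weight`, the function names, and the use of a mean squared error between the two latents are all assumptions made here for illustration. The abstract only states that keyword tokens are replaced and that the new and original texts should share the same latent embedding.

```python
def mean_encoder(token_embeddings):
    """Toy stand-in for a frozen text encoder: average the token embeddings."""
    count = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / count for d in range(dim)]


def project(latent, weight):
    """Hypothetical linear projection from latent space to token embedding space."""
    return [sum(w * x for w, x in zip(row, latent)) for row in weight]


def smp_loss(token_embeddings, keyword_positions, weight):
    """Replace keyword tokens with the projected latent, then compare latents.

    Assumption: the two latents are compared with a mean squared error.
    """
    latent = mean_encoder(token_embeddings)
    pseudo_token = project(latent, weight)
    masked = [
        pseudo_token if index in keyword_positions else tok
        for index, tok in enumerate(token_embeddings)
    ]
    new_latent = mean_encoder(masked)
    return sum((a - b) ** 2 for a, b in zip(latent, new_latent)) / len(latent)


# Two 2-dimensional tokens; the second one is treated as the keyword.
tokens = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
loss = smp_loss(tokens, {1}, identity)
```

In this toy setting, training would adjust `weight` so that the loss shrinks, which means the projected token carries the information of the keywords it replaced.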

* First two authors contributed equally; 16 pages, 2.9MB

CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion

Mar 21, 2023

Geonmo Gu, Sanghyuk Chun, Wonjae Kim, HeeJae Jun, Yoohoon Kang, Sangdoo Yun

Abstract: This paper proposes a novel diffusion-based model, CompoDiff, for solving Composed Image Retrieval (CIR) with latent diffusion and presents a newly created dataset of 18 million triplets of reference image, condition, and corresponding target image to train the model. CompoDiff not only achieves a new zero-shot state-of-the-art on a CIR benchmark such as FashionIQ but also enables a more versatile CIR by accepting various conditions, such as negative text and image mask conditions, which are unavailable with existing CIR methods. In addition, the CompoDiff features are on the intact CLIP embedding space so that they can be directly used for all existing models exploiting the CLIP space. The code and dataset used for the training, and the pre-trained weights are available at https://github.com/navervision/CompoDiff

* First two authors contributed equally; 23 pages, 4.8MB

Group Generalized Mean Pooling for Vision Transformer

Dec 08, 2022

Byungsoo Ko, Han-Gyu Kim, Byeongho Heo, Sangdoo Yun, Sanghyuk Chun, Geonmo Gu, Wonjae Kim

Abstract: Vision Transformer (ViT) extracts the final representation from either the class token or an average of all patch tokens, following the architecture of Transformer in Natural Language Processing (NLP) or Convolutional Neural Networks (CNNs) in computer vision. However, studies of the best way to aggregate the patch tokens are still limited to average pooling, while widely used pooling strategies, such as max and GeM pooling, can be considered. Despite their effectiveness, the existing pooling strategies do not consider the architecture of ViT and the channel-wise difference in the activation maps, aggregating the crucial and trivial channels with the same importance. In this paper, we present Group Generalized Mean (GGeM) pooling as a simple yet powerful pooling strategy for ViT. GGeM divides the channels into groups and computes GeM pooling with a shared pooling parameter per group. As ViT groups the channels via a multi-head attention mechanism, grouping the channels by GGeM leads to lower head-wise dependence while amplifying important channels on the activation maps. Exploiting GGeM shows 0.1%p to 0.7%p performance boosts compared to the baselines and achieves state-of-the-art performance for ViT-Base and ViT-Large models on the ImageNet-1K classification task. Moreover, GGeM outperforms the existing pooling strategies on image retrieval and multi-modal representation learning tasks, demonstrating the superiority of GGeM for a variety of tasks. GGeM is a simple algorithm in that only a few lines of code are necessary for implementation.
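
As a rough illustration of the grouping described above, the sketch below applies generalized mean pooling over patch tokens with one pooling parameter shared by all channels in a group. It is a minimal sketch, not the authors' code: the function name, the fixed (rather than learned) pooling parameters, and the clamping of activations to a small `eps` (borrowed from common GeM implementations) are assumptions.

```python
def ggem_pool(tokens, group_powers, eps=1e-6):
    """Pool N token vectors (C channels each) into a single C-dim vector.

    Channels are split into len(group_powers) contiguous groups, and every
    channel in group g is pooled as (mean(x ** p_g)) ** (1 / p_g).
    """
    num_tokens = len(tokens)
    num_channels = len(tokens[0])
    num_groups = len(group_powers)
    if num_channels % num_groups != 0:
        raise ValueError("channel count must be divisible by the group count")
    group_size = num_channels // num_groups

    pooled = []
    for channel in range(num_channels):
        power = group_powers[channel // group_size]
        # Assumption: clamp to eps so that fractional powers stay well defined.
        mean_power = sum(max(tok[channel], eps) ** power for tok in tokens) / num_tokens
        pooled.append(mean_power ** (1.0 / power))
    return pooled


# Two tokens, two channels, two groups: p = 1 gives average pooling,
# while larger p moves the result toward max pooling.
pooled = ggem_pool([[1.0, 3.0], [3.0, 1.0]], group_powers=[1.0, 2.0])
```

With `p = 1` a group reduces to average pooling, and as `p` grows the result approaches max pooling, so per-group parameters let different channel groups sit at different points between the two.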

Granularity-aware Adaptation for Image Retrieval over Multiple Tasks

Oct 05, 2022

Jon Almazán, Byungsoo Ko, Geonmo Gu, Diane Larlus, Yannis Kalantidis

Abstract: Strong image search models can be learned for a specific domain, i.e., a set of labels, provided that some labeled images of that domain are available. A practical visual search model, however, should be versatile enough to solve multiple retrieval tasks simultaneously, even if those cover very different specialized domains. Additionally, it should be able to benefit from even unlabeled images from these various retrieval tasks. This is the more practical scenario that we consider in this paper. We address it with the proposed Grappa, an approach that starts from a strong pretrained model, and adapts it to tackle multiple retrieval tasks concurrently, using only unlabeled images from the different task domains. We extend the pretrained model with multiple independently trained sets of adaptors that use pseudo-label sets of different sizes, effectively mimicking different pseudo-granularities. We reconcile all adaptor sets into a single unified model suited for all retrieval tasks by learning fusion layers that we guide by propagating pseudo-granularity attentions across neighbors in the feature space. Results on a benchmark composed of six heterogeneous retrieval tasks show that the unsupervised Grappa model improves the zero-shot performance of a state-of-the-art self-supervised learning model, and in some places reaches or improves over a task label-aware oracle that selects the most fitting pseudo-granularity per task.

* ECCV 2022

Large-scale Bilingual Language-Image Contrastive Learning

Apr 15, 2022

Byungsoo Ko, Geonmo Gu

Abstract: This paper is a technical report to share our experience and findings building a Korean and English bilingual multimodal model. While many multimodal datasets focus on English and multilingual multimodal research uses machine-translated texts, such machine-translated texts are limited in describing unique expressions, cultural information, and proper nouns in languages other than English. In this work, we collect 1.1 billion image-text pairs (708 million Korean and 476 million English) and train a bilingual multimodal model named KELIP. We introduce simple yet effective training schemes, including MAE pre-training and multi-crop augmentation. Extensive experiments demonstrate that a model trained with such training schemes shows competitive performance in both languages. Moreover, we discuss multimodal-related research questions: 1) strong augmentation-based methods can distract the model from learning proper multimodal relations; 2) training a multimodal model without cross-lingual relation can learn the relation via visual semantics; 3) our bilingual KELIP can capture cultural differences of visual semantics for the same meaning of words; 4) a large-scale multimodal model can be used for multimodal feature analogy. We hope that this work will provide helpful experience and findings for future research. We provide an open-source pre-trained KELIP.

* Accepted by ICLRW2022

Self-Distilled Hashing for Deep Image Retrieval

Dec 16, 2021

Young Kyun Jang, Geonmo Gu, Byungsoo Ko, Nam Ik Cho

Abstract: In hash-based image retrieval systems, a transformed version of an original input usually generates a different code, deteriorating the retrieval accuracy. To mitigate this issue, data augmentation can be applied during training. However, even if the augmented samples of one content are similar in real space, the quantization can scatter them far away in Hamming space. This results in representation discrepancies that can impede training and degrade performance. In this work, we propose a novel self-distilled hashing scheme to minimize the discrepancy while exploiting the potential of augmented data. By transferring the hash knowledge of the weakly-transformed samples to the strong ones, we make the hash code insensitive to various transformations. We also introduce hash proxy-based similarity learning and binary cross entropy-based quantization loss to provide fine quality hash codes. Ultimately, we construct a deep hashing framework that generates discriminative hash codes. Extensive experiments on benchmarks verify that our self-distillation improves the existing deep hashing approaches, and our framework achieves state-of-the-art retrieval results. The code will be released soon.
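
The discrepancy the abstract describes, where two views that are close in real space land far apart in Hamming space, can be reproduced with a small example. The sign-based quantizer and the sample vectors below are assumptions chosen for illustration; the abstract does not specify the quantization function.

```python
import math


def binarize(vector):
    """Quantize a real-valued vector to a +1/-1 hash code by sign (assumed quantizer)."""
    return [1 if value >= 0 else -1 for value in vector]


def hamming_distance(code_a, code_b):
    """Count the positions where two hash codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


def euclidean_distance(vec_a, vec_b):
    """Distance between two vectors in real space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))


# Hypothetical outputs for a weakly and a strongly augmented view of one image.
weak_view = [0.02, -0.01, 0.90]
strong_view = [-0.02, 0.01, 0.90]

real_gap = euclidean_distance(weak_view, strong_view)
hamming_gap = hamming_distance(binarize(weak_view), binarize(strong_view))
```

Here the two views are under 0.05 apart in real space, yet two of their three hash bits differ, because both small components sit next to the quantization boundary at zero.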

Towards Real-time and Light-weight Line Segment Detection

Jun 01, 2021

Geonmo Gu, Byungsoo Ko, SeoungHyun Go, Sung-Hyun Lee, Jingeun Lee, Minchul Shin

Abstract: Previous deep learning-based line segment detection (LSD) methods suffer from immense model size and high computational cost for line prediction. This prevents them from running real-time inference in computationally restricted environments. In this paper, we propose a real-time and light-weight line segment detector for resource-constrained environments named Mobile LSD (M-LSD). We design an extremely efficient LSD architecture by minimizing the backbone network and removing the typical multi-module process for line prediction in previous methods. To maintain competitive performance with such a light-weight network, we present novel training schemes: Segments of Line segment (SoL) augmentation and a geometric learning scheme. SoL augmentation splits a line segment into multiple subparts, which are used to provide auxiliary line data during the training process. Moreover, the geometric learning scheme allows a model to capture additional geometry cues from matching loss, junction and line segmentation, length and degree regression. Compared with TP-LSD-Lite, previously the best real-time LSD method, our model (M-LSD-tiny) achieves competitive performance with 2.5% of model size and an increase of 130.5% in inference speed on GPU when evaluated with Wireframe and YorkUrban datasets. Furthermore, our model runs at 56.8 FPS and 48.6 FPS on Android and iPhone mobile devices, respectively. To the best of our knowledge, this is the first real-time deep LSD method available on mobile devices.
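
The splitting step of SoL augmentation can be sketched as follows. This is a simplified illustration only: the abstract says a segment is split into multiple subparts but does not say how, so the equal-length, non-overlapping split and the function name `split_segment` are assumptions.

```python
def split_segment(start, end, num_parts):
    """Split the segment from start to end into num_parts equal sub-segments.

    Points are (x, y) tuples; the result is a list of (sub_start, sub_end) pairs.
    """
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    (x0, y0), (x1, y1) = start, end
    points = [
        (x0 + (x1 - x0) * i / num_parts, y0 + (y1 - y0) * i / num_parts)
        for i in range(num_parts + 1)
    ]
    # Pair each point with its successor to form consecutive sub-segments.
    return list(zip(points[:-1], points[1:]))


# The sub-segments would serve as auxiliary line targets next to the original.
sub_segments = split_segment((0, 0), (4, 2), num_parts=2)
```

Each sub-segment lies on the original line, so it adds training targets without needing any extra annotation.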

RTIC: Residual Learning for Text and Image Composition using Graph Convolutional Network

Apr 08, 2021

Minchul Shin, Yoonjae Cho, Byungsoo Ko, Geonmo Gu

Abstract: In this paper, we study the compositional learning of images and texts for image retrieval. The query is given in the form of an image and text that describes the desired modifications to the image; the goal is to retrieve the target image that satisfies the given modifications and resembles the query by composing information in both the text and image modalities. To accomplish this task, we propose a simple new architecture using skip connections that can effectively encode the errors between the source and target images in the latent space. Furthermore, we introduce a novel method that combines the graph convolutional network (GCN) with existing composition methods. We find that the combination consistently improves the performance in a plug-and-play manner. We perform thorough and exhaustive experiments on several widely used datasets, and achieve state-of-the-art scores on the task with our model. To ensure fairness in comparison, we suggest a strict standard for the evaluation because a small difference in the training conditions can significantly affect the final performance. We release our implementation, including that of all the compared methods, for reproducibility.

Learning with Memory-based Virtual Classes for Deep Metric Learning

Mar 31, 2021

Byungsoo Ko, Geonmo Gu, Han-Gyu Kim

Abstract: The core of deep metric learning (DML) involves learning visual similarities in high-dimensional embedding space. One of the main challenges is to generalize from seen classes of training data to unseen classes of test data. Recent works have focused on exploiting past embeddings to increase the number of instances for the seen classes. Such methods achieve performance improvement via augmentation, while the strong focus on seen classes still remains. This can be undesirable for DML, where training and test data exhibit entirely different classes. In this work, we present a novel training strategy for DML called MemVir. Unlike previous works, MemVir memorizes both embedding features and class weights to utilize them as additional virtual classes. The exploitation of virtual classes not only utilizes augmented information for training but also alleviates a strong focus on seen classes for better generalization. Moreover, we embed the idea of curriculum learning by slowly adding virtual classes for a gradual increase in learning difficulty, which improves the learning stability as well as the final performance. MemVir can be easily applied to many existing loss functions without any modification. Extensive experimental results on famous benchmarks demonstrate the superiority of MemVir over state-of-the-art competitors. The code for MemVir will be made publicly available.

Proxy Synthesis: Learning with Synthetic Classes for Deep Metric Learning

Mar 29, 2021

Geonmo Gu, Byungsoo Ko, Han-Gyu Kim

Abstract: One of the main purposes of deep metric learning is to construct an embedding space that has well-generalized embeddings on both seen (training) classes and unseen (test) classes. Most existing works have tried to achieve this using different types of metric objectives and hard sample mining strategies with given training data. However, learning with only the training data can overfit to the seen classes, leading to a lack of generalization capability on unseen classes. To address this problem, we propose a simple regularizer called Proxy Synthesis that exploits synthetic classes for stronger generalization in deep metric learning. The proposed method generates synthetic embeddings and proxies that work as synthetic classes, and they mimic unseen classes when computing proxy-based losses. Proxy Synthesis derives an embedding space considering class relations and smooth decision boundaries for robustness on unseen classes. Our method is applicable to any proxy-based losses, including softmax and its variants. Extensive experiments on four famous benchmarks in image retrieval tasks demonstrate that Proxy Synthesis significantly boosts the performance of proxy-based losses and achieves state-of-the-art performance.
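
One way to picture a synthetic class is as a blend of two existing classes. The sketch below builds a synthetic embedding and its matching synthetic proxy by linear interpolation. The abstract does not state how the synthetic pairs are generated, so the interpolation, the mixing coefficient `lam`, and the function names are assumptions made here for illustration.

```python
def interpolate(vec_a, vec_b, lam):
    """Linear blend lam * vec_a + (1 - lam) * vec_b."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(vec_a, vec_b)]


def synthesize_class(emb_a, proxy_a, emb_b, proxy_b, lam):
    """Build a synthetic (embedding, proxy) pair from two different classes.

    Assumption: the same coefficient is used for the embedding and the proxy,
    so the synthetic embedding stays matched to its synthetic proxy.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return interpolate(emb_a, emb_b, lam), interpolate(proxy_a, proxy_b, lam)


# Blend one sample and proxy from class A with one sample and proxy from class B.
synthetic_embedding, synthetic_proxy = synthesize_class(
    emb_a=[1.0, 0.0], proxy_a=[2.0, 0.0],
    emb_b=[0.0, 1.0], proxy_b=[0.0, 2.0],
    lam=0.5,
)
```

The synthetic pair would then be appended to the batch and to the proxy set before computing a proxy-based loss, so the model sees classes that sit between the ones in the training data.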

* Accepted by AAAI2021
