Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kofi Boakye

Image Search with Text Feedback by Additive Attention Compositional Learning

Mar 08, 2022

Yuxin Tian, Shawn Newsam, Kofi Boakye

Figure 1 for Image Search with Text Feedback by Additive Attention Compositional Learning

Figure 2 for Image Search with Text Feedback by Additive Attention Compositional Learning

Figure 3 for Image Search with Text Feedback by Additive Attention Compositional Learning

Figure 4 for Image Search with Text Feedback by Additive Attention Compositional Learning

Abstract:Effective image retrieval with text feedback stands to impact a range of real-world applications, such as e-commerce. Given a source image and text feedback that describes the desired modifications to that image, the goal is to retrieve the target images that resemble the source yet satisfy the given modifications by composing a multi-modal (image-text) query. We propose a novel solution to this problem, Additive Attention Compositional Learning (AACL), that uses a multi-modal transformer-based architecture and effectively models the image-text contexts. Specifically, we propose a novel image-text composition module based on additive attention that can be seamlessly plugged into deep neural networks. We also introduce a new challenging benchmark derived from the Shopping100k dataset. AACL is evaluated on three large-scale datasets (FashionIQ, Fashion200k, and Shopping100k), each with strong baselines. Extensive experiments show that AACL achieves new state-of-the-art results on all three datasets.

Via

Access Paper or Ask Questions

Modality-Agnostic Attention Fusion for visual search with text feedback

Jun 30, 2020

Eric Dodds, Jack Culpepper, Simao Herdade, Yang Zhang, Kofi Boakye

Figure 1 for Modality-Agnostic Attention Fusion for visual search with text feedback

Figure 2 for Modality-Agnostic Attention Fusion for visual search with text feedback

Figure 3 for Modality-Agnostic Attention Fusion for visual search with text feedback

Figure 4 for Modality-Agnostic Attention Fusion for visual search with text feedback

Abstract:Image retrieval with natural language feedback offers the promise of catalog search based on fine-grained visual features that go beyond objects and binary attributes, facilitating real-world applications such as e-commerce. Our Modality-Agnostic Attention Fusion (MAAF) model combines image and text features and outperforms existing approaches on two visual search with modifying phrase datasets, Fashion IQ and CSS, and performs competitively on a dataset with only single-word modifications, Fashion200k. We also introduce two new challenging benchmarks adapted from Birds-to-Words and Spot-the-Diff, which provide new settings with rich language inputs, and we show that our approach without modification outperforms strong baselines. To better understand our model, we conduct detailed ablations on Fashion IQ and provide visualizations of the surprising phenomenon of words avoiding "attending" to the image region they refer to.

* 14 pages, 8 figures

Via

Access Paper or Ask Questions

Image Captioning: Transforming Objects into Words

Jun 14, 2019

Simao Herdade, Armin Kappeler, Kofi Boakye, Joao Soares

Figure 1 for Image Captioning: Transforming Objects into Words

Figure 2 for Image Captioning: Transforming Objects into Words

Figure 3 for Image Captioning: Transforming Objects into Words

Figure 4 for Image Captioning: Transforming Objects into Words

Abstract:Image captioning models typically follow an encoder-decoder architecture which uses abstract image feature vectors as input to the encoder. One of the most successful algorithms uses feature vectors extracted from the region proposals obtained from an object detector. In this work we introduce the Object Relation Transformer, that builds upon this approach by explicitly incorporating information about the spatial relationship between input detected objects through geometric attention. Quantitative and qualitative results demonstrate the importance of such geometric attention for image captioning, leading to improvements on all common captioning metrics on the MS-COCO dataset.

* 10 pages

Via

Access Paper or Ask Questions

Tag Prediction at Flickr: a View from the Darkroom

Dec 19, 2017

Kofi Boakye, Sachin Farfade, Hamid Izadinia, Yannis Kalantidis, Pierre Garrigues

Figure 1 for Tag Prediction at Flickr: a View from the Darkroom

Figure 2 for Tag Prediction at Flickr: a View from the Darkroom

Figure 3 for Tag Prediction at Flickr: a View from the Darkroom

Abstract:Automated photo tagging has established itself as one of the most compelling applications of deep learning. While deep convolutional neural networks have repeatedly demonstrated top performance on standard datasets for classification, there are a number of often overlooked but important considerations when deploying this technology in a real-world scenario. In this paper, we present our efforts in developing a large-scale photo tagging system for Flickr photo search. We discuss topics including how to 1) select the tags that matter most to our users; 2) develop lightweight, high-performance models for tag prediction; and 3) leverage the power of large amounts of noisy data for training. Our results demonstrate that, for real-world datasets, training exclusively with this noisy data yields performance on par with the standard paradigm of first pre-training on clean data and then fine-tuning. In addition, we observe that the models trained with user-generated data can yield better fine-tuning results when a small amount of clean data is available. As such, we advocate for the approach of harnessing user-generated data in large-scale systems.

* Presented at the ACM Multimedia Thematic Workshops, 2017

Via

Access Paper or Ask Questions

A Large Contextual Dataset for Classification, Detection and Counting of Cars with Deep Learning

Sep 14, 2016

T. Nathan Mundhenk, Goran Konjevod, Wesam A. Sakla, Kofi Boakye

Figure 1 for A Large Contextual Dataset for Classification, Detection and Counting of Cars with Deep Learning

Figure 2 for A Large Contextual Dataset for Classification, Detection and Counting of Cars with Deep Learning

Figure 3 for A Large Contextual Dataset for Classification, Detection and Counting of Cars with Deep Learning

Figure 4 for A Large Contextual Dataset for Classification, Detection and Counting of Cars with Deep Learning

Abstract:We have created a large diverse set of cars from overhead images, which are useful for training a deep learner to binary classify, detect and count them. The dataset and all related material will be made publically available. The set contains contextual matter to aid in identification of difficult targets. We demonstrate classification and detection on this dataset using a neural network we call ResCeption. This network combines residual learning with Inception-style layers and is used to count cars in one look. This is a new way to count objects rather than by localization or density estimation. It is fairly accurate, fast and easy to implement. Additionally, the counting method is not car or scene specific. It would be easy to train this method to count other kinds of objects and counting over new scenes requires no extra set up or assumptions about object locations.

* ECCV 2016 Pre-press revision

Via

Access Paper or Ask Questions

Large-Scale Deep Learning on the YFCC100M Dataset

Feb 11, 2015

Karl Ni, Roger Pearce, Kofi Boakye, Brian Van Essen, Damian Borth, Barry Chen, Eric Wang

Figure 1 for Large-Scale Deep Learning on the YFCC100M Dataset

Figure 2 for Large-Scale Deep Learning on the YFCC100M Dataset

Figure 3 for Large-Scale Deep Learning on the YFCC100M Dataset

Figure 4 for Large-Scale Deep Learning on the YFCC100M Dataset

Abstract:We present a work-in-progress snapshot of learning with a 15 billion parameter deep learning network on HPC architectures applied to the largest publicly available natural image and video dataset released to-date. Recent advancements in unsupervised deep neural networks suggest that scaling up such networks in both model and training dataset size can yield significant improvements in the learning of concepts at the highest layers. We train our three-layer deep neural network on the Yahoo! Flickr Creative Commons 100M dataset. The dataset comprises approximately 99.2 million images and 800,000 user-created videos from Yahoo's Flickr image and video sharing platform. Training of our network takes eight days on 98 GPU nodes at the High Performance Computing Center at Lawrence Livermore National Laboratory. Encouraging preliminary results and future research directions are presented and discussed.

Via

Access Paper or Ask Questions