
Ali Furkan Biten

Show, Interpret and Tell: Entity-aware Contextualised Image Captioning in Wikipedia

Sep 21, 2022

Khanh Nguyen, Ali Furkan Biten, Andres Mafla, Lluis Gomez, Dimosthenis Karatzas


Abstract: Humans exploit prior knowledge to describe images, and are able to adapt their explanation to specific contextual information, even to the extent of inventing plausible explanations when contextual information and images do not match. In this work, we propose the novel task of captioning Wikipedia images by integrating contextual knowledge. Specifically, we produce models that jointly reason over Wikipedia articles, Wikimedia images and their associated descriptions to produce contextualized captions. In particular, a similar Wikimedia image can be used to illustrate different articles, and the produced caption needs to be adapted to the specific context, which allows us to explore the limits of a model in adjusting captions to different contextual information. A particularly challenging aspect of this domain is dealing with out-of-dictionary words and Named Entities. To address this, we propose a pre-training objective, Masked Named Entity Modeling (MNEM), and show that this pretext task yields an improvement compared to baseline models. Furthermore, we verify that a model pre-trained with the MNEM objective on Wikipedia generalizes well to a News Captioning dataset. Additionally, we define two different test splits according to the difficulty of the captioning task. We offer insights on the role and the importance of each modality and highlight the limitations of our model. The code, models and data splits will be made publicly available upon acceptance.
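
The abstract names Masked Named Entity Modeling (MNEM) but does not spell out the masking procedure. The sketch below shows one plausible way to build masked inputs and prediction targets from a tokenized caption. The function name `mask_named_entities`, the `[MASK]` token, the `(start, end)` span format and the `mask_prob` parameter are all assumptions made for illustration, not the authors' implementation.

```python
import random


def mask_named_entities(tokens, entity_spans, mask_token="[MASK]",
                        mask_prob=1.0, seed=0):
    """Replace tokens inside named-entity spans with a mask token.

    tokens:       list of str, the tokenized caption
    entity_spans: list of (start, end) index pairs, end exclusive
                  (assumed to come from an external NER tagger)
    Returns (masked_tokens, labels), where labels[i] holds the original
    token at every masked position and None elsewhere.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for start, end in entity_spans:
        # Decide once per entity so that a whole span is masked together.
        if rng.random() < mask_prob:
            for i in range(start, end):
                labels[i] = tokens[i]
                masked[i] = mask_token
    return masked, labels


if __name__ == "__main__":
    caption = ["The", "Eiffel", "Tower", "stands", "in", "Paris"]
    spans = [(1, 3), (5, 6)]  # hypothetical NER output
    print(mask_named_entities(caption, spans))
```

Masking whole spans rather than individual tokens is a design choice of this sketch: it forces a model to recover the full entity from the surrounding context instead of from the entity's own remaining tokens.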


MUST-VQA: MUltilingual Scene-text VQA

Sep 14, 2022

Emanuele Vivoli, Ali Furkan Biten, Andres Mafla, Dimosthenis Karatzas, Lluis Gomez


Abstract: In this paper, we present a framework for Multilingual Scene Text Visual Question Answering that deals with new languages in a zero-shot fashion. Specifically, we consider the task of Scene Text Visual Question Answering (STVQA) in which the question can be asked in different languages and is not necessarily aligned with the scene text language. Thus, we first introduce a natural step towards a more generalized version of STVQA: MUST-VQA. Accounting for this, we discuss two evaluation scenarios in the constrained setting, namely IID and zero-shot, and we demonstrate that the models can perform on a par in the zero-shot setting. We further provide extensive experimentation and show the effectiveness of adapting multilingual language models to STVQA tasks.

* To appear in the Text in Everything Workshop at ECCV 2022


Out-of-Vocabulary Challenge Report

Sep 14, 2022

Sergi Garcia-Bordils, Andrés Mafla, Ali Furkan Biten, Oren Nuriel, Aviad Aberdam, Shai Mazor, Ron Litman, Dimosthenis Karatzas


Abstract: This paper presents the final results of the Out-Of-Vocabulary 2022 (OOV) challenge. The OOV contest introduces an important aspect that is not commonly studied by Optical Character Recognition (OCR) models, namely, the recognition of scene text instances that were unseen at training time. The competition compiles a collection of public scene text datasets comprising 326,385 images with 4,864,405 scene text instances, thus covering a wide range of data distributions. A new and independent validation and test set is formed with scene text instances that are out of vocabulary at training time. The competition was structured in two tasks, end-to-end and cropped scene text recognition, respectively. A thorough analysis of results from baselines and different participants is presented. Interestingly, current state-of-the-art models show a significant performance gap under the newly studied setting. We conclude that the OOV dataset proposed in this challenge opens an essential area to be explored in order to develop scene text models that achieve more robust and generalized predictions.

* To appear in the Text in Everything Workshop at ECCV 2022
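
The out-of-vocabulary condition described in the abstract above (test instances whose words never occur at training time) can be illustrated with a small filter. This is only a sketch: the function name `split_by_vocabulary` and the case-insensitive comparison are assumptions, and the challenge's actual normalization rules are not stated in the abstract.

```python
def split_by_vocabulary(train_words, eval_words, case_sensitive=False):
    """Partition eval_words into (in_vocabulary, out_of_vocabulary).

    A word counts as out of vocabulary when its normalized form never
    appears among the training words. Lower-casing is an assumed
    normalization, chosen here only for illustration.
    """
    normalize = (lambda w: w) if case_sensitive else str.lower
    vocabulary = {normalize(w) for w in train_words}
    in_vocab = [w for w in eval_words if normalize(w) in vocabulary]
    oov = [w for w in eval_words if normalize(w) not in vocabulary]
    return in_vocab, oov


if __name__ == "__main__":
    train = ["STOP", "exit", "Cafe"]
    candidates = ["stop", "Bakery", "EXIT", "zebra"]
    print(split_by_vocabulary(train, candidates))
```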


Text-DIAE: Degradation Invariant Autoencoders for Text Recognition and Document Enhancement

Mar 16, 2022

Mohamed Ali Souibgui, Sanket Biswas, Andres Mafla, Ali Furkan Biten, Alicia Fornés, Yousri Kessentini, Josep Lladós, Lluis Gomez, Dimosthenis Karatzas


Abstract: In this work, we propose the Text-Degradation Invariant Auto Encoder (Text-DIAE), aimed at solving two tasks: text recognition (handwritten or scene-text) and document image enhancement. We define three pretext tasks as learning objectives to be optimized during pre-training without the use of labelled data. Each of the pretext objectives is specifically tailored for the final downstream tasks. We conduct several ablation experiments that show the importance of each degradation for a specific domain. Exhaustive experimentation shows that our method does not have the limitations of previous state-of-the-art methods based on contrastive losses, while at the same time requiring substantially fewer data samples to converge. Finally, we demonstrate that our method significantly surpasses the state of the art in existing supervised and self-supervised settings in handwritten and scene text recognition and document image enhancement. Our code and trained models will be made publicly available upon acceptance.

* Preprint


OCR-IDL: OCR Annotations for Industry Document Library Dataset

Feb 25, 2022

Ali Furkan Biten, Rubèn Tito, Lluis Gomez, Ernest Valveny, Dimosthenis Karatzas


Abstract: Pretraining has proven successful in Document Intelligence tasks, where a deluge of documents is used to pretrain the models, which are only later finetuned on downstream tasks. One of the problems of the pretraining approaches is the inconsistent usage of pretraining data with different OCR engines, leading to incomparable results between models. In other words, it is not obvious whether the performance gain comes from differences in the amount of data and the OCR engines used or from the proposed models. To remedy the problem, we make public the OCR annotations for IDL documents obtained with a commercial OCR engine, given its superior performance over open source OCR models. The contributed dataset (OCR-IDL) has an estimated monetary value of over 20K US$. It is our hope that OCR-IDL can be a starting point for future works on Document Intelligence. All of our data and its collection process with the annotations can be found at https://github.com/furkanbiten/idl_data.


LaTr: Layout-Aware Transformer for Scene-Text VQA

Dec 24, 2021

Ali Furkan Biten, Ron Litman, Yusheng Xie, Srikar Appalaraju, R. Manmatha


Abstract: We propose a novel multimodal architecture for Scene Text Visual Question Answering (STVQA), named Layout-Aware Transformer (LaTr). The task of STVQA requires models to reason over different modalities. Thus, we first investigate the impact of each modality, and reveal the importance of the language module, especially when enriched with layout information. Accounting for this, we propose a single objective pre-training scheme that requires only text and spatial cues. We show that applying this pre-training scheme on scanned documents has certain advantages over using natural images, despite the domain gap. Scanned documents are easy to procure, text-dense and have a variety of layouts, helping the model learn various spatial cues (e.g. left-of, below, etc.) by tying together language and layout information. Compared to existing approaches, our method performs vocabulary-free decoding and, as shown, generalizes well beyond the training vocabulary. We further demonstrate that LaTr improves robustness towards OCR errors, a common reason for failure cases in STVQA. In addition, by leveraging a vision transformer, we eliminate the need for an external object detector. LaTr outperforms state-of-the-art STVQA methods on multiple datasets. In particular, +7.6% on TextVQA, +10.8% on ST-VQA and +4.0% on OCR-VQA (all absolute accuracy numbers).


Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching

Oct 06, 2021

Ali Furkan Biten, Andres Mafla, Lluis Gomez, Dimosthenis Karatzas


Abstract: The task of image-text matching aims to map representations from different modalities into a common joint visual-textual embedding. However, the most widely used datasets for this task, MSCOCO and Flickr30K, are actually image captioning datasets that offer a very limited set of relationships between images and sentences in their ground-truth annotations. This limited ground truth information forces us to use evaluation metrics based on binary relevance: given a sentence query we consider only one image as relevant. However, many other relevant images or captions may be present in the dataset. In this work, we propose two metrics that evaluate the degree of semantic relevance of retrieved items, independently of their annotated binary relevance. Additionally, we incorporate a novel strategy that uses an image captioning metric, CIDEr, to define a Semantic Adaptive Margin (SAM) to be optimized in a standard triplet loss. By incorporating our formulation into existing models, a large improvement is obtained in scenarios where available training data is limited. We also demonstrate that the performance on the annotated image-caption pairs is maintained while improving on other non-annotated relevant items when employing the full training set. Code with our metrics and adaptive margin formulation will be made public.

* Accepted to WACV 2022
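
The abstract above says that a captioning metric (CIDEr) defines an adaptive margin inside a standard triplet loss, without giving the formula. The sketch below shows one way such a margin could plug into a hinge-style triplet loss. The scaling rule in `adaptive_margin`, the `scale` parameter, and the assumption that CIDEr similarities are precomputed and passed in as plain floats are all hypothetical and should not be read as the paper's formulation.

```python
def adaptive_margin(cider_pos, cider_neg, scale=0.5):
    """Hypothetical margin that grows with the semantic gap.

    cider_pos / cider_neg: precomputed caption-metric similarities of the
    positive and negative items to the query (assumed inputs).
    A negative that is semantically close to the positive gets a small
    margin; a clearly unrelated negative gets a larger one.
    """
    return scale * max(0.0, cider_pos - cider_neg)


def triplet_loss(score_pos, score_neg, margin):
    """Standard hinge triplet loss on embedding similarity scores."""
    return max(0.0, margin + score_neg - score_pos)


if __name__ == "__main__":
    margin = adaptive_margin(cider_pos=1.0, cider_neg=0.5)
    # The model scores both items equally, so the loss equals the margin.
    print(triplet_loss(score_pos=0.5, score_neg=0.5, margin=margin))
```

Compared with a fixed margin, this kind of rule penalizes the model less for ranking a semantically related but non-annotated item close to the annotated one.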


Let there be a clock on the beach: Reducing Object Hallucination in Image Captioning

Oct 04, 2021

Ali Furkan Biten, Lluis Gomez, Dimosthenis Karatzas


Abstract: Explaining an image with missing or non-existent objects is known as object bias (hallucination) in image captioning. This behaviour is quite common in state-of-the-art captioning models and is not desirable to humans. To decrease object hallucination in captioning, we propose three simple yet efficient training augmentation methods for sentences, which require no new training data or increase in the model size. Through extensive analysis, we show that the proposed methods can significantly diminish our models' object bias on hallucination metrics. Moreover, we experimentally demonstrate that our methods decrease the dependency on the visual features. All of our code, configuration files and model weights will be made public.

* Accepted to WACV 2022
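
The abstract above refers to hallucination metrics without naming them. As a rough illustration of what an object-level measure can look like, the sketch below computes the fraction of objects mentioned in a caption that are absent from the image's annotated objects. The function name `hallucination_rate`, the exact-match lookup against a fixed object vocabulary and the example data are assumptions for illustration only.

```python
def hallucination_rate(caption_tokens, image_objects, object_vocab):
    """Fraction of mentioned objects that are not present in the image.

    caption_tokens: list of str from the generated caption
    image_objects:  iterable of object names annotated for the image
    object_vocab:   set of words that count as object mentions
    Returns 0.0 when the caption mentions no known object.
    """
    mentioned = {tok for tok in caption_tokens if tok in object_vocab}
    if not mentioned:
        return 0.0
    hallucinated = mentioned - set(image_objects)
    return len(hallucinated) / len(mentioned)


if __name__ == "__main__":
    caption = "a clock on the beach".split()
    vocab = {"clock", "beach", "dog"}
    # Only the beach is annotated, so "clock" counts as hallucinated.
    print(hallucination_rate(caption, {"beach"}, vocab))
```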


Localizing Infinity-shaped fishes: Sketch-guided object localization in the wild

Sep 24, 2021

Pau Riba, Sounak Dey, Ali Furkan Biten, Josep Llados


Abstract: This work investigates the problem of sketch-guided object localization (SGOL), where human sketches are used as queries to conduct object localization in natural images. In this cross-modal setting, we first contribute a tough-to-beat baseline that, without any specific SGOL training, is able to outperform previous works on a fixed set of classes. The baseline is useful to analyze the performance of SGOL approaches based on available simple yet powerful methods. We advance prior art by proposing a sketch-conditioned DETR (DEtection TRansformer) architecture which avoids a hard classification and alleviates the domain gap between sketches and images to localize object instances. Although the main goal of SGOL is focused on object detection, we explored its natural extension to sketch-guided instance segmentation. This novel task allows us to move towards identifying the objects at pixel level, which is of key importance in several applications. We experimentally demonstrate that our model and its variants significantly advance over previous state-of-the-art results. All training and testing code of our model will be released to facilitate future research (https://github.com/priba/sgol_wild).

* Under Review


One-shot Compositional Data Generation for Low Resource Handwritten Text Recognition

May 11, 2021

Mohamed Ali Souibgui, Ali Furkan Biten, Sounak Dey, Alicia Fornés, Yousri Kessentini, Lluis Gomez, Dimosthenis Karatzas, Josep Lladós


Abstract: Low resource Handwritten Text Recognition (HTR) is a hard problem due to the scarcity of annotated data and the very limited linguistic information (dictionaries and language models). This is the case, for example, for historical ciphered manuscripts, which are usually written with invented alphabets to hide the content. Thus, in this paper we address this problem through a data generation technique based on Bayesian Program Learning (BPL). Contrary to traditional generation approaches, which require a huge amount of annotated images, our method is able to generate human-like handwriting using only one sample of each symbol from the desired alphabet. After generating symbols, we create synthetic lines to train state-of-the-art HTR architectures in a segmentation-free fashion. Quantitative and qualitative analyses were carried out and confirm the effectiveness of the proposed method, achieving competitive results compared to the usage of real annotated data.
