Josep Llados

NoTeS-Bank: Benchmarking Neural Transcription and Search for Scientific Notes Understanding

Apr 12, 2025

Aniket Pal, Sanket Biswas, Alloy Das, Ayush Lodh, Priyanka Banerjee, Soumitri Chattopadhyay, Dimosthenis Karatzas, Josep Llados, C. V. Jawahar

Abstract: Understanding and reasoning over academic handwritten notes remains a challenge in document AI, particularly for mathematical equations, diagrams, and scientific notations. Existing visual question answering (VQA) benchmarks focus on printed or structured handwritten text, limiting generalization to real-world note-taking. To address this, we introduce NoTeS-Bank, an evaluation benchmark for Neural Transcription and Search in note-based question answering. NoTeS-Bank comprises complex notes across multiple domains, requiring models to process unstructured and multimodal content. The benchmark defines two tasks: (1) Evidence-Based VQA, where models retrieve localized answers with bounding-box evidence, and (2) Open-Domain VQA, where models classify the domain before retrieving relevant documents and answers. Unlike classical Document VQA datasets relying on optical character recognition (OCR) and structured data, NoTeS-Bank demands vision-language fusion, retrieval, and multimodal reasoning. We benchmark state-of-the-art Vision-Language Models (VLMs) and retrieval frameworks, exposing limitations in structured transcription and reasoning. NoTeS-Bank provides a rigorous evaluation with NDCG@5, MRR, Recall@K, IoU, and ANLS, establishing a new standard for visual document understanding and reasoning.
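
Two of the metrics named in the abstract above, IoU for bounding-box evidence and MRR for retrieval, can be illustrated with a minimal sketch. It uses the standard textbook definitions; the function names, the `(x1, y1, x2, y2)` box layout, and the input format are assumptions made for illustration and are not taken from the benchmark's own evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2 (assumed layout).
    """
    # Corners of the overlapping rectangle, if any.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_reciprocal_rank(ranked_lists, relevant_items):
    """Mean of 1/rank of the relevant item in each ranked list.

    A query whose relevant item is not retrieved contributes 0.
    """
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_items):
        if relevant in ranking:
            total += 1.0 / (ranking.index(relevant) + 1)
    return total / len(ranked_lists) if ranked_lists else 0.0
```

For example, two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, so their IoU is 1 / (4 + 4 - 1) = 1/7.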

NeurIPS 2023 Competition: Privacy Preserving Federated Learning Document VQA

Nov 06, 2024

Marlon Tobaben, Mohamed Ali Souibgui, Rubèn Tito, Khanh Nguyen, Raouf Kerkouche, Kangsoo Jung, Joonas Jälkö, Lei Kang, Andrey Barsky, Vincent Poulain d'Andecy (+17 more)

Abstract: The Privacy Preserving Federated Learning Document VQA (PFL-DocVQA) competition challenged the community to develop provably private and communication-efficient solutions in a federated setting for a real-life use case: invoice processing. The competition introduced a dataset of real invoice documents, along with associated questions and answers requiring information extraction and reasoning over the document images. Thereby, it brings together researchers and expertise from the document analysis, privacy, and federated learning communities. Participants fine-tuned a pre-trained, state-of-the-art Document Visual Question Answering model provided by the organizers for this new domain, mimicking a typical federated invoice processing setup. The base model is a multi-modal generative language model, and sensitive information could be exposed through either the visual or textual input modality. Participants proposed elegant solutions to reduce communication costs while maintaining a minimum utility threshold in track 1 and to protect all information from each document provider using differential privacy in track 2. The competition served as a new testbed for developing and testing private federated learning methods, simultaneously raising awareness about privacy within the document image analysis and recognition community. Ultimately, the competition analysis provides best practices and recommendations for successfully running privacy-focused federated learning challenges in the future.

* 27 pages, 6 figures

Localizing Infinity-shaped fishes: Sketch-guided object localization in the wild

Sep 24, 2021

Pau Riba, Sounak Dey, Ali Furkan Biten, Josep Llados

Abstract: This work investigates the problem of sketch-guided object localization (SGOL), where human sketches are used as queries to localize objects in natural images. In this cross-modal setting, we first contribute a tough-to-beat baseline that, without any specific SGOL training, outperforms previous works on a fixed set of classes. The baseline is useful for analyzing the performance of SGOL approaches against simple yet powerful available methods. We advance prior art by proposing a sketch-conditioned DETR (DEtection TRansformer) architecture which avoids a hard classification and alleviates the domain gap between sketches and images to localize object instances. Although the main goal of SGOL is object detection, we explore its natural extension to sketch-guided instance segmentation. This novel task allows moving towards identifying objects at the pixel level, which is of key importance in several applications. We experimentally demonstrate that our model and its variants significantly advance over previous state-of-the-art results. All training and testing code of our model will be released to facilitate future research (https://github.com/priba/sgol_wild).

* Under Review

Doodle to Search: Practical Zero-Shot Sketch-based Image Retrieval

Apr 06, 2019

Sounak Dey, Pau Riba, Anjan Dutta, Josep Llados, Yi-Zhe Song

Abstract: In this paper, we investigate the problem of zero-shot sketch-based image retrieval (ZS-SBIR), where human sketches are used as queries to retrieve photos from unseen categories. We advance prior art by proposing a novel ZS-SBIR scenario that represents a firm step forward in its practical application. The new setting uniquely recognizes two important yet often neglected challenges of practical ZS-SBIR: (i) the large domain gap between amateur sketches and photos, and (ii) the necessity of moving towards large-scale retrieval. We first contribute to the community a novel ZS-SBIR dataset, QuickDraw-Extended, that consists of 330,000 sketches and 204,000 photos spanning 110 categories. Highly abstract amateur human sketches are purposefully sourced to maximize the domain gap, instead of the ones included in existing datasets, which can often be semi-photorealistic. We then formulate a ZS-SBIR framework to jointly model sketches and photos in a common embedding space. A novel strategy to mine the mutual information among domains is specifically engineered to alleviate the domain gap. External semantic knowledge is further embedded to aid semantic transfer. We show that, rather surprisingly, retrieval performance that significantly outperforms the state of the art on existing datasets can already be achieved using a reduced version of our model. We further demonstrate the superior performance of our full model by comparing it with a number of alternatives on the newly proposed dataset. The new dataset, plus all training and testing code of our model, will be publicly released to facilitate future research.

SigNet: Convolutional Siamese Network for Writer Independent Offline Signature Verification

Sep 30, 2017

Sounak Dey, Anjan Dutta, J. Ignacio Toledo, Suman K. Ghosh, Josep Llados, Umapada Pal

Abstract: Offline signature verification is one of the most challenging tasks in biometrics and document forensics. Unlike other verification problems, it needs to model minute but critical differences between genuine and forged signatures, because a skilled forgery may often resemble the real signature with only small deformations. This verification task is even harder in writer-independent scenarios, which are undeniably crucial for realistic cases. In this paper, we model an offline writer-independent signature verification task with a convolutional Siamese network. Siamese networks are twin networks with shared weights, which can be trained to learn a feature space where similar observations are placed in proximity. This is achieved by exposing the network to pairs of similar and dissimilar observations and minimizing the Euclidean distance between similar pairs while simultaneously maximizing it between dissimilar pairs. Experiments conducted on cross-domain datasets emphasize the capability of our network to model forgery in different languages (scripts) and handwriting styles. Moreover, our Siamese network, named SigNet, exceeds the state-of-the-art results on most of the benchmark signature datasets, which paves the way for further research in this direction.
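
The pairwise objective described in the abstract above (pull similar pairs together, push dissimilar pairs apart in Euclidean distance) is commonly written as a margin-based contrastive loss. The sketch below shows that common form on plain Python lists. The margin value, the function names, and the exact loss formula are assumptions for illustration; the abstract does not state the precise formulation used in SigNet.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, similar, margin=1.0):
    """Margin-based contrastive loss for one pair of embeddings.

    similar=True  -> penalize any distance (pulls the pair together).
    similar=False -> penalize only distances below the margin
                     (pushes the pair apart until it is `margin` away).
    The margin of 1.0 is an illustrative default, not a value from the paper.
    """
    d = euclidean_distance(u, v)
    if similar:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

In a Siamese setup, `u` and `v` would be the outputs of the same network (shared weights) applied to the two signature images of a pair; a dissimilar pair that is already farther apart than the margin contributes no loss.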

Evaluation of the Effect of Improper Segmentation on Word Spotting

Apr 21, 2016

Sounak Dey, Anguelos Nicolaou, Josep Llados, Umapada Pal

Abstract: Word spotting is an important recognition task in historical document analysis. In most cases, methods are developed and evaluated assuming perfect word segmentations. In this paper, we propose an experimental framework to quantify the effect that the quality of word segmentation has on the performance achieved by word spotting methods under identical, unbiased conditions. The framework consists of generating systematic distortions of the segmentation and retrieving the original queries from the distorted dataset. We apply the framework to the George Washington and Barcelona Marriage datasets and to several established and state-of-the-art methods. The experiments allow for an estimate of the end-to-end performance of word spotting methods.

Local Binary Pattern for Word Spotting in Handwritten Historical Document

Apr 21, 2016

Sounak Dey, Anguelos Nicolaou, Josep Llados, Umapada Pal

Abstract: Digital libraries store images which can be highly degraded, and to index such images we resort to word spotting as our information retrieval system. Information retrieval for handwritten document images is more challenging due to the difficulties of complex layout analysis, large variations in writing styles, and the degradation or low quality of historical manuscripts. This paper presents a simple, innovative, learning-free method for word spotting in large-scale historical documents, combining Local Binary Pattern (LBP) and spatial sampling. This method offers three advantages: firstly, it operates in a completely learning-free paradigm, which is very different from unsupervised learning methods; secondly, the computational time is significantly low because the LBP features are very fast to compute; and thirdly, the method can be used in scenarios where annotations are not available. Finally, we compare the results of our proposed retrieval method with other methods in the literature.
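
To show why LBP features are cheap to compute and need no training, here is a minimal sketch of the basic 3x3 Local Binary Pattern on a grayscale image stored as a list of lists. It is the textbook operator only: the neighbour ordering, the `>=` comparison, and the function names are assumptions, and the spatial sampling step mentioned in the abstract is not reproduced.

```python
# Clockwise neighbour offsets starting at the top-left pixel (assumed order).
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                     (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img, r, c):
    """8-bit LBP code of the interior pixel at row r, column c.

    Each neighbour that is at least as bright as the centre sets one bit.
    """
    center = img[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(NEIGHBOUR_OFFSETS):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist
```

A word image can then be described by such histograms (one per image region, for instance) and compared to a query with any histogram distance, which involves only comparisons and counting rather than a learned model.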
