Diego Ortego

Large Language Models Meet Extreme Multi-label Classification: Scaling and Multi-modal Framework

Nov 17, 2025

Diego Ortego, Marlon Rodríguez, Mario Almagro, Kunal Dahiya, David Jiménez, Juan C. SanMiguel

Abstract: Foundation models have revolutionized artificial intelligence across numerous domains, yet their transformative potential remains largely untapped in Extreme Multi-label Classification (XMC). Queries in XMC are associated with relevant labels from extremely large label spaces, where it is critical to strike a balance between efficiency and performance. Therefore, many recent approaches efficiently pose XMC as a maximum inner product search between embeddings learned from small encoder-only transformer architectures. In this paper, we address two important aspects in XMC: how to effectively harness larger decoder-only models, and how to exploit visual information while maintaining computational efficiency. We demonstrate that both play a critical role in XMC separately and can be combined for improved performance. We show that a decoder with a few billion parameters can deliver substantial improvements while keeping computational overhead manageable. Furthermore, our Vision-enhanced eXtreme Multi-label Learning framework (ViXML) efficiently integrates foundation vision models by pooling a single embedding per image. This limits computational growth while unlocking multi-modal capabilities. Remarkably, ViXML with small encoders outperforms text-only decoders in most cases, showing that an image is worth billions of parameters. Finally, we present an extension of existing text-only datasets to exploit visual metadata and make them available for future benchmarking. Comprehensive experiments across four public text-only datasets and their corresponding image-enhanced versions validate our proposals' effectiveness, surpassing the previous state of the art by up to +8.21% in P@1 on the largest dataset. ViXML's code is available at https://github.com/DiegoOrtego/vixml.

* To appear at AAAI 2026
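
The maximum inner product search formulation mentioned in this abstract can be illustrated with a minimal sketch. The toy embeddings, the function names `top_k_labels` and `precision_at_k`, and the brute-force scoring are assumptions made for illustration; this is not the ViXML implementation.

```python
from typing import List, Sequence, Set


def inner_product(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_labels(query: Sequence[float],
                 label_embeddings: Sequence[Sequence[float]],
                 k: int) -> List[int]:
    """Return the indices of the k labels with the largest inner product.

    Brute force over all labels, for illustration only (assumption: a
    large-scale system would likely use an approximate search index).
    """
    scores = [inner_product(query, label) for label in label_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def precision_at_k(predicted: Sequence[int], relevant: Set[int], k: int) -> float:
    """Fraction of the top-k predicted labels that are relevant (P@k)."""
    return sum(1 for i in predicted[:k] if i in relevant) / k


# Hypothetical 2-D embeddings for three labels and one query.
labels = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query = [0.9, 0.1]
predicted = top_k_labels(query, labels, k=2)
```

With these toy values, label 0 scores highest, so P@1 is 1.0 if label 0 is the only relevant label.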

Fine-grained auxiliary learning for real-world product recommendation

Oct 06, 2025

Mario Almagro, Diego Ortego, David Jimenez

Abstract: Product recommendation is the task of recovering the closest items to a given query within a large product corpus. Generally, one can determine if top-ranked products are related to the query by applying a similarity threshold; exceeding it deems the product relevant, otherwise manual revision is required. Despite being a well-known problem, the integration of these models in real-world systems is often overlooked. In particular, production systems have strong coverage requirements, i.e., a high proportion of recommendations must be automated. In this paper we propose ALC, an Auxiliary Learning strategy that boosts Coverage through learning fine-grained embeddings. Concretely, we introduce two training objectives that leverage the hardest negatives in the batch to build discriminative training signals between positives and negatives. We validate ALC using three extreme multi-label classification approaches on two product recommendation datasets, LF-AmazonTitles-131K and Tech and Durables (proprietary), demonstrating state-of-the-art coverage rates when combined with a recent threshold-consistent margin loss.

* SEPLN 2025
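
As a rough illustration of the two ideas in this abstract, the sketch below computes a hinge loss against the hardest in-batch negative and the coverage obtained with a similarity threshold. The abstract does not specify ALC's two objectives, so the hinge formulation, the margin value, and both function names are assumptions rather than the authors' method.

```python
from typing import Sequence, Set


def hardest_negative_hinge(similarities: Sequence[float],
                           positives: Set[int],
                           margin: float = 0.2) -> float:
    """Hinge loss between the weakest positive and the hardest in-batch negative.

    `similarities[j]` is the query's similarity to item j in the batch.
    The loss is zero once every positive beats every negative by `margin`.
    """
    negatives = [j for j in range(len(similarities)) if j not in positives]
    hardest_neg = max(similarities[j] for j in negatives)
    weakest_pos = min(similarities[j] for j in positives)
    return max(0.0, margin + hardest_neg - weakest_pos)


def coverage(top_scores: Sequence[float], threshold: float) -> float:
    """Share of queries whose top-ranked score exceeds the threshold,
    i.e. recommendations that would not need manual revision."""
    return sum(1 for s in top_scores if s > threshold) / len(top_scores)


# Hypothetical similarities of one query to four batch items; item 0 is the positive.
loss = hardest_negative_hinge([0.9, 0.8, 0.3, 0.85], positives={0})
# Hypothetical top-1 scores for four queries.
automated = coverage([0.9, 0.4, 0.7, 0.6], threshold=0.5)
```

Here the hardest negative (similarity 0.85) sits within the margin of the positive, so the loss is non-zero, and three of the four queries clear the threshold.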

Prototypical Extreme Multi-label Classification with a Dynamic Margin Loss

Oct 27, 2024

Kunal Dahiya, Diego Ortego, David Jiménez

Abstract: Extreme Multi-label Classification (XMC) methods predict relevant labels for a given query in an extremely large label space. Recent works in XMC address this problem using deep encoders that project text descriptions to an embedding space suitable for recovering the closest labels. However, learning deep models can be computationally expensive in large output spaces, resulting in a trade-off between high-performing brute-force approaches and efficient solutions. In this paper, we propose PRIME, an XMC method that employs a novel prototypical contrastive learning technique to reconcile efficiency and performance, surpassing brute-force approaches. We frame XMC as a data-to-prototype prediction task where label prototypes aggregate information from related queries. More precisely, we use a shallow transformer encoder, which we call the Label Prototype Network, that enriches label representations by aggregating text-based embeddings, label centroids and learnable free vectors. We jointly train a deep encoder and the Label Prototype Network using an adaptive triplet loss objective that better adapts to the high granularity and ambiguity of extreme label spaces. PRIME achieves state-of-the-art results in several public benchmarks of different sizes and domains, while keeping the model efficient.
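
Two of the ingredients named in this abstract, label centroids and a triplet objective, can be sketched as follows. This is a simplified illustration under stated assumptions: the centroid is a plain mean of query embeddings, the triplet loss uses a fixed margin and squared Euclidean distance, and neither the Label Prototype Network nor PRIME's adaptive margin is reproduced.

```python
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Mean of the embeddings of all queries associated with one label."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(query: Sequence[float],
                 positive_proto: Sequence[float],
                 negative_proto: Sequence[float],
                 margin: float) -> float:
    """Standard fixed-margin triplet loss between a query and two prototypes."""
    gap = squared_distance(query, positive_proto) - squared_distance(query, negative_proto)
    return max(0.0, gap + margin)


# Hypothetical query embeddings grouped by label.
proto_a = centroid([[1.0, 0.0], [0.8, 0.2]])
proto_b = centroid([[0.0, 1.0]])
loss = triplet_loss([1.0, 0.0], proto_a, proto_b, margin=0.5)
```

In this toy case the query is already much closer to its own label's centroid than to the other one, so the loss is zero.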

LEA: Improving Sentence Similarity Robustness to Typos Using Lexical Attention Bias

Jul 06, 2023

Mario Almagro, Emilio Almazán, Diego Ortego, David Jiménez

Abstract: Textual noise, such as typos or abbreviations, is a well-known issue that penalizes vanilla Transformers for most downstream tasks. We show that this is also the case for sentence similarity, a fundamental task in multiple domains, e.g. matching, retrieval or paraphrasing. Sentence similarity can be approached using cross-encoders, where the two sentences are concatenated in the input, allowing the model to exploit the inter-relations between them. Previous works addressing the noise issue mainly rely on data augmentation strategies, showing improved robustness when dealing with corrupted samples that are similar to the ones used for training. However, all these methods still suffer from the token distribution shift induced by typos. In this work, we propose to tackle textual noise by equipping cross-encoders with a novel LExical-aware Attention module (LEA) that incorporates lexical similarities between words in both sentences. By using raw text similarities, our approach avoids the tokenization shift problem, obtaining improved robustness. We demonstrate that the attention bias introduced by LEA helps cross-encoders to tackle complex scenarios with textual noise, especially in domains with short-text descriptions and limited context. Experiments using three popular Transformer encoders on five e-commerce datasets for product matching show that LEA consistently boosts performance in the presence of noise, while remaining competitive on the original (clean) splits. We also evaluate our approach on two datasets for textual entailment and paraphrasing, showing that LEA is robust to typos in domains with longer sentences and more natural context. Additionally, we thoroughly analyze several design choices in our approach, providing insights about the impact of the decisions made and fostering future research in cross-encoders dealing with typos.

* KDD'23 conference (main research track). (*) These authors contributed equally
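
The idea of biasing attention with raw-text similarity can be sketched as below. The similarity metric (`difflib.SequenceMatcher`), the `scale` parameter, and the way the bias is added to the logits are all assumptions chosen for illustration; the abstract does not specify how LEA computes or injects its bias.

```python
import math
from difflib import SequenceMatcher
from typing import List, Sequence


def lexical_bias(words_a: Sequence[str], words_b: Sequence[str]) -> List[List[float]]:
    """Character-level similarity in [0, 1] for every cross-sentence word pair."""
    return [[SequenceMatcher(None, a, b).ratio() for b in words_b] for a in words_a]


def softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(logits: Sequence[Sequence[float]],
                     bias: Sequence[Sequence[float]],
                     scale: float = 1.0) -> List[List[float]]:
    """Add the scaled lexical bias to attention logits, then normalise each row."""
    return [softmax([l + scale * b for l, b in zip(logit_row, bias_row)])
            for logit_row, bias_row in zip(logits, bias)]


sentence_a = ["iphone", "case"]
sentence_b = ["iphnoe", "cover"]  # "iphnoe" is a typo of "iphone"
bias = lexical_bias(sentence_a, sentence_b)
uniform_logits = [[0.0, 0.0], [0.0, 0.0]]  # stand-in for learned attention scores
attention = biased_attention(uniform_logits, bias, scale=4.0)
```

Because the similarity is computed on raw characters, the misspelled word still receives more attention from "iphone" than the unrelated word does, regardless of how a tokenizer would split it.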

Block-SCL: Blocking Matters for Supervised Contrastive Learning in Product Matching

Jul 05, 2022

Mario Almagro, David Jiménez, Diego Ortego, Emilio Almazán, Eva Martínez

Abstract: Product matching is a fundamental step for the global understanding of consumer behavior in e-commerce. In practice, product matching refers to the task of deciding if two product offers from different data sources (e.g. retailers) represent the same product. Standard pipelines use a prior stage called blocking, where for a given product offer a set of potential matching candidates is retrieved based on similar characteristics (e.g. same brand, category, flavor, etc.). From these similar product candidates, those that are not a match can be considered hard negatives. We present Block-SCL, a strategy that uses the blocking output to make the most of Supervised Contrastive Learning (SCL). Concretely, Block-SCL builds enriched batches using the hard-negative samples obtained in the blocking stage. These batches provide a strong training signal, leading the model to learn more meaningful sentence embeddings for product matching. Experimental results on several public datasets demonstrate that Block-SCL achieves state-of-the-art results despite only using short product titles as input, no data augmentation, and a lighter transformer backbone than competing methods.

* 7 pages, 2 figures, e-commerce, conference
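
To see why hard negatives from blocking give a stronger training signal, the sketch below evaluates a supervised contrastive loss for a single anchor on two toy batches. The similarity values, the temperature, and the function name are hypothetical; the batch construction used by Block-SCL itself is not reproduced here.

```python
import math
from typing import Sequence


def supcon_anchor_loss(similarities: Sequence[float],
                       is_positive: Sequence[bool],
                       temperature: float = 0.1) -> float:
    """Supervised contrastive loss for one anchor.

    `similarities[j]` is the anchor's cosine similarity to the j-th other
    item in the batch; `is_positive[j]` marks offers of the same product.
    """
    logits = [s / temperature for s in similarities]
    peak = max(logits)
    # Log of the softmax denominator over all other items (log-sum-exp).
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    positive_terms = [l - log_denominator
                      for l, positive in zip(logits, is_positive) if positive]
    return -sum(positive_terms) / len(positive_terms)


# One matching offer plus two negatives: random (easy) vs. taken from blocking (hard).
easy_batch = supcon_anchor_loss([0.9, 0.1, 0.0], [True, False, False])
hard_batch = supcon_anchor_loss([0.9, 0.8, 0.7], [True, False, False])
```

The loss on the batch with easy negatives is close to zero, while the batch with blocking-style negatives still yields a sizeable loss and therefore a gradient to learn from.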

Domain Randomization for Object Counting

Feb 17, 2022

Enric Moreu, Kevin McGuinness, Diego Ortego, Noel E. O'Connor

Abstract: Recently, the use of synthetic datasets based on game engines has been shown to improve the performance of several tasks in computer vision. However, these datasets are typically only appropriate for the specific domains depicted in computer games, such as urban scenes involving vehicles and people. In this paper, we present an approach to generate synthetic datasets for object counting for any domain without the need for photo-realistic techniques manually generated by expensive teams of 3D artists. We introduce a domain randomization approach for object counting based on synthetic datasets that are quick and inexpensive to generate. We deliberately avoid photorealism and drastically increase the variability of the dataset, producing images with random textures and 3D transformations, which improves generalization. Experiments show that our method facilitates good performance on various real-world object counting datasets for multiple domains: people, vehicles, penguins, and fruit. The source code is available at: https://github.com/enric1994/dr4oc

How Important is Importance Sampling for Deep Budgeted Training?

Oct 27, 2021

Eric Arazo, Diego Ortego, Paul Albert, Noel E. O'Connor, Kevin McGuinness

Abstract: Long iterative training processes for Deep Neural Networks (DNNs) are commonly required to achieve state-of-the-art performance in many computer vision tasks. Importance sampling approaches might play a key role in budgeted training regimes, i.e. when limiting the number of training iterations. These approaches aim at dynamically estimating the importance of each sample to focus on the most relevant and speed up convergence. This work explores this paradigm and how a budget constraint interacts with importance sampling approaches and data augmentation techniques. We show that under budget restrictions, importance sampling approaches do not provide a consistent improvement over uniform sampling. We suggest that, given a specific budget, the best course of action is to disregard the importance and introduce adequate data augmentation; e.g. when reducing the budget to 30% in CIFAR-10/100, RICAP data augmentation maintains accuracy, while importance sampling does not. We conclude from our work that DNNs under budget restrictions benefit greatly from variety in the training set and that finding the right samples to train on is not the most effective strategy when balancing high performance with low computational requirements. Source code available at https://git.io/JKHa3 .

* British Machine Vision Conference (BMVC) 2021, oral presentation
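
For readers unfamiliar with the setup, the sketch below shows one generic form of importance sampling (selection probability proportional to the current per-sample loss, with weights that keep the loss estimate unbiased) and how a budget shortens training. It is a textbook-style illustration under these assumptions, not one of the specific methods compared in the paper; all function names are hypothetical.

```python
from typing import List, Sequence


def sampling_probabilities(losses: Sequence[float]) -> List[float]:
    """Probability of picking each sample, proportional to its current loss."""
    total = sum(losses)
    return [loss / total for loss in losses]


def importance_weights(probabilities: Sequence[float]) -> List[float]:
    """Weights 1 / (N * p_i) that keep the expected loss equal to the uniform mean."""
    n = len(probabilities)
    return [1.0 / (n * p) for p in probabilities]


def budgeted_iterations(full_iterations: int, budget: float) -> int:
    """Number of training iterations allowed under a fractional budget."""
    return round(full_iterations * budget)


losses = [1.0, 3.0]  # hypothetical per-sample losses
probs = sampling_probabilities(losses)
weights = importance_weights(probs)
# Expected weighted loss under importance sampling equals the plain mean loss.
expected = sum(p * w * l for p, w, l in zip(probs, weights, losses))
iterations = budgeted_iterations(1000, 0.3)
```

Uniform sampling corresponds to every probability being 1/N, in which case all weights equal 1.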

Addressing out-of-distribution label noise in webly-labelled data

Oct 26, 2021

Paul Albert, Diego Ortego, Eric Arazo, Noel O'Connor, Kevin McGuinness

Abstract: A recurring focus of the deep learning community is reducing the labeling effort. Data gathering and annotation using a search engine is a simple alternative to generating a fully human-annotated and human-gathered dataset. Although web crawling is very time efficient, some of the retrieved images are unavoidably noisy, i.e. incorrectly labeled. Designing robust algorithms for training on noisy data gathered from the web is an important research direction that would make building datasets easier. In this paper we conduct a study to understand the type of label noise to expect when building a dataset using a search engine. We review the current limitations of state-of-the-art methods for dealing with noisy labels for image classification tasks in the case of web noise distribution. We propose a simple solution to bridge the gap with a fully clean dataset using Dynamic Softening of Out-of-distribution Samples (DSOS), which we design on corrupted versions of the CIFAR-100 dataset, and compare against state-of-the-art algorithms on the web noise perturbed MiniImageNet and Stanford datasets and on real label noise datasets: WebVision 1.0 and Clothing1M. Our work is fully reproducible: https://git.io/JKGcj

* Accepted at WACV 2022

Multi-Objective Interpolation Training for Robustness to Label Noise

Dec 08, 2020

Diego Ortego, Eric Arazo, Paul Albert, Noel E. O'Connor, Kevin McGuinness

Abstract: Deep neural networks trained with standard cross-entropy loss memorize noisy labels, which degrades their performance. Most research to mitigate this memorization proposes new robust classification loss functions. Conversely, we explore the behavior of supervised contrastive learning under label noise to understand how it can improve image classification in these scenarios. In particular, we propose a Multi-Objective Interpolation Training (MOIT) approach that jointly exploits contrastive learning and classification. We show that standard contrastive learning degrades in the presence of label noise and propose an interpolation training strategy to mitigate this behavior. We further propose a novel label noise detection method that exploits the robust feature representations learned via contrastive learning to estimate per-sample soft-labels whose disagreements with the original labels accurately identify noisy samples. This detection allows treating noisy samples as unlabeled and training a classifier in a semi-supervised manner. We further propose MOIT+, a refinement of MOIT by fine-tuning on detected clean samples. Hyperparameter and ablation studies verify the key components of our method. Experiments on synthetic and real-world noise benchmarks demonstrate that MOIT/MOIT+ achieves state-of-the-art results. Code is available at https://git.io/JI40X.
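
The detection idea, estimating a soft label from the feature space and flagging samples whose given label disagrees with it, can be sketched as follows. Using a k-nearest-neighbour vote with cosine similarity is an assumption made for this illustration, as are the toy features and function names; the abstract does not state how MOIT estimates its soft labels.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def knn_soft_label(index: int, features: Sequence[Sequence[float]],
                   labels: Sequence[int], num_classes: int, k: int) -> List[float]:
    """Soft label for one sample: class frequencies among its k nearest neighbours."""
    others = [j for j in range(len(features)) if j != index]
    nearest = sorted(others, key=lambda j: cosine(features[index], features[j]),
                     reverse=True)[:k]
    votes = [0.0] * num_classes
    for j in nearest:
        votes[labels[j]] += 1.0 / k
    return votes


def detect_noisy(features: Sequence[Sequence[float]], labels: Sequence[int],
                 num_classes: int, k: int) -> List[int]:
    """Indices whose given label disagrees with the neighbour-based soft label."""
    noisy = []
    for i in range(len(features)):
        soft = knn_soft_label(i, features, labels, num_classes, k)
        predicted = max(range(num_classes), key=lambda c: soft[c])
        if predicted != labels[i]:
            noisy.append(i)
    return noisy


# Two hypothetical feature clusters; sample 3 lies in the first cluster but is labelled 1.
features = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.92, 0.08],
            [0.0, 1.0], [0.1, 0.9], [0.05, 0.95], [0.08, 0.92]]
labels = [0, 0, 0, 1, 1, 1, 1, 1]
flagged = detect_noisy(features, labels, num_classes=2, k=3)
```

Flagged samples could then be treated as unlabeled in a semi-supervised stage, as the abstract describes.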

Unsupervised Contrastive Learning of Sound Event Representations

Nov 15, 2020

Eduardo Fonseca, Diego Ortego, Kevin McGuinness, Noel E. O'Connor, Xavier Serra

Abstract: Self-supervised representation learning can mitigate the limitations in recognition tasks with little manually labeled data but abundant unlabeled data, a common scenario in sound event research. In this work, we explore unsupervised contrastive learning as a way to learn sound event representations. To this end, we propose to use the pretext task of contrasting differently augmented views of sound events. The views are computed primarily via mixing of training examples with unrelated backgrounds, followed by other data augmentations. We analyze the main components of our method via ablation experiments. We evaluate the learned representations using linear evaluation, and in two in-domain downstream sound event classification tasks, namely, using limited manually labeled data, and using noisy labeled data. Our results suggest that unsupervised contrastive pre-training can mitigate the impact of data scarcity and increase robustness against noisy labels, outperforming supervised baselines.

* A 4-page version is submitted to ICASSP 2021
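
The view-generation step described in this abstract, mixing a training example with unrelated backgrounds, might look roughly like the sketch below. The linear mixing rule, the `max_alpha` parameter, and the representation of a clip as a flat list of values are assumptions for illustration; the paper's exact mixing operation and subsequent augmentations are not reproduced.

```python
import random
from typing import List, Sequence


def mix_with_background(clip: Sequence[float],
                        background: Sequence[float],
                        alpha: float) -> List[float]:
    """Blend a clip with a background; alpha is the background's share."""
    return [(1.0 - alpha) * x + alpha * b for x, b in zip(clip, background)]


def two_views(clip: Sequence[float],
              backgrounds: Sequence[Sequence[float]],
              max_alpha: float,
              rng: random.Random) -> List[List[float]]:
    """Create two differently augmented views of the same clip.

    Each view mixes the clip with a randomly chosen background at a random
    strength; the two views form a positive pair for contrastive learning.
    """
    views = []
    for _ in range(2):
        background = rng.choice(backgrounds)
        alpha = rng.uniform(0.0, max_alpha)
        views.append(mix_with_background(clip, background, alpha))
    return views


rng = random.Random(0)  # fixed seed so the example is repeatable
clip = [1.0, 2.0, 3.0]  # hypothetical feature values of one sound event
backgrounds = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]]  # hypothetical unrelated clips
view_a, view_b = two_views(clip, backgrounds, max_alpha=0.2, rng=rng)
```

Keeping `max_alpha` small leaves the original sound event dominant in both views, so the pair still shares the same underlying content.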
