Gerardo Aragon Camarasa

Masked Generative Policy for Robotic Control

Dec 09, 2025

Lipeng Zhuang, Shiyu Fan, Florent P. Audonnet, Yingdong Ru, Gerardo Aragon Camarasa, Paul Henderson

Abstract: We present Masked Generative Policy (MGP), a novel framework for visuomotor imitation learning. We represent actions as discrete tokens, and train a conditional masked transformer that generates tokens in parallel and then rapidly refines only low-confidence tokens. We further propose two new sampling paradigms: MGP-Short, which performs parallel masked generation with score-based refinement for Markovian tasks, and MGP-Long, which predicts full trajectories in a single pass and dynamically refines low-confidence action tokens based on new observations. With globally coherent prediction and robust adaptive execution capabilities, MGP-Long enables reliable control on complex and non-Markovian tasks that prior methods struggle with. Extensive evaluations on 150 robotic manipulation tasks spanning the Meta-World and LIBERO benchmarks show that MGP achieves both rapid inference and superior success rates compared to state-of-the-art diffusion and autoregressive policies. Specifically, MGP increases the average success rate by 9% across 150 tasks while cutting per-sequence inference time by up to 35x. It further improves the average success rate by 60% in dynamic and missing-observation environments, and solves two non-Markovian scenarios where other state-of-the-art methods fail.
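
The decoding idea in this abstract (predict all action tokens in parallel, then revisit only the low-confidence ones) can be illustrated with a toy confidence-based masked decoding loop in the style of MaskGIT. This is a minimal sketch, not the authors' implementation: `toy_predictor` is a hypothetical stand-in that returns random distributions, the cosine schedule and greedy argmax are assumptions, and the observation-conditioned refinement described for MGP-Long is not modelled.

```python
import math
import random

MASK = -1  # sentinel for a masked (not yet committed) action token


def toy_predictor(tokens, vocab_size, rng):
    """Hypothetical stand-in for a conditional masked transformer.

    Returns one probability distribution over the vocabulary per position.
    A real model would condition on observations and on unmasked tokens.
    """
    dists = []
    for _ in tokens:
        logits = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        dists.append([e / total for e in exps])
    return dists


def masked_decode(seq_len, vocab_size, steps=4, seed=0):
    """Fill a fully masked sequence over a fixed number of parallel steps."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for step in range(1, steps + 1):
        dists = toy_predictor(tokens, vocab_size, rng)
        proposal = list(tokens)
        # Already committed tokens get infinite confidence so they are kept.
        confidence = [float("inf")] * seq_len
        for i, tok in enumerate(tokens):
            if tok == MASK:
                best = max(range(vocab_size), key=lambda v: dists[i][v])
                proposal[i] = best
                confidence[i] = dists[i][best]
        # Cosine schedule: how many tokens remain masked after this step.
        n_masked = int(seq_len * math.cos(math.pi / 2 * step / steps))
        # Re-mask the lowest-confidence predictions for the next step.
        order = sorted(range(seq_len), key=lambda i: confidence[i])
        for i in order[:n_masked]:
            proposal[i] = MASK
        tokens = proposal
    return tokens


if __name__ == "__main__":
    print(masked_decode(seq_len=8, vocab_size=16))
```

The number of model calls is fixed by `steps` rather than by the sequence length, which is the general reason parallel masked decoding can be faster than token-by-token autoregressive sampling.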

Understanding and Mitigating Human-Labelling Errors in Supervised Contrastive Learning

Mar 10, 2024

Zijun Long, Lipeng Zhuang, George Killick, Richard McCreadie, Gerardo Aragon Camarasa, Paul Henderson

Abstract: Human-annotated vision datasets inevitably contain a fraction of human-mislabelled examples. While the detrimental effects of such mislabelling on supervised learning are well-researched, their influence on Supervised Contrastive Learning (SCL) remains largely unexplored. In this paper, we show that human-labelling errors not only differ significantly from synthetic label errors, but also pose unique challenges in SCL, different to those in traditional supervised learning methods. Specifically, our results indicate they adversely impact the learning process in the roughly 99% of cases where they occur as false positive samples. Existing noise-mitigating methods primarily focus on synthetic label errors and tackle the unrealistic setting of very high synthetic noise rates (40-80%), but they often underperform on common image datasets due to overfitting. To address this issue, we introduce a novel SCL objective with robustness to human-labelling errors, SCL-RHE. SCL-RHE is designed to mitigate the effects of real-world mislabelled examples, typically characterized by much lower noise rates (<5%). We demonstrate that SCL-RHE consistently outperforms state-of-the-art representation learning and noise-mitigating methods across various vision benchmarks, by offering improved resilience against human-labelling errors.

* arXiv admin note: substantial text overlap with arXiv:2311.16481

Elucidating and Overcoming the Challenges of Label Noise in Supervised Contrastive Learning

Nov 25, 2023

Zijun Long, George Killick, Lipeng Zhuang, Richard McCreadie, Gerardo Aragon Camarasa, Paul Henderson

Abstract: Image classification datasets exhibit a non-negligible fraction of mislabeled examples, often due to human error when one class superficially resembles another. This issue poses challenges in supervised contrastive learning (SCL), where the goal is to cluster together data points of the same class in the embedding space while distancing those of disparate classes. While such methods outperform those based on cross-entropy, they are not immune to labeling errors. However, while the detrimental effects of noisy labels in supervised learning are well-researched, their influence on SCL remains largely unexplored. Hence, we analyse the effect of label errors and examine how they disrupt the SCL algorithm's ability to distinguish between positive and negative sample pairs. Our analysis reveals that human labeling errors manifest as easy positive samples in around 99% of cases. We, therefore, propose D-SCL, a novel Debiased Supervised Contrastive Learning objective designed to mitigate the bias introduced by labeling errors. We demonstrate that D-SCL consistently outperforms state-of-the-art techniques for representation learning across diverse vision benchmarks, offering improved robustness to label errors.
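
To make the interaction between label errors and SCL concrete, the sketch below implements the standard supervised contrastive loss (the commonly used SupCon form, not the D-SCL objective proposed in the paper) in plain Python and compares a clean labelling with one flipped label. The embeddings, labels, and temperature are made-up illustrative values.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Standard supervised contrastive loss averaged over valid anchors."""
    z = [normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        sims = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        # Numerically stable log-sum-exp over all other samples.
        peak = max(sims.values())
        log_denom = peak + math.log(sum(math.exp(s - peak) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Two well-separated clusters (illustrative values only).
points = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
          [0.0, 1.0], [0.1, 0.9], [0.05, 0.95]]
clean = [0, 0, 0, 1, 1, 1]
noisy = [0, 0, 0, 0, 1, 1]  # the fourth sample is mislabelled as class 0

if __name__ == "__main__":
    print("clean:", supcon_loss(points, clean))
    print("noisy:", supcon_loss(points, noisy))
```

With the flipped label, the mislabelled sample becomes a false positive for every class-0 anchor, so the loss pulls dissimilar embeddings together and its value rises relative to the clean labelling.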

RoboLLM: Robotic Vision Tasks Grounded on Multimodal Large Language Models

Oct 16, 2023

Zijun Long, George Killick, Richard McCreadie, Gerardo Aragon Camarasa

Abstract: Robotic vision applications often necessitate a wide range of visual perception tasks, such as object detection, segmentation, and identification. While there have been substantial advances in these individual tasks, integrating specialized models into a unified vision pipeline presents significant engineering challenges and costs. Recently, Multimodal Large Language Models (MLLMs) have emerged as novel backbones for various downstream tasks. We argue that leveraging the pre-training capabilities of MLLMs enables the creation of a simplified framework, thus mitigating the need for task-specific encoders. Specifically, the large-scale pretrained knowledge in MLLMs allows for easier fine-tuning to downstream robotic vision tasks and yields superior performance. We introduce the RoboLLM framework, equipped with a BEiT-3 backbone, to address all visual perception tasks in the ARMBench challenge, a large-scale robotic manipulation dataset about real-world warehouse scenarios. RoboLLM not only outperforms existing baselines but also substantially reduces the engineering burden associated with model selection and tuning. The source code is publicly available at https://github.com/longkukuhi/armbench.

MultiWay-Adapater: Adapting large-scale multi-modal models for scalable image-text retrieval

Sep 12, 2023

Zijun Long, George Killick, Richard McCreadie, Gerardo Aragon Camarasa

Abstract: As the size of Large Multi-Modal Models (LMMs) increases consistently, the adaptation of these pre-trained models to specialized tasks has become a computationally and memory-intensive challenge. Traditional fine-tuning methods require isolated, exhaustive retuning for each new task, limiting the models' versatility. Moreover, current efficient adaptation techniques often overlook modality alignment, focusing only on the knowledge extraction of new tasks. To tackle these issues, we introduce Multiway-Adapter, an innovative framework incorporating an 'Alignment Enhancer' to deepen modality alignment, enabling high transferability without tuning pre-trained parameters. Our method adds fewer than 1.25% of additional parameters to LMMs, exemplified by the BEiT-3 model in our study. This leads to superior zero-shot image-text retrieval performance compared to fully fine-tuned models, while achieving up to a 57% reduction in fine-tuning time. Our approach offers a resource-efficient and effective adaptation pathway for LMMs, broadening their applicability. The source code is publicly available at: https://github.com/longkukuhi/MultiWay-Adapter.

When hard negative sampling meets supervised contrastive learning

Aug 28, 2023

Zijun Long, George Killick, Richard McCreadie, Gerardo Aragon Camarasa, Zaiqiao Meng

Abstract: State-of-the-art image models predominantly follow a two-stage strategy: pre-training on large datasets and fine-tuning with cross-entropy loss. Many studies have shown that using cross-entropy can result in sub-optimal generalisation and stability. While the supervised contrastive loss addresses some limitations of cross-entropy loss by focusing on intra-class similarities and inter-class differences, it neglects the importance of hard negative mining. We propose that models will benefit from performance improvement by weighting negative samples based on their dissimilarity to positive counterparts. In this paper, we introduce a new supervised contrastive learning objective, SCHaNe, which incorporates hard negative sampling during the fine-tuning phase. Without requiring specialized architectures, additional data, or extra computational resources, experimental results indicate that SCHaNe outperforms the strong baseline BEiT-3 in Top-1 accuracy across various benchmarks, with significant gains of up to 3.32% in few-shot learning settings and 3.41% in full dataset fine-tuning. Importantly, our proposed objective sets a new state-of-the-art for base models on ImageNet-1k, achieving an 86.14% accuracy. Furthermore, we demonstrate that the proposed objective yields better embeddings and explains the improved effectiveness observed in our experiments.
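
The abstract does not give SCHaNe's formula, so the following is only a generic illustration of hard-negative reweighting in a contrastive loss: negatives that are more similar to the anchor receive larger weights in the denominator. The function names, the exponential weighting with a hypothetical `beta` parameter, and the example similarities are all assumptions.

```python
import math


def hard_negative_weights(neg_sims, beta=1.0):
    """Return one weight per negative; weights average to 1.

    Assumed weighting: proportional to exp(beta * similarity), so negatives
    closer to the anchor (harder negatives) are emphasised. beta=0 gives
    uniform weights.
    """
    if not neg_sims:
        return []
    peak = max(neg_sims)
    scores = [math.exp(beta * (s - peak)) for s in neg_sims]
    total = sum(scores)
    return [len(neg_sims) * s / total for s in scores]


def weighted_contrastive_loss(pos_sim, neg_sims, temperature=0.1, beta=1.0):
    """Single-anchor, single-positive contrastive loss with weighted negatives."""
    weights = hard_negative_weights(neg_sims, beta)
    pos = math.exp(pos_sim / temperature)
    neg = sum(w * math.exp(s / temperature) for w, s in zip(weights, neg_sims))
    return -math.log(pos / (pos + neg))


if __name__ == "__main__":
    sims = [0.9, 0.1, -0.5]  # cosine similarities of three negatives to the anchor
    print(hard_negative_weights(sims, beta=2.0))
    print(weighted_contrastive_loss(0.8, sims, beta=0.0))
    print(weighted_contrastive_loss(0.8, sims, beta=2.0))
```

Setting `beta=0` recovers the unweighted loss, which makes it easy to compare how much the hard negatives contribute under a given weighting strength.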

LaCViT: A Label-aware Contrastive Training Framework for Vision Transformers

Mar 31, 2023

Zijun Long, Zaiqiao Meng, Gerardo Aragon Camarasa, Richard McCreadie

Abstract: Vision Transformers have been incredibly effective when tackling computer vision tasks due to their ability to model long feature dependencies. By using large-scale training data and various self-supervised signals (e.g., masked random patches), vision transformers provide state-of-the-art performance on several benchmarking datasets, such as ImageNet-1k and CIFAR-10. However, these vision transformers pretrained over general large-scale image corpora could only produce an anisotropic representation space, limiting their generalizability and transferability to the target downstream tasks. In this paper, we propose a simple and effective Label-aware Contrastive Training framework LaCViT, which improves the isotropy of the pretrained representation space for vision transformers, thereby enabling more effective transfer learning amongst a wide range of image classification tasks. Through experimentation over five standard image classification datasets, we demonstrate that LaCViT-trained models outperform the original pretrained baselines by around 9% absolute Accuracy@1, and consistent improvements can be observed when applying LaCViT to our three evaluated vision transformers.
