Abstract: Recent multi-modal large language models have demonstrated strong ability in solving vision-language tasks. In this paper, we focus on the product understanding task, which plays an essential role in enhancing the online shopping experience. The product understanding task comprises a variety of sub-tasks, which require models to respond to diverse queries based on multi-modal product information. Traditional methods design a distinct model architecture for each sub-task. In contrast, we present PUMGPT, a large vision-language model that aims to unify all product understanding tasks under a single model structure. To bridge the gap between vision and text representations, we propose Layer-wise Adapters (LA), an approach that provides enhanced alignment with fewer visual tokens and enables parameter-efficient fine-tuning. Moreover, this inherent parameter-efficient fine-tuning ability allows PUMGPT to be readily adapted to new product understanding tasks and emerging products. We design instruction templates to generate diverse product instruction datasets. We also use open-domain datasets during training to improve the performance of PUMGPT and its generalization ability. In extensive evaluations, PUMGPT demonstrates superior performance across multiple product understanding tasks, including product captioning, category question-answering, attribute extraction, attribute question-answering, and even free-form question-answering about products.
Abstract: In cross-domain few-shot learning, the core issue is that a model trained on source tasks from source domains cannot generalize well to target tasks from the target domain, especially when the domain shift is very large. Motivated by the observation that the domain shift between training tasks and target tasks is usually reflected in their style variation, we propose Task Augmented Meta-Learning (TAML), which performs style transfer-based task augmentation to improve domain generalization ability. First, Multi-task Interpolation (MTI) is introduced to perform feature fusion across different tasks with different styles, which makes more diverse styles available. Furthermore, a novel task-augmentation strategy called Multi-Task Style Transfer (MTST) is put forward to perform style transfer on existing tasks in order to learn discriminative style-independent features. Finally, we introduce a Feature Modulation (FM) module to add random styles, which aims to improve the generalization of our model. The proposed TAML increases the diversity of styles in the training tasks and contributes to training a model with better domain generalization ability. Its effectiveness is demonstrated via theoretical analysis and thorough experiments on two popular cross-domain few-shot benchmarks.
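The abstract above names three operations (feature fusion across tasks, style transfer between tasks, and random style modulation) without giving their formulas. The NumPy sketch below shows one plausible reading and is not the authors' implementation. It assumes that "style" means the per-channel mean and standard deviation of a feature map, as in adaptive instance normalization, that fusion is a convex combination of two feature maps, and that modulation is a random per-channel scale and shift. All function names, the `lam` weight, and the `scale` parameter are hypothetical.

```python
import numpy as np


def channel_stats(x, eps=1e-5):
    """Per-sample, per-channel mean and std over the spatial dims of x (N, C, H, W)."""
    mean = x.mean(axis=(2, 3), keepdims=True)
    std = np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)
    return mean, std


def interpolate_features(x_a, x_b, lam):
    """Assumed form of feature fusion: convex combination of two tasks' features."""
    return lam * x_a + (1.0 - lam) * x_b


def transfer_style(content, style):
    """Assumed form of style transfer: keep the normalized content, swap in
    the channel statistics of the style features (AdaIN-style)."""
    c_mean, c_std = channel_stats(content)
    s_mean, s_std = channel_stats(style)
    return (content - c_mean) / c_std * s_std + s_mean


def random_modulation(x, rng, scale=0.1):
    """Assumed form of feature modulation: random per-channel scale and shift."""
    n, c = x.shape[:2]
    gamma = 1.0 + scale * rng.standard_normal((n, c, 1, 1))
    beta = scale * rng.standard_normal((n, c, 1, 1))
    return gamma * x + beta


# Usage with random stand-ins for feature maps from two tasks.
rng = np.random.default_rng(0)
feats_a = rng.standard_normal((4, 8, 5, 5))
feats_b = 3.0 * rng.standard_normal((4, 8, 5, 5)) + 2.0

fused = interpolate_features(feats_a, feats_b, lam=0.3)
stylized = transfer_style(feats_a, feats_b)
modulated = random_modulation(feats_a, rng)
```

Under these assumptions, `stylized` keeps the spatial structure of `feats_a` while taking on the channel statistics of `feats_b`, which is the sense in which an augmented task would carry a new style.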