Aman Kumar Singh

Improved Alignment of Modalities in Large Vision Language Models

Mar 25, 2025

Kartik Jangra, Aman Kumar Singh, Yashwani Mann, Geetanjali Rathee

Abstract: Recent advances in vision-language models have achieved remarkable results in enabling language models to understand visual inputs. However, a unified approach to aligning these models across diverse tasks such as image captioning and visual question answering remains a challenge. Existing methods require either very large language models or very large datasets, which makes inefficient use of existing models. This paper addresses this gap and devises a training strategy for auto-regressive vision-language models that unifies vision-language tasks such as image captioning and visual question answering. We propose four training stages for aligning the vision model with the language model; in other words, the language model is given the ability to process visual inputs. We also devise different attention masks for training transformer-based language models that improve the quality of visual features. Further, we report four findings: 1) the attention mask should not be applied to visual inputs, 2) the language model converges faster on AI-generated data, 3) more work should be done in the alignment stage during pre-training of the model, and 4) the model can easily adapt to downstream tasks such as visual question answering on healthcare datasets like PathVQA. After training for one epoch in each stage, the model outperforms larger models such as the 13-billion-parameter VILA model on common benchmarks, measured by CIDEr scores on the COCO and Flickr30k datasets, and achieves scores very close to GIT-2 on the same dataset, despite being a much smaller model trained on a much smaller dataset. All training uses available best practices, including multi-GPU parallel training, lower-precision training with 16-bit floating-point numbers, faster attention (SDPA), and gradient accumulation, and was completed within 12 hours.
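
The first finding in the abstract, that the attention mask should not be applied to visual inputs, can be illustrated with a small sketch. The abstract does not give the exact mask design, so the code below is only one plausible reading, assumed for illustration: a prefix-style mask in which visual tokens attend to each other freely while text tokens stay causal. The function names `build_prefix_mask` and `render`, and the assumption that visual tokens come first in the sequence, are hypothetical and not taken from the paper.

```python
from typing import List


def build_prefix_mask(num_visual: int, num_text: int) -> List[List[bool]]:
    """Build a square attention mask for [visual tokens | text tokens].

    mask[i][j] is True when position i is allowed to attend to position j.
    Assumption (not from the paper): visual tokens form an unmasked prefix,
    and text tokens follow the usual causal rule.
    """
    n = num_visual + num_text
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < num_visual:
                # Visual query: sees every visual token, no causal restriction,
                # but does not look ahead into the text.
                mask[i][j] = j < num_visual
            else:
                # Text query: sees all visual tokens and text up to itself.
                mask[i][j] = j <= i
    return mask


def render(mask: List[List[bool]]) -> str:
    """Render the mask as rows of 1s (attend) and 0s (blocked)."""
    return "\n".join("".join("1" if cell else "0" for cell in row) for row in mask)


if __name__ == "__main__":
    # Three visual tokens followed by two text tokens.
    print(render(build_prefix_mask(3, 2)))
```

With three visual and two text tokens, the top-left 3x3 block is fully open, whereas a plain causal mask would make it lower-triangular. A boolean mask of this kind, where True marks positions that take part in attention, matches the convention used by the `attn_mask` argument of PyTorch's `scaled_dot_product_attention`, although whether the authors build their masks this way is not stated in the abstract.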

Video Summarization: Study of various techniques

Jan 21, 2021

Ravi Raj, Varad Bhatnagar, Aman Kumar Singh, Sneha Mane, Nilima Walde

Abstract: A comparative study of various techniques for video summarization, i.e., video-to-video conversion, is presented along with their respective architectures, results, strengths, and shortcomings. In all approaches, a lengthy video is converted into a shorter video that aims to capture all important events present in the original. The definition of an 'important event' may vary with context; for example, a sports video and a documentary may have different events that are classified as important.

* Video Summarization: Study of Various Techniques, Proceedings of IRAJ International Conference, 26th May, 2019, Pune, India
