Daniel Miranda

Efficient Self-Improvement in Multimodal Large Language Models: A Model-Level Judge-Free Approach

Nov 26, 2024

Shijian Deng, Wentian Zhao, Yu-Jhe Li, Kun Wan, Daniel Miranda, Ajinkya Kale, Yapeng Tian

Abstract: Self-improvement in multimodal large language models (MLLMs) is crucial for enhancing their reliability and robustness. However, current methods often rely heavily on MLLMs themselves as judges, leading to high computational costs and potential pitfalls like reward hacking and model collapse. This paper introduces a novel, model-level judge-free self-improvement framework. Our approach employs a controlled feedback mechanism while eliminating the need for MLLMs in the verification loop. We generate preference learning pairs using a controllable hallucination mechanism and optimize data quality by leveraging lightweight, contrastive language-image encoders to evaluate and reverse pairs when necessary. Evaluations across public benchmarks and our newly introduced IC dataset designed to challenge hallucination control demonstrate that our model outperforms conventional techniques. We achieve superior precision and recall with significantly lower computational demands. This method offers an efficient pathway to scalable self-improvement in MLLMs, balancing performance gains with reduced resource requirements.
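
The "evaluate and reverse pairs when necessary" step can be illustrated with a minimal sketch. This is not the authors' implementation: the `PreferencePair` type, the `verify_pair` helper, and the toy two-dimensional embeddings are all hypothetical stand-ins, and plain cosine similarity substitutes for whatever score a real contrastive language-image encoder would produce.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Dict, Sequence


@dataclass
class PreferencePair:
    """Hypothetical container for one preference-learning pair."""
    chosen: str
    rejected: str


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def verify_pair(
    pair: PreferencePair,
    image_emb: Sequence[float],
    text_embs: Dict[str, Sequence[float]],
) -> PreferencePair:
    """Swap chosen/rejected when the rejected text matches the image better.

    Assumption: embeddings come from some image-text encoder; here they are
    supplied directly as toy vectors.
    """
    score_chosen = cosine(image_emb, text_embs[pair.chosen])
    score_rejected = cosine(image_emb, text_embs[pair.rejected])
    if score_rejected > score_chosen:
        return PreferencePair(chosen=pair.rejected, rejected=pair.chosen)
    return pair


# Toy data: the image embedding is closer to the "dog" caption.
image_emb = [1.0, 0.0]
text_embs = {
    "a dog on grass": [0.9, 0.1],
    "a cat on a sofa": [0.1, 0.9],
}
mislabeled = PreferencePair(chosen="a cat on a sofa", rejected="a dog on grass")
fixed = verify_pair(mislabeled, image_emb, text_embs)
print(fixed.chosen)
```

The point of the sketch is only that the check needs two similarity scores per pair and no large model acting as judge; how the paper thresholds or filters pairs is not stated in the abstract.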

Quadratic Is Not What You Need For Multimodal Large Language Models

Oct 08, 2024

Phu Pham, Wentian Zhao, Kun Wan, Yu-Jhe Li, Zeliang Zhang, Daniel Miranda, Ajinkya Kale, Chenliang Xu

Abstract: In the past year, the capabilities of Multimodal Large Language Models (MLLMs) have significantly improved across various aspects. However, constrained by the quadratic growth of computation in LLMs as the number of tokens increases, efficiency has become a bottleneck for further scaling MLLMs. Although recent efforts have been made to prune visual tokens or use more lightweight LLMs to reduce computation, the problem of quadratic growth in computation with the increase of visual tokens still persists. To address this, we propose a novel approach: instead of reducing the input visual tokens for LLMs, we focus on pruning vision-related computations within the LLMs. After pruning, the computation growth in the LLM is no longer quadratic with the increase of visual tokens, but linear. Surprisingly, we found that after applying such extensive pruning, the capabilities of MLLMs are comparable with the original one and even superior on some benchmarks with only 25% of the computation. This finding opens up the possibility for MLLMs to incorporate much denser visual tokens. Additionally, based on this finding, we further analyzed some architectural design deficiencies in existing MLLMs and proposed promising improvements. To the best of our knowledge, this is the first study to investigate the computational redundancy in the LLM's vision component of MLLMs. Code and checkpoints will be released soon.
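
The quadratic-versus-linear claim can be made concrete with a rough count of attention query-key pairs. The sketch below is an illustration under a loud assumption, not the paper's pruning scheme: it supposes that, after pruning, only text tokens act as queries while visual tokens remain available as keys. Both function names are hypothetical, and causal masking and non-attention computation are ignored.

```python
def full_attention_pairs(n_visual: int, n_text: int) -> int:
    """Query-key pairs when every token attends to every token."""
    n = n_visual + n_text
    return n * n


def pruned_attention_pairs(n_visual: int, n_text: int) -> int:
    """Query-key pairs under the assumed pruning.

    Assumption (not from the paper): visual tokens issue no queries, so
    only the n_text text tokens attend over all n_visual + n_text keys.
    """
    return n_text * (n_visual + n_text)


if __name__ == "__main__":
    n_text = 64
    for n_visual in (576, 1152, 2304):
        full = full_attention_pairs(n_visual, n_text)
        pruned = pruned_attention_pairs(n_visual, n_text)
        print(f"visual={n_visual:5d}  full={full:9d}  pruned={pruned:7d}")
```

With the text length held fixed, each additional visual token adds a constant `n_text` pairs to the pruned count, whereas the full count grows with the square of the total sequence length. The token counts in the loop are arbitrary example values.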
