Chenzhuang Du

Kimi-VL Technical Report

Apr 10, 2025

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei(+82 more)

Abstract:We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameters, setting a new standard for efficient multimodal thinking models. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.

What Makes for Robust Multi-Modal Models in the Face of Missing Modalities?

Oct 10, 2023

Siting Li, Chenzhuang Du, Yue Zhao, Yu Huang, Hang Zhao

Abstract:With the growing success of multi-modal learning, research on the robustness of multi-modal models, especially when facing situations with missing modalities, is receiving increased attention. Nevertheless, previous studies in this domain exhibit certain limitations, as they often lack theoretical insights or their methodologies are tied to specific network architectures or modalities. We model the scenarios of multi-modal models encountering missing modalities from an information-theoretic perspective and illustrate that the performance ceiling in such scenarios can be approached by efficiently utilizing the information inherent in non-missing modalities. In practice, there are two key aspects: (1) The encoder should be able to extract sufficiently good features from the non-missing modality; (2) The extracted features should be robust enough not to be influenced by noise during the fusion process across modalities. To this end, we introduce Uni-Modal Ensemble with Missing Modality Adaptation (UME-MMA). UME-MMA employs uni-modal pre-trained weights for the multi-modal model to enhance feature extraction and utilizes missing modality data augmentation techniques to better adapt to situations with missing modalities. In addition, UME-MMA, built on a late-fusion learning framework, allows for the plug-and-play use of various encoders, making it suitable for a wide range of modalities and enabling seamless integration of large-scale pre-trained encoders to further enhance performance. We demonstrate UME-MMA's effectiveness on audio-visual datasets (e.g., AV-MNIST, Kinetics-Sound, AVE) and vision-language datasets (e.g., MM-IMDB, UPMC Food101).
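
The abstract does not spell out which missing modality augmentation UME-MMA uses. As a rough illustration only, one common form of the idea is to randomly blank out a modality's features during training so that the fusion layer learns to cope with absent inputs. The function name `drop_modalities`, the zero-filling, and the `drop_prob` parameter below are assumptions made for this sketch, not details taken from the paper.

```python
import random

def drop_modalities(sample, drop_prob, rng):
    """Zero out each modality with probability drop_prob, always keeping at least one.

    Hypothetical helper: zero-filling is an assumed stand-in for a "missing" modality.
    """
    names = list(sample)
    dropped = [name for name in names if rng.random() < drop_prob]
    if len(dropped) == len(names):
        # Keep one randomly chosen modality so the sample is never empty.
        dropped.remove(rng.choice(dropped))
    return {
        name: [0.0] * len(features) if name in dropped else list(features)
        for name, features in sample.items()
    }

rng = random.Random(0)
sample = {"audio": [0.2, 0.7], "video": [1.0, 0.5, 0.3]}
augmented = drop_modalities(sample, drop_prob=0.5, rng=rng)
print(augmented)
```

Applied per training sample, a scheme like this exposes the model to the same missing-modality conditions it may meet at test time.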

Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models

Oct 08, 2023

Chenzhuang Du, Yue Zhao, Chonghua Liao, Jiacheng You, Jie Fu, Hang Zhao

Abstract:This paper investigates how to better leverage large-scale pre-trained uni-modal models to further enhance discriminative multi-modal learning. Even when fine-tuned with only uni-modal data, these models can outperform previous multi-modal models in certain tasks. It's clear that their incorporation into multi-modal learning would significantly improve performance. However, multi-modal learning with these models still suffers from insufficient learning of uni-modal features, which weakens the resulting multi-modal model's generalization ability. While fine-tuning uni-modal models separately and then aggregating their predictions is straightforward, it doesn't allow for adequate adaptation between modalities, also leading to sub-optimal results. To this end, we introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA). By freezing the weights of uni-modal fine-tuned models, adding extra trainable rank decomposition matrices to them, and subsequently performing multi-modal joint training, our method enhances adaptation between modalities and boosts overall performance. We demonstrate the effectiveness of MMLoRA on three dataset categories: audio-visual (e.g., AVE, Kinetics-Sound, CREMA-D), vision-language (e.g., MM-IMDB, UPMC Food101), and RGB-Optical Flow (UCF101).
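
The core mechanism described here, frozen weights plus trainable rank decomposition matrices, can be sketched in a few lines. The version below is a minimal pure-Python illustration of one low-rank-adapted linear layer, not the authors' implementation. The class name, the small Gaussian initialisation of the down-projection, and the zero initialisation of the up-projection (a common LoRA convention) are assumptions.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

class LowRankAdaptedLinear:
    """y = W x + B (A x), with W frozen and only A, B meant to be trained."""

    def __init__(self, frozen_weight, rank, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(frozen_weight), len(frozen_weight[0])
        self.frozen_weight = frozen_weight  # uni-modal fine-tuned weights, never updated
        # A: rank x in_dim, small random values (assumed initialisation).
        self.down = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        # B: out_dim x rank, zeros, so the layer initially matches the frozen model.
        self.up = [[0.0] * rank for _ in range(out_dim)]

    def forward(self, x):
        base = matvec(self.frozen_weight, x)
        delta = matvec(self.up, matvec(self.down, x))
        return [b + d for b, d in zip(base, delta)]

layer = LowRankAdaptedLinear([[1.0, 2.0], [3.0, 4.0]], rank=1)
x = [1.0, 1.0]
initial_output = layer.forward(x)
print(initial_output)
```

Because the up-projection starts at zero, joint multi-modal training in such a setup begins from the behaviour of the uni-modal fine-tuned model and only the small matrices change.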

ChatDB: Augmenting LLMs with Databases as Their Symbolic Memory

Jun 07, 2023

Chenxu Hu, Jie Fu, Chenzhuang Du, Simian Luo, Junbo Zhao, Hang Zhao

Abstract:Large language models (LLMs) with memory are computationally universal. However, mainstream LLMs are not taking full advantage of memory, and the designs are heavily influenced by biological brains. Due to their approximate nature and proneness to the accumulation of errors, conventional neural memory mechanisms cannot support LLMs to simulate complex reasoning. In this paper, we seek inspiration from modern computer architectures to augment LLMs with symbolic memory for complex multi-hop reasoning. Such a symbolic memory framework is instantiated as an LLM and a set of SQL databases, where the LLM generates SQL instructions to manipulate the SQL databases. We validate the effectiveness of the proposed memory framework on a synthetic dataset requiring complex reasoning. The project website is available at https://chatdatabase.github.io/ .
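
To make the "LLM plus SQL databases" setup concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The LLM is replaced by a hypothetical stub, `fake_llm_generate_sql`, that returns canned SQL statements. The table names, the requests, and the stub itself are illustrative assumptions and are not taken from the ChatDB code.

```python
import sqlite3

def fake_llm_generate_sql(request):
    """Stand-in for an LLM call (hypothetical): maps a request to SQL statements."""
    canned = {
        "record sale": [
            "INSERT INTO sales (item, quantity) VALUES ('apple', 3)",
            "UPDATE inventory SET stock = stock - 3 WHERE item = 'apple'",
        ],
        "check stock": ["SELECT stock FROM inventory WHERE item = 'apple'"],
    }
    return canned[request]

def build_memory():
    """Create an in-memory database that serves as the symbolic memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE inventory (item TEXT PRIMARY KEY, stock INTEGER)")
    conn.execute("CREATE TABLE sales (item TEXT, quantity INTEGER)")
    conn.execute("INSERT INTO inventory VALUES ('apple', 10)")
    return conn

def run_request(conn, request):
    """Execute the generated statements in order; return rows from the last one."""
    rows = []
    for statement in fake_llm_generate_sql(request):
        rows = conn.execute(statement).fetchall()
    conn.commit()
    return rows

conn = build_memory()
run_request(conn, "record sale")
remaining = run_request(conn, "check stock")
print(remaining)
```

The point of the design is that intermediate state lives in exact database rows rather than in approximate neural memory, so a multi-step chain of operations does not accumulate errors.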

On Uni-Modal Feature Learning in Supervised Multi-Modal Learning

May 03, 2023

Chenzhuang Du, Jiaye Teng, Tingle Li, Yichen Liu, Tianyuan Yuan, Yue Wang, Yang Yuan, Hang Zhao

Abstract:We abstract the features (i.e., learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions. Multi-modal models are expected to benefit from cross-modal interactions on the basis of ensuring uni-modal feature learning. However, recent supervised multi-modal late-fusion training approaches still suffer from insufficient learning of uni-modal features on each modality. We prove that this phenomenon does hurt the model's generalization ability. To this end, we propose to choose a targeted late-fusion learning method for the given supervised multi-modal task from Uni-Modal Ensemble (UME) and the proposed Uni-Modal Teacher (UMT), according to the distribution of uni-modal and paired features. We demonstrate that, under a simple guiding strategy, we can achieve comparable results to other complex late-fusion or intermediate-fusion methods on various multi-modal datasets, including VGG-Sound, Kinetics-400, UCF101, and ModelNet40.

Intrinsically Motivated Self-supervised Learning in Reinforcement Learning

Jun 26, 2021

Yue Zhao, Chenzhuang Du, Hang Zhao, Tiejun Li

Abstract:In vision-based reinforcement learning (RL) tasks, it is prevalent to assign the auxiliary task with a surrogate self-supervised loss so as to obtain more semantic representations and improve sample efficiency. However, abundant information in self-supervised auxiliary tasks has been disregarded, since the representation learning part and the decision-making part are separated. To sufficiently utilize information in the auxiliary task, we present a simple yet effective idea to employ self-supervised loss as an intrinsic reward, called Intrinsically Motivated Self-Supervised learning in Reinforcement learning (IM-SSR). We formally show that the self-supervised loss can be decomposed as exploration for novel states and robustness improvement from nuisance elimination. IM-SSR can be effortlessly plugged into any reinforcement learning with self-supervised auxiliary objectives with nearly no additional cost. Combined with IM-SSR, the previous underlying algorithms achieve salient improvements on both sample efficiency and generalization in various vision-based robotics tasks from the DeepMind Control Suite, especially when the reward signal is sparse.
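
The idea of reusing the self-supervised loss as an intrinsic reward can be illustrated with a toy reward-shaping function. The additive combination and the `scale` coefficient below are assumptions made for this sketch. The abstract does not give the exact formula used by IM-SSR.

```python
def intrinsic_reward(self_supervised_loss, scale=0.1):
    """Turn the auxiliary self-supervised loss into a bonus (scale is hypothetical)."""
    return scale * self_supervised_loss

def total_reward(extrinsic_reward, self_supervised_loss, scale=0.1):
    """Assumed additive shaping: environment reward plus the intrinsic bonus."""
    return extrinsic_reward + intrinsic_reward(self_supervised_loss, scale)

# A sparse-reward episode: the environment pays out only on the last step.
sparse_extrinsic = [0.0, 0.0, 1.0]
ssl_losses = [0.8, 0.2, 0.5]
rewards = [total_reward(r, loss) for r, loss in zip(sparse_extrinsic, ssl_losses)]
print(rewards)
```

Under this kind of shaping, steps with a high self-supervised loss still yield a learning signal even when the environment reward is zero, which matches the abstract's emphasis on sparse-reward tasks.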

Improving Multi-Modal Learning with Uni-Modal Teachers

Jun 21, 2021

Chenzhuang Du, Tingle Li, Yichen Liu, Zixin Wen, Tianyu Hua, Yue Wang, Hang Zhao

Abstract:Learning multi-modal representations is an essential step towards real-world robotic applications, and various multi-modal fusion models have been developed for this purpose. However, we observe that existing models, whose objectives are mostly based on joint training, often suffer from learning inferior representations of each modality. We name this problem Modality Failure, and hypothesize that the imbalance of modalities and the implicit bias of common objectives in fusion methods prevent encoders of each modality from sufficient feature learning. To this end, we propose a new multi-modal learning method, Uni-Modal Teacher, which combines the fusion objective and uni-modal distillation to tackle the modality failure problem. We show that our method not only drastically improves the representation of each modality, but also improves the overall multi-modal task performance. Our method can be effectively generalized to most multi-modal fusion approaches. We achieve more than 3% improvement on the VGGSound audio-visual classification task, as well as improving performance on the NYU Depth V2 RGB-D image segmentation task.
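
Combining the fusion objective with uni-modal distillation amounts to adding a per-modality distillation term to the task loss. The sketch below assumes a mean-squared-error feature distillation and a single weighting factor `weight`. Both are illustrative choices, and the paper's exact loss may differ.

```python
def mean_squared_error(student, teacher):
    """Average squared difference between two feature vectors."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)

def uni_modal_teacher_loss(fusion_loss, student_features, teacher_features, weight=1.0):
    """Fusion (task) loss plus a distillation term for every modality.

    teacher_features would come from frozen, separately trained uni-modal
    encoders; here they are plain lists for illustration.
    """
    distillation = sum(
        mean_squared_error(student_features[name], teacher_features[name])
        for name in student_features
    )
    return fusion_loss + weight * distillation

student = {"audio": [1.0, 2.0], "video": [0.0, 0.0]}
teacher = {"audio": [1.0, 4.0], "video": [0.0, 0.0]}
loss = uni_modal_teacher_loss(0.5, student, teacher)
print(loss)
```

In this toy case only the audio encoder deviates from its teacher, so the distillation term pulls on that modality alone while the fusion loss is left unchanged.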

What Makes Multimodal Learning Better than Single (Provably)

Jun 08, 2021

Yu Huang, Chenzhuang Du, Zihui Xue, Xuanyao Chen, Hang Zhao, Longbo Huang

Abstract:The world provides us with data of multiple modalities. Intuitively, models fusing data from different modalities outperform unimodal models, since more information is aggregated. Recently, joining the success of deep learning, there is an influential line of work on deep multimodal learning, which has remarkable empirical results on various applications. However, theoretical justifications in this field are notably lacking. Can multimodal learning provably perform better than unimodal? In this paper, we answer this question under a most popular multimodal learning framework, which firstly encodes features from different modalities into a common latent space and seamlessly maps the latent representations into the task space. We prove that learning with multiple modalities achieves a smaller population risk than only using its subset of modalities. The main intuition is that the former has a more accurate estimate of the latent space representation. To the best of our knowledge, this is the first theoretical treatment to capture important qualitative phenomena observed in real multimodal applications. Combined with experimental results, we show that multimodal learning does possess an appealing formal guarantee.

* 15 pages, 2 figures

Secure Data Sharing With Flow Model

Sep 24, 2020

Chenwei Wu, Chenzhuang Du, Yang Yuan

Abstract:In the classical multi-party computation setting, multiple parties jointly compute a function without revealing their own input data. We consider a variant of this problem, where the input data can be shared for machine learning training purposes, but the data are also encrypted so that they cannot be recovered by other parties. We present a rotation-based method using a flow model, and theoretically justify its security. We demonstrate the effectiveness of our method in different scenarios, including supervised secure model training and unsupervised generative model training. Our code is available at https://github.com/duchenzhuang/flowencrypt.
