Yuzi Yan

Kimi-VL Technical Report

Apr 10, 2025

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei (+82 more)

Abstract: We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameters, setting a new standard for efficient multimodal thinking models. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.

Boosting Deductive Reasoning with Step Signals In RLHF

Oct 12, 2024

Jialian Li, Yipin Zhang, Wei Shen, Yuzi Yan, Jian Xie, Dong Yan

Abstract: Logical reasoning is a crucial task for Large Language Models (LLMs), enabling them to tackle complex problems. Among reasoning tasks, multi-step reasoning poses a particular challenge. Grounded in the theory of formal logic, we have developed an automated method, Multi-step Deduction (MuseD), for generating deductive reasoning data. MuseD has allowed us to create training and testing datasets for multi-step reasoning. Our generation method enables control over the complexity of the generated instructions, facilitating training and evaluation of models across different difficulty levels. Through RLHF training, our training data has yielded significant improvements in logical capabilities on both in-domain and out-of-domain reasoning tasks. Additionally, we have conducted tests to assess the multi-step reasoning abilities of various models.

Uncertainty-aware Reward Model: Teaching Reward Models to Know What is Unknown

Oct 01, 2024

Xingzhou Lou, Dong Yan, Wei Shen, Yuzi Yan, Jian Xie, Junge Zhang

Abstract: Reward models (RMs) play a critical role in aligning generations of large language models (LLMs) to human expectations. However, prevailing RMs fail to capture the stochasticity within human preferences and cannot effectively evaluate the reliability of reward predictions. To address these issues, we propose Uncertainty-aware RM (URM) and Uncertainty-aware RM Ensemble (URME) to incorporate and manage uncertainty in reward modeling. URM can model the distribution of disentangled attributes within human preferences, while URME quantifies uncertainty through discrepancies in the ensemble, thereby identifying potential lack of knowledge during reward evaluation. Experimental results indicate that the proposed URM achieves state-of-the-art performance compared to models of the same size, demonstrating the effectiveness of modeling uncertainty within human preferences. Furthermore, empirical results show that through uncertainty quantification, URM and URME can identify unreliable predictions to improve the quality of reward evaluations.
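
The abstract describes URME as quantifying uncertainty "through discrepancies in the ensemble". As a rough illustration only, the sketch below scores one response with several reward models and treats the spread of their scores as the uncertainty signal. The use of the population standard deviation, the function name `ensemble_reward`, and the `max_std` threshold are all assumptions made for this example, not details taken from the paper.

```python
from statistics import mean, pstdev


def ensemble_reward(rewards, max_std=0.5):
    """Aggregate the scores that several reward models gave one response.

    rewards: list of scalar rewards, one per ensemble member.
    max_std: hypothetical cut-off above which the prediction is
             treated as unreliable (an assumption for this sketch).
    Returns (mean reward, disagreement, is_reliable).
    """
    avg = mean(rewards)
    # Disagreement between members, measured here as the population
    # standard deviation of their scores.
    spread = pstdev(rewards)
    return avg, spread, spread <= max_std


# Members agree: low spread, prediction kept.
print(ensemble_reward([1.0, 1.1, 0.9]))
# Members disagree: high spread, prediction flagged as unreliable.
print(ensemble_reward([-2.0, 0.0, 2.0]))
```

A downstream pipeline could then drop or down-weight the flagged predictions, which matches the abstract's stated goal of identifying unreliable reward evaluations.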

3D-Properties: Identifying Challenges in DPO and Charting a Path Forward

Jun 11, 2024

Yuzi Yan, Yibo Miao, Jialian Li, Yipin Zhang, Jian Xie, Zhijie Deng, Dong Yan

Abstract: Aligning large language models (LLMs) with human preference has recently gained tremendous attention, with the canonical yet costly RLHF-PPO and the simple and straightforward Direct Preference Optimization (DPO) as two examples. Despite its efficiency, DPO has rarely been used in state-of-the-art production-level LLMs, implying its potential pathologies. In this work, we revisit DPO with a comprehensive examination of its empirical efficacy and a systematic comparison with RLHF-PPO. We identify the 3D-properties of DPO's learning outcomes: the Drastic drop in the likelihood of rejected responses, the Degradation into LLM unlearning, and the Dispersion effect on unseen responses, through experiments with both a carefully designed toy model and practical LLMs on tasks including mathematical problem-solving and instruction following. These findings inherently connect to some observations made by related works, and we additionally contribute a plausible theoretical explanation for them. Accordingly, we propose easy regularization methods to mitigate the issues caused by the 3D-properties, improving the training stability and final performance of DPO. Our contributions also include an investigation into how the distribution of the paired preference data impacts the effectiveness of DPO. We hope this work could offer research directions to narrow the gap between reward-free preference learning methods and reward-based ones.
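
For readers unfamiliar with the objective being examined, the following is a minimal sketch of the standard per-pair DPO loss, computed from sequence log-probabilities under the policy and a frozen reference model. It is the plain DPO objective, not the regularized variants the paper proposes; the function name, argument names, and the default `beta=0.1` are choices made for this example.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    All four inputs are summed token log-probabilities of a full
    response; the ref_* values come from the frozen reference model.
    """
    # Scaled margin between the implicit rewards of the two responses.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: margin is 0, loss is log(2).
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Only the rejected log-probability drops, yet the loss still goes down.
print(dpo_loss(-5.0, -20.0, -5.0, -5.0))
```

The second call shows that the loss can be reduced purely by lowering the likelihood of the rejected response, without raising that of the chosen one, which is consistent with the first of the three properties named in the abstract.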

Exploring the LLM Journey from Cognition to Expression with Linear Representations

May 27, 2024

Yuzi Yan, Jialian Li, Yipin Zhang, Dong Yan

Abstract: This paper presents an in-depth examination of the evolution and interplay of cognitive and expressive capabilities in large language models (LLMs), with a specific focus on Baichuan-7B and Baichuan-33B, an advanced bilingual (Chinese and English) LLM series. We define and explore the model's cognitive and expressive capabilities through linear representations across three critical phases: Pretraining, Supervised Fine-Tuning (SFT), and Reinforcement Learning from Human Feedback (RLHF). Cognitive capability is defined as the quantity and quality of information conveyed by the neuron output vectors within the network, similar to the neural signal processing in human cognition. Expressive capability is defined as the model's capability to produce word-level output. Our findings unveil a sequential development pattern, where cognitive abilities are largely established during Pretraining, whereas expressive abilities predominantly advance during SFT and RLHF. Statistical analyses confirm a significant correlation between the two capabilities, suggesting that cognitive capacity may limit expressive potential. The paper also explores the theoretical underpinnings of these divergent developmental trajectories and their connection to the LLMs' architectural design. Moreover, we evaluate various optimization-independent strategies, such as few-shot learning and repeated sampling, which bridge the gap between cognitive and expressive capabilities. This research reveals the potential connection between the hidden space and the output space, contributing valuable insights into the interpretability and controllability of their training processes.

* Published in ICML 2024

Distributed Policy Gradient for Linear Quadratic Networked Control with Limited Communication Range

Mar 05, 2024

Yuzi Yan, Yuan Shen

Abstract: This paper proposes a scalable distributed policy gradient method and proves its convergence to a near-optimal solution in multi-agent linear quadratic networked systems. The agents engage within a specified network under local communication constraints, implying that each agent can only exchange information with a limited number of neighboring agents. On the underlying graph of the network, each agent implements its control input depending on its nearby neighbors' states in the linear quadratic control setting. We show that it is possible to approximate the exact gradient using only local information. Compared with the centralized optimal controller, the performance gap decreases to zero exponentially as the communication and control ranges increase. We also demonstrate how increasing the communication range enhances system stability in the gradient descent process, thereby elucidating a critical trade-off. The simulation results verify our theoretical findings.

* 14 pages, 6 figures

Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech

Mar 31, 2022

Guangyan Zhang, Kaitao Song, Xu Tan, Daxin Tan, Yuzi Yan, Yanqing Liu, Gang Wang, Wei Zhou, Tao Qin, Tan Lee (+1 more)

Abstract: Recently, leveraging BERT pre-training to improve the phoneme encoder in text to speech (TTS) has drawn increasing attention. However, these works apply pre-training with character-based units to enhance the TTS phoneme encoder, which is inconsistent with the TTS fine-tuning that takes phonemes as input. Pre-training only with phonemes as input can alleviate the input mismatch but lacks the ability to model rich representations and semantic information due to the limited phoneme vocabulary. In this paper, we propose Mixed-Phoneme BERT, a novel variant of the BERT model that uses mixed phoneme and sup-phoneme representations to enhance the learning capability. Specifically, we merge the adjacent phonemes into sup-phonemes and combine the phoneme sequence and the merged sup-phoneme sequence as the model input, which can enhance the model capacity to learn rich contextual representations. Experimental results demonstrate that our proposed Mixed-Phoneme BERT significantly improves the TTS performance with 0.30 CMOS gain compared with the FastSpeech 2 baseline. The Mixed-Phoneme BERT achieves 3x inference speedup and similar voice quality to the previous TTS pre-trained model PnG BERT.

* Submitted to INTERSPEECH 2022

Relative Distributed Formation and Obstacle Avoidance with Multi-agent Reinforcement Learning

Nov 14, 2021

Yuzi Yan, Xiaoxiang Li, Xinyou Qiu, Jiantao Qiu, Jian Wang, Yu Wang, Yuan Shen

Abstract: Multi-agent formation as well as obstacle avoidance is one of the most actively studied topics in the field of multi-agent systems. Although some classic controllers like model predictive control (MPC) and fuzzy control achieve a certain measure of success, most of them require precise global information which is not accessible in harsh environments. On the other hand, some reinforcement learning (RL) based approaches adopt the leader-follower structure to organize different agents' behaviors, which sacrifices the collaboration between agents, thus suffering from bottlenecks in maneuverability and robustness. In this paper, we propose a distributed formation and obstacle avoidance method based on multi-agent reinforcement learning (MARL). Agents in our system only utilize local and relative information to make decisions and control themselves distributively. Agents in the multi-agent system will reorganize themselves into a new topology quickly if any of them is disconnected. Our method achieves better performance regarding formation error and formation convergence rate, and an on-par success rate of obstacle avoidance, compared with baselines (both classic control methods and another RL-based method). The feasibility of our method is verified by both simulation and hardware implementation with Ackermann-steering vehicles.

Full Attention Bidirectional Deep Learning Structure for Single Channel Speech Enhancement

Aug 27, 2021

Yuzi Yan, Wei-Qiang Zhang, Michael T. Johnson

Abstract: As the cornerstone of other important technologies, such as speech recognition and speech synthesis, speech enhancement is a critical area in audio signal processing. In this paper, a new deep learning structure for speech enhancement is demonstrated. The model introduces a "full" attention mechanism to a bidirectional sequence-to-sequence method to make use of latent information after each focal frame. This is an extension of the previous attention-based RNN method. The proposed bidirectional attention-based architecture achieves better performance in terms of speech quality (PESQ), compared with OM-LSA, CNN-LSTM, T-GSA and the unidirectional attention-based LSTM baseline.

* 4 pages

AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style

Jul 06, 2021

Yuzi Yan, Xu Tan, Bohan Li, Guangyan Zhang, Tao Qin, Sheng Zhao, Yuan Shen, Wei-Qiang Zhang, Tie-Yan Liu

Abstract: While recent text to speech (TTS) models perform very well in synthesizing reading-style (e.g., audiobook) speech, it is still challenging to synthesize spontaneous-style speech (e.g., podcast or conversation), mainly for two reasons: 1) the lack of training data for spontaneous speech; 2) the difficulty in modeling the filled pauses (um and uh) and diverse rhythms in spontaneous speech. In this paper, we develop AdaSpeech 3, an adaptive TTS system that fine-tunes a well-trained reading-style TTS model for spontaneous-style speech. Specifically, 1) to insert filled pauses (FP) in the text sequence appropriately, we introduce an FP predictor to the TTS model; 2) to model the varying rhythms, we introduce a duration predictor based on mixture of experts (MoE), which contains three experts responsible for the generation of fast, medium and slow speech respectively, and fine-tune it as well as the pitch predictor for rhythm adaptation; 3) to adapt to other speaker timbres, we fine-tune some parameters in the decoder with a small amount of speech data. To address the lack of training data, we mine a spontaneous speech dataset to support this work and facilitate future research on spontaneous TTS. Experiments show that AdaSpeech 3 synthesizes speech with natural FP and rhythms in spontaneous styles, and achieves much better MOS and SMOS scores than previous adaptive TTS systems.

* Accepted by INTERSPEECH 2021
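
The AdaSpeech 3 abstract mentions a duration predictor built as a mixture of three experts for fast, medium and slow speech. A generic way to combine such experts is a softmax-gated weighted sum, sketched below for a single phoneme. Whether AdaSpeech 3 combines its experts this way is not stated in the abstract, so the gating scheme, the function names, and the example numbers are all assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_duration(expert_durations, gate_logits):
    """Blend per-expert duration predictions for one phoneme.

    expert_durations: predictions (e.g. in frames) from the
                      hypothetical fast, medium and slow experts.
    gate_logits:      unnormalized gate scores, one per expert.
    """
    weights = softmax(gate_logits)
    # Weighted sum of expert outputs; weights sum to 1.
    return sum(w * d for w, d in zip(weights, expert_durations))


# Uniform gate: the result is the plain average of the three experts.
print(moe_duration([2.0, 4.0, 6.0], [0.0, 0.0, 0.0]))
# Gate strongly favours the "slow" expert: result approaches 6 frames.
print(moe_duration([2.0, 4.0, 6.0], [0.0, 0.0, 10.0]))
```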
