Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kunhao Zheng

Optimizing Language Models for Inference Time Objectives using Reinforcement Learning

Mar 25, 2025

Yunhao Tang, Kunhao Zheng, Gabriel Synnaeve, Rémi Munos

Abstract:In this work, we investigate the merits of explicitly optimizing for inference time algorithmic performance during model training. We show how optimizing for inference time performance can improve overall model efficacy. We consider generic inference time objectives with $k$ samples, with a focus on pass@$k$ and majority voting as two main applications. With language model training on reasoning datasets, we showcase the performance trade-off enabled by training with such objectives. When training on code generation tasks, we show that the approach significantly improves pass@$k$ objectives compared to the baseline method.

Via

Access Paper or Ask Questions

The KoLMogorov Test: Compression by Code Generation

Mar 18, 2025

Ori Yoran, Kunhao Zheng, Fabian Gloeckle, Jonas Gehring, Gabriel Synnaeve, Taco Cohen

Abstract:Compression is at the heart of intelligence. A theoretically optimal way to compress any sequence of data is to find the shortest program that outputs that sequence and then halts. However, such 'Kolmogorov compression' is uncomputable, and code generating LLMs struggle to approximate this theoretical ideal, as it requires reasoning, planning and search capabilities beyond those of current models. In this work, we introduce the KoLMogorov-Test (KT), a compression-as-intelligence test for code generating LLMs. In KT a model is presented with a sequence of data at inference time, and asked to generate the shortest program that produces the sequence. We identify several benefits of KT for both evaluation and training: an essentially infinite number of problem instances of varying difficulty is readily available, strong baselines already exist, the evaluation metric (compression) cannot be gamed, and pretraining data contamination is highly unlikely. To evaluate current models, we use audio, text, and DNA data, as well as sequences produced by random synthetic programs. Current flagship models perform poorly - both GPT4-o and Llama-3.1-405B struggle on our natural and synthetic sequences. On our synthetic distribution, we are able to train code generation models with lower compression rates than previous approaches. Moreover, we show that gains on synthetic data generalize poorly to real data, suggesting that new innovations are necessary for additional gains on KT.

Via

Access Paper or Ask Questions

Soft Policy Optimization: Online Off-Policy RL for Sequence Models

Mar 07, 2025

Taco Cohen, David W. Zhang, Kunhao Zheng, Yunhao Tang, Remi Munos, Gabriel Synnaeve

Figure 1 for Soft Policy Optimization: Online Off-Policy RL for Sequence Models

Figure 2 for Soft Policy Optimization: Online Off-Policy RL for Sequence Models

Abstract:RL-based post-training of language models is almost exclusively done using on-policy methods such as PPO. These methods cannot learn from arbitrary sequences such as those produced earlier in training, in earlier runs, by human experts or other policies, or by decoding and exploration methods. This results in severe sample inefficiency and exploration difficulties, as well as a potential loss of diversity in the policy responses. Moreover, asynchronous PPO implementations require frequent and costly model transfers, and typically use value models which require a large amount of memory. In this paper we introduce Soft Policy Optimization (SPO), a simple, scalable and principled Soft RL method for sequence model policies that can learn from arbitrary online and offline trajectories and does not require a separate value model. In experiments on code contests, we shows that SPO outperforms PPO on pass@10, is significantly faster and more memory efficient, is able to benefit from off-policy data, enjoys improved stability, and learns more diverse (i.e. soft) policies.

Via

Access Paper or Ask Questions

PILAF: Optimal Human Preference Sampling for Reward Modeling

Feb 06, 2025

Yunzhen Feng, Ariel Kwiatkowski, Kunhao Zheng, Julia Kempe, Yaqi Duan

Abstract:As large language models increasingly drive real-world applications, aligning them with human values becomes paramount. Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique, translating preference data into reward models when oracle human values remain inaccessible. In practice, RLHF mostly relies on approximate reward models, which may not consistently guide the policy toward maximizing the underlying human values. We propose Policy-Interpolated Learning for Aligned Feedback (PILAF), a novel response sampling strategy for preference labeling that explicitly aligns preference learning with maximizing the underlying oracle reward. PILAF is theoretically grounded, demonstrating optimality from both an optimization and a statistical perspective. The method is straightforward to implement and demonstrates strong performance in iterative and online RLHF settings where feedback curation is critical.

Via

Access Paper or Ask Questions

What Makes Large Language Models Reason in (Multi-Turn) Code Generation?

Oct 10, 2024

Kunhao Zheng, Juliette Decugis, Jonas Gehring, Taco Cohen, Benjamin Negrevergne, Gabriel Synnaeve

Figure 1 for What Makes Large Language Models Reason in (Multi-Turn) Code Generation?

Figure 2 for What Makes Large Language Models Reason in (Multi-Turn) Code Generation?

Figure 3 for What Makes Large Language Models Reason in (Multi-Turn) Code Generation?

Figure 4 for What Makes Large Language Models Reason in (Multi-Turn) Code Generation?

Abstract:Prompting techniques such as chain-of-thought have established themselves as a popular vehicle for improving the outputs of large language models (LLMs). For code generation, however, their exact mechanics and efficacy are under-explored. We thus investigate the effects of a wide range of prompting strategies with a focus on automatic re-prompting over multiple turns and computational requirements. After systematically decomposing reasoning, instruction, and execution feedback prompts, we conduct an extensive grid search on the competitive programming benchmarks CodeContests and TACO for multiple LLM families and sizes (Llama 3.0 and 3.1, 8B, 70B, 405B, and GPT-4o). Our study reveals strategies that consistently improve performance across all models with small and large sampling budgets. We then show how finetuning with such an optimal configuration allows models to internalize the induced reasoning process and obtain improvements in performance and scalability for multi-turn code generation.

Via

Access Paper or Ask Questions

RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning

Oct 02, 2024

Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Taco Cohen, Gabriel Synnaeve

Figure 1 for RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning

Figure 2 for RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning

Figure 3 for RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning

Figure 4 for RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning

Abstract:Large language models (LLMs) deployed as agents solve user-specified tasks over multiple steps while keeping the required manual engagement to a minimum. Crucially, such LLMs need to ground their generations in any feedback obtained to reliably achieve desired outcomes. We propose an end-to-end reinforcement learning method for teaching models to leverage execution feedback in the realm of code synthesis, where state-of-the-art LLMs struggle to improve code iteratively compared to independent sampling. We benchmark on competitive programming tasks, where we achieve new start-of-the art results with both small (8B parameters) and large (70B) models while reducing the amount of samples required by an order of magnitude. Our analysis of inference-time behavior demonstrates that our method produces LLMs that effectively leverage automatic feedback over multiple steps.

Via

Access Paper or Ask Questions

D4FT: A Deep Learning Approach to Kohn-Sham Density Functional Theory

Mar 01, 2023

Tianbo Li, Min Lin, Zheyuan Hu, Kunhao Zheng, Giovanni Vignale, Kenji Kawaguchi, A. H. Castro Neto, Kostya S. Novoselov, Shuicheng Yan

Figure 1 for D4FT: A Deep Learning Approach to Kohn-Sham Density Functional Theory

Figure 2 for D4FT: A Deep Learning Approach to Kohn-Sham Density Functional Theory

Figure 3 for D4FT: A Deep Learning Approach to Kohn-Sham Density Functional Theory

Figure 4 for D4FT: A Deep Learning Approach to Kohn-Sham Density Functional Theory

Abstract:Kohn-Sham Density Functional Theory (KS-DFT) has been traditionally solved by the Self-Consistent Field (SCF) method. Behind the SCF loop is the physics intuition of solving a system of non-interactive single-electron wave functions under an effective potential. In this work, we propose a deep learning approach to KS-DFT. First, in contrast to the conventional SCF loop, we propose to directly minimize the total energy by reparameterizing the orthogonal constraint as a feed-forward computation. We prove that such an approach has the same expressivity as the SCF method, yet reduces the computational complexity from O(N^4) to O(N^3). Second, the numerical integration which involves a summation over the quadrature grids can be amortized to the optimization steps. At each step, stochastic gradient descent (SGD) is performed with a sampled minibatch of the grids. Extensive experiments are carried out to demonstrate the advantage of our approach in terms of efficiency and stability. In addition, we show that our approach enables us to explore more complex neural-based wave functions.

* Accepted by The Eleventh International Conference on Learning Representations (ICLR 2023, notable-top-25%)

Via

Access Paper or Ask Questions

Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization

Dec 19, 2022

Chen Ju, Kunhao Zheng, Jinxiang Liu, Peisen Zhao, Ya Zhang, Jianlong Chang, Yanfeng Wang, Qi Tian

Figure 1 for Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization

Figure 2 for Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization

Figure 3 for Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization

Figure 4 for Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization

Abstract:Weakly-supervised temporal action localization (WTAL) learns to detect and classify action instances with only category labels. Most methods widely adopt the off-the-shelf Classification-Based Pre-training (CBP) to generate video features for action localization. However, the different optimization objectives between classification and localization, make temporally localized results suffer from the serious incomplete issue. To tackle this issue without additional annotations, this paper considers to distill free action knowledge from Vision-Language Pre-training (VLP), since we surprisingly observe that the localization results of vanilla VLP have an over-complete issue, which is just complementary to the CBP results. To fuse such complementarity, we propose a novel distillation-collaboration framework with two branches acting as CBP and VLP respectively. The framework is optimized through a dual-branch alternate training strategy. Specifically, during the B step, we distill the confident background pseudo-labels from the CBP branch; while during the F step, the confident foreground pseudo-labels are distilled from the VLP branch. And as a result, the dual-branch complementarity is effectively fused to promote a strong alliance. Extensive experiments and ablation studies on THUMOS14 and ActivityNet1.2 reveal that our method significantly outperforms state-of-the-art methods.

* The first two authors share the same contribution

Via

Access Paper or Ask Questions

Formal Mathematics Statement Curriculum Learning

Feb 03, 2022

Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, Ilya Sutskever

Figure 1 for Formal Mathematics Statement Curriculum Learning

Figure 2 for Formal Mathematics Statement Curriculum Learning

Figure 3 for Formal Mathematics Statement Curriculum Learning

Figure 4 for Formal Mathematics Statement Curriculum Learning

Abstract:We explore the use of expert iteration in the context of language modeling applied to formal mathematics. We show that at same compute budget, expert iteration, by which we mean proof search interleaved with learning, dramatically outperforms proof search only. We also observe that when applied to a collection of formal statements of sufficiently varied difficulty, expert iteration is capable of finding and solving a curriculum of increasingly difficult problems, without the need for associated ground-truth proofs. Finally, by applying this expert iteration to a manually curated set of problem statements, we achieve state-of-the-art on the miniF2F benchmark, automatically solving multiple challenging problems drawn from high school olympiads.

Via

Access Paper or Ask Questions

Prompting Visual-Language Models for Efficient Video Understanding

Dec 08, 2021

Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, Weidi Xie

Figure 1 for Prompting Visual-Language Models for Efficient Video Understanding

Figure 2 for Prompting Visual-Language Models for Efficient Video Understanding

Figure 3 for Prompting Visual-Language Models for Efficient Video Understanding

Figure 4 for Prompting Visual-Language Models for Efficient Video Understanding

Abstract:Visual-language pre-training has shown great success for learning joint visual-textual representations from large-scale web data, demonstrating remarkable ability for zero-shot generalisation. This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training, and here, we consider video understanding tasks. Specifically, we propose to optimise a few random vectors, termed as continuous prompt vectors, that convert the novel tasks into the same format as the pre-training objectives. In addition, to bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacking on top of frame-wise visual features. Experimentally, we conduct extensive ablation studies to analyse the critical components and necessities. On 9 public benchmarks of action recognition, action localisation, and text-video retrieval, across closed-set, few-shot, open-set scenarios, we achieve competitive or state-of-the-art performance to existing methods, despite training significantly fewer parameters.

* Project page: https://ju-chen.github.io/efficient-prompt/

Via

Access Paper or Ask Questions