Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Alexander M Rush

Composer 2 Technical Report

Mar 25, 2026

Cursor Reseach, :, Aaron Chan, Ahmed Shalaby, Alexander Wettig, Aman Sanger, Andrew Zhai, Anurag Ajay, Ashvin Nair, Charlie Snell(+46 more)

Abstract:Composer 2 is a specialized model designed for agentic software engineering. The model demonstrates strong long-term planning and coding intelligence while maintaining the ability to efficiently solve problems for interactive use. The model is trained in two phases: first, continued pretraining to improve the model's knowledge and latent coding ability, followed by large-scale reinforcement learning to improve end-to-end coding performance through stronger reasoning, accurate multi-step execution, and coherence on long-horizon realistic coding problems. We develop infrastructure to support training in the same Cursor harness that is used by the deployed model, with equivalent tools and structure, and use environments that match real problems closely. To measure the ability of the model on increasingly difficult tasks, we introduce a benchmark derived from real software engineering problems in large codebases including our own. Composer 2 is a frontier-level coding model and demonstrates a process for training strong domain-specialized models. On our CursorBench evaluations the model achieves a major improvement in accuracy compared to previous Composer models (61.3). On public benchmarks the model scores 61.7 on Terminal-Bench and 73.7 on SWE-bench Multilingual in our harness, comparable to state-of-the-art systems.

Via

Access Paper or Ask Questions

Multi-Turn Code Generation Through Single-Step Rewards

Feb 27, 2025

Arnav Kumar Jain, Gonzalo Gonzalez-Pumariega, Wayne Chen, Alexander M Rush, Wenting Zhao, Sanjiban Choudhury

Abstract:We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, $\mu$Code, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. $\mu$Code iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over the state-of-the-art baselines. We provide analysis of the design choices of the reward models and policy, and show the efficacy of $\mu$Code at utilizing the execution feedback. Our code is available at https://github.com/portal-cornell/muCode.

* 9 pages (not including references or appendix); 6 figures (in main paper); (v1) preprint

Via

Access Paper or Ask Questions

Commit0: Library Generation from Scratch

Dec 02, 2024

Wenting Zhao, Nan Jiang, Celine Lee, Justin T Chiu, Claire Cardie, Matthias Gallé, Alexander M Rush

Abstract:With the goal of benchmarking generative systems beyond expert software development ability, we introduce Commit0, a benchmark that challenges AI agents to write libraries from scratch. Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests, with the goal of producing an implementation of this API accordingly. The implementation is validated through running these unit tests. As a benchmark, Commit0 is designed to move beyond static one-shot code generation towards agents that must process long-form natural language specifications, adapt to multi-stage feedback, and generate code with complex dependencies. Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate. Our experiments demonstrate that while current agents can pass some unit tests, none can yet fully reproduce full libraries. Results also show that interactive feedback is quite useful for models to generate code that passes more unit tests, validating the benchmarks that facilitate its use.

Via

Access Paper or Ask Questions

Great Memory, Shallow Reasoning: Limits of $k$NN-LMs

Aug 21, 2024

Shangyi Geng, Wenting Zhao, Alexander M Rush

Figure 1 for Great Memory, Shallow Reasoning: Limits of $k$NN-LMs

Figure 2 for Great Memory, Shallow Reasoning: Limits of $k$NN-LMs

Figure 3 for Great Memory, Shallow Reasoning: Limits of $k$NN-LMs

Figure 4 for Great Memory, Shallow Reasoning: Limits of $k$NN-LMs

Abstract:$K$-nearest neighbor language models ($k$NN-LMs), which integrate retrieval with next-word prediction, have demonstrated strong performance in language modeling as well as downstream NLP benchmarks. These results have led researchers to argue that models trained on poor quality or outdated data could perform well by employing a $k$NN extension that has access to a higher-quality datastore. In this work, we ask whether this improved ability to recall information really translates into downstream abilities. We extensively evaluate $k$NN-LMs on a diverse set of tasks, ranging from sentiment classification and commonsense reasoning to multi-hop reasoning. Results show that $k$NN-LMs excel at memory-intensive tasks, where utilizing the patterns in the input is sufficient for determining the output, but struggle with reasoning tasks that require integrating multiple pieces of information to derive new knowledge. We further demonstrate through oracle experiments and qualitative analysis that even with perfect retrieval, $k$NN-LMs still fail to determine the correct answers, placing an upper bound on their reasoning performance. Code and datastores are released at https://github.com/GSYfate/knnlm-limits/.

Via

Access Paper or Ask Questions

ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models

Jun 24, 2024

Yash Akhauri, Ahmed F AbouElhamayed, Jordan Dotzel, Zhiru Zhang, Alexander M Rush, Safeen Huda, Mohamed S Abdelfattah

Figure 1 for ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models

Figure 2 for ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models

Figure 3 for ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models

Figure 4 for ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models

Abstract:The high power consumption and latency-sensitive deployments of large language models (LLMs) have motivated techniques like quantization and sparsity. Contextual sparsity, where the sparsity pattern is input-dependent, is crucial in LLMs because the permanent removal of attention heads or neurons from LLMs can significantly degrade accuracy. Prior work has attempted to model contextual sparsity using neural networks trained to predict activation magnitudes, which can be used to dynamically prune structures with low predicted activation magnitude. In this paper, we look beyond magnitude-based pruning criteria to assess attention head and neuron importance in LLMs. We developed a novel predictor called ShadowLLM, which can shadow the LLM behavior and enforce better sparsity patterns, resulting in over 15% improvement in end-to-end accuracy without increasing latency compared to previous methods. ShadowLLM achieves up to a 20\% speed-up over the state-of-the-art DejaVu framework. These enhancements are validated on models with up to 30 billion parameters. Our code is available at \href{https://github.com/abdelfattah-lab/shadow_llm/}{ShadowLLM}.

Via

Access Paper or Ask Questions

MambaByte: Token-free Selective State Space Model

Jan 24, 2024

Junxiong Wang, Tushaar Gangavarapu, Jing Nathan Yan, Alexander M Rush

Figure 1 for MambaByte: Token-free Selective State Space Model

Figure 2 for MambaByte: Token-free Selective State Space Model

Figure 3 for MambaByte: Token-free Selective State Space Model

Figure 4 for MambaByte: Token-free Selective State Space Model

Abstract:Token-free language models learn directly from raw bytes and remove the bias of subword tokenization. Operating on bytes, however, results in significantly longer sequences, and standard autoregressive Transformers scale poorly in such settings. We experiment with MambaByte, a token-free adaptation of the Mamba state space model, trained autoregressively on byte sequences. Our experiments indicate the computational efficiency of MambaByte compared to other byte-level models. We also find MambaByte to be competitive with and even outperform state-of-the-art subword Transformers. Furthermore, owing to linear scaling in length, MambaByte benefits from fast inference compared to Transformers. Our findings establish the viability of MambaByte in enabling token-free language modeling.

Via

Access Paper or Ask Questions