Milad Hashemi

SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?

Nov 11, 2025

Jeffrey Jian Ma, Milad Hashemi, Amir Yazdanbakhsh, Kevin Swersky, Ofir Press, Enhui Li, Vijay Janapa Reddi, Parthasarathy Ranganathan

Abstract: Optimizing the performance of large-scale software repositories demands expertise in code reasoning and software engineering (SWE) to reduce runtime while preserving program correctness. However, most benchmarks emphasize what to fix rather than how to fix code. We introduce SWE-fficiency, a benchmark for evaluating repository-level performance optimization on real workloads. Our suite contains 498 tasks across nine widely used data-science, machine-learning, and HPC repositories (e.g., numpy, pandas, scipy): given a complete codebase and a slow workload, an agent must investigate code semantics, localize bottlenecks and relevant tests, and produce a patch that matches or exceeds expert speedup while passing the same unit tests. To enable this how-to-fix evaluation, our automated pipeline scrapes GitHub pull requests for performance-improving edits, combining keyword filtering, static analysis, coverage tooling, and execution validation to both confirm expert speedup baselines and identify relevant repository unit tests. Empirical evaluation of state-of-the-art agents reveals significant underperformance. On average, agents achieve less than 0.15x the expert speedup: agents struggle to localize optimization opportunities, reason about execution across functions, and maintain correctness in proposed edits. We release the benchmark and accompanying data pipeline to facilitate research on automated performance engineering and long-horizon software reasoning.

* Data, code, and leaderboard are available at https://swefficiency.com/
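
The "fraction of expert speedup" figure quoted in the abstract can be illustrated with a small sketch. This is not the benchmark's scoring code: the function names and the example runtimes are hypothetical, and how the benchmark treats patches that fail tests or aggregates across tasks is not stated in the abstract, so neither is modeled here.

```python
def speedup(baseline_seconds: float, patched_seconds: float) -> float:
    """Speedup of a patched workload over the unpatched baseline."""
    if baseline_seconds <= 0 or patched_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_seconds / patched_seconds


def fraction_of_expert_speedup(
    baseline_seconds: float, agent_seconds: float, expert_seconds: float
) -> float:
    """Agent speedup divided by expert speedup on the same workload.

    A value of 1.0 means the agent's patch matches the expert's patch;
    values below 1.0 mean the agent recovered only part of the gain.
    Assumption: both patches already pass the task's unit tests.
    """
    agent = speedup(baseline_seconds, agent_seconds)
    expert = speedup(baseline_seconds, expert_seconds)
    return agent / expert


# Hypothetical runtimes: baseline 10 s, agent patch 8 s, expert patch 2 s.
ratio = fraction_of_expert_speedup(10.0, 8.0, 2.0)
print(ratio)
```

With these made-up numbers the agent's patch is 1.25x faster than the baseline while the expert's is 5x faster, so the agent reaches a quarter of the expert speedup.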

Learning Performance-Improving Code Edits

Feb 21, 2023

Aman Madaan, Alexander Shypula, Uri Alon, Milad Hashemi, Parthasarathy Ranganathan, Yiming Yang, Graham Neubig, Amir Yazdanbakhsh

Abstract: The waning of Moore's Law has shifted the focus of the tech industry towards alternative methods for continued performance gains. While optimizing compilers are a standard tool to help increase program efficiency, programmers continue to shoulder much responsibility in crafting and refactoring code with better performance characteristics. In this paper, we investigate the ability of large language models (LLMs) to suggest functionally correct, performance-improving code edits. We hypothesize that language models can suggest such edits in ways that would be impractical for static analysis alone. We investigate these questions by curating a large-scale dataset of Performance-Improving Edits, PIE. PIE contains trajectories of programs, where a programmer begins with an initial, slower version and iteratively makes changes to improve the program's performance. We use PIE to evaluate and improve the capacity of large language models. Specifically, we use examples from PIE to fine-tune multiple variants of CODEGEN, a billion-scale Transformer-decoder model. Additionally, we use examples from PIE to prompt OpenAI's CODEX using few-shot prompting. By leveraging PIE, we find that both CODEX and CODEGEN can generate performance-improving edits, with speedups of more than 2.5x for over 25% of the programs, for C++ and Python, even after the C++ programs were compiled using the O3 optimization level. Crucially, we show that PIE allows CODEGEN, an open-sourced and 10x smaller model than CODEX, to match the performance of CODEX on this challenging task. Overall, this work opens new doors for creating systems and methods that can help programmers write efficient code.

* Project website: https://pie4perf.com/. This version extends the related work and acknowledgements

Towards Better Out-of-Distribution Generalization of Neural Algorithmic Reasoning Tasks

Nov 01, 2022

Sadegh Mahdavi, Kevin Swersky, Thomas Kipf, Milad Hashemi, Christos Thrampoulidis, Renjie Liao

Abstract: In this paper, we study the OOD generalization of neural algorithmic reasoning tasks, where the goal is to learn an algorithm (e.g., sorting, breadth-first search, and depth-first search) from input-output pairs using deep neural networks. First, we argue that OOD generalization in this setting is significantly different from common OOD settings. For example, some phenomena in OOD generalization of image classification such as "accuracy on the line" are not observed here, and techniques such as data augmentation methods do not help as assumptions underlying many augmentation techniques are often violated. Second, we analyze the main challenges (e.g., input distribution shift, non-representative data generation, and uninformative validation metrics) of the current leading benchmark, i.e., CLRS, which contains 30 algorithmic reasoning tasks. We propose several solutions, including a simple-yet-effective fix to the input distribution shift and improved data generation. Finally, we propose an attention-based 2WL-graph neural network (GNN) processor which complements message-passing GNNs so their combination outperforms the state-of-the-art model by a 3% margin averaged over all algorithms. Our code is available at: https://github.com/smahdavi4/clrs.

CUF: Continuous Upsampling Filters

Oct 20, 2022

Cristina Vasconcelos, Cengiz Oztireli, Mark Matthews, Milad Hashemi, Kevin Swersky, Andrea Tagliasacchi

Abstract: Neural fields have rapidly been adopted for representing 3D signals, but their application to more classical 2D image processing has been relatively limited. In this paper, we consider one of the most important operations in image processing: upsampling. In deep learning, learnable upsampling layers have been used extensively for single image super-resolution. We propose to parameterize upsampling kernels as neural fields. This parameterization leads to a compact architecture that obtains a 40-fold reduction in the number of parameters when compared with competing arbitrary-scale super-resolution architectures. When upsampling images of size 256x256 we show that our architecture is 2x-10x more efficient than competing arbitrary-scale super-resolution architectures, and more efficient than sub-pixel convolutions when instantiated to a single-scale model. In the general setting, these gains grow polynomially with the square of the target scale. We validate our method on standard benchmarks showing such efficiency gains can be achieved without sacrifices in super-resolution performance.

Learning to Improve Code Efficiency

Aug 09, 2022

Binghong Chen, Daniel Tarlow, Kevin Swersky, Martin Maas, Pablo Heiber, Ashish Naik, Milad Hashemi, Parthasarathy Ranganathan

Abstract: Improvements in the performance of computing systems, driven by Moore's Law, have transformed society. As such hardware-driven gains slow down, it becomes even more important for software developers to focus on performance and efficiency during development. While several studies have demonstrated the potential of such improved code efficiency (e.g., 2x better generational improvements compared to hardware), unlocking these gains in practice has been challenging. Reasoning about algorithmic complexity and the interaction of coding patterns on hardware can be difficult for the average programmer, especially when combined with pragmatic constraints around development velocity and multi-person development. This paper seeks to address this problem. We analyze a large competitive programming dataset from the Google Code Jam competition and find that efficient code is indeed rare, with a 2x runtime difference between the median and the 90th percentile of solutions. We propose using machine learning to automatically provide prescriptive feedback in the form of hints, to guide programmers towards writing high-performance code. To automatically learn these hints from the dataset, we propose a novel discrete variational auto-encoder, where each discrete latent variable represents a different learned category of code-edit that increases performance. We show that this method represents the multi-modal space of code efficiency edits better than a sequence-to-sequence baseline and generates a distribution of more efficient solutions.

Data-Driven Offline Optimization For Architecting Hardware Accelerators

Oct 20, 2021

Aviral Kumar, Amir Yazdanbakhsh, Milad Hashemi, Kevin Swersky, Sergey Levine

Abstract: Industry has gradually moved towards application-specific hardware accelerators in order to attain higher efficiency. While such a paradigm shift is already starting to show promising results, designers need to spend considerable manual effort and perform a large number of time-consuming simulations to find accelerators that can accelerate multiple target applications while obeying design constraints. Moreover, such a "simulation-driven" approach must be re-run from scratch every time the set of target applications or design constraints changes. An alternative paradigm is to use a "data-driven", offline approach that utilizes logged simulation data to architect hardware accelerators without needing any form of simulation. Such an approach not only alleviates the need to run time-consuming simulations, but also enables data reuse and applies even when the set of target applications changes. In this paper, we develop such a data-driven offline optimization method for designing hardware accelerators, dubbed PRIME, that enjoys all of these properties. Our approach learns a conservative, robust estimate of the desired cost function, utilizes infeasible points, and optimizes the design against this estimate without any additional simulator queries during optimization. PRIME architects accelerators -- tailored towards both single and multiple applications -- improving performance upon state-of-the-art simulation-driven methods by about 1.54x and 1.20x, while considerably reducing the required total simulation time by 93% and 99%, respectively. In addition, PRIME also architects effective accelerators for unseen applications in a zero-shot setting, outperforming simulation-based methods by 1.26x.

* First two authors contributed equally

Two Sides of the Same Coin: Heterophily and Oversmoothing in Graph Convolutional Neural Networks

Feb 24, 2021

Yujun Yan, Milad Hashemi, Kevin Swersky, Yaoqing Yang, Danai Koutra

Abstract: Most graph neural networks (GNNs) perform poorly on graphs where neighbors typically have different features/classes (heterophily) and when stacking multiple layers (oversmoothing). These two seemingly unrelated problems have been studied independently, but there is recent empirical evidence that solving one problem may benefit the other. In this work, going beyond empirical observations, we theoretically characterize the connections between heterophily and oversmoothing, both of which lead to indistinguishable node representations. By modeling the change in node representations during message propagation, we theoretically analyze the factors (e.g., degree, heterophily level) that make the representations of nodes from different classes indistinguishable. Our analysis highlights that (1) nodes with high heterophily and nodes with low heterophily and low degrees relative to their neighbors (degree discrepancy) trigger the oversmoothing problem, and (2) allowing "negative" messages between neighbors can decouple the heterophily and oversmoothing problems. Based on our insights, we design a model that addresses the discrepancy in features and degrees between neighbors by incorporating signed messages and learned degree corrections. Our experiments on 9 real networks show that our model achieves state-of-the-art performance under heterophily, and performs comparably to existing GNNs under low heterophily (homophily). It also effectively addresses oversmoothing and even benefits from multiple layers.

* 8 pages main paper + 7 pages appendix
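
The "heterophily level" mentioned in the abstract can be made concrete with a small sketch. One common way to quantify it is the edge homophily ratio, the fraction of edges whose endpoints share a class label; this is an assumption for illustration and may differ from the measure the paper uses. The function name and the toy graph are hypothetical.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints carry the same class label.

    Values near 1 indicate homophily; values near 0 indicate heterophily.
    """
    if not edges:
        raise ValueError("graph has no edges")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


# Toy 4-cycle: nodes 0 and 1 are class "a", nodes 2 and 3 are class "b".
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
labels = {0: "a", 1: "a", 2: "b", 3: "b"}

# Edges (0, 1) and (2, 3) join same-class nodes; the other two do not.
print(edge_homophily(edges, labels))
```

On this toy graph half of the edges connect nodes of different classes, so the graph sits midway between the homophilous and heterophilous regimes.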

Oops I Took A Gradient: Scalable Sampling for Discrete Distributions

Feb 08, 2021

Will Grathwohl, Kevin Swersky, Milad Hashemi, David Duvenaud, Chris J. Maddison

Abstract: We propose a general and scalable approximate sampling strategy for probabilistic models with discrete variables. Our approach uses gradients of the likelihood function with respect to its discrete inputs to propose updates in a Metropolis-Hastings sampler. We show empirically that this approach outperforms generic samplers in a number of difficult settings including Ising models, Potts models, restricted Boltzmann machines, and factorial hidden Markov models. We also demonstrate the use of our improved sampler for training deep energy-based models on high dimensional discrete data. This approach outperforms variational auto-encoders and existing energy-based models. Finally, we give bounds showing that our approach is near-optimal in the class of samplers which propose local updates.

* Energy-Based Models, Deep generative models, MCMC sampling
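
The idea in the abstract, using gradients with respect to the discrete inputs to propose Metropolis-Hastings updates, can be sketched on a toy binary model. This is a minimal illustration and not the authors' implementation: the quadratic log-probability, the function names, and the choice of a softmax over single-bit flips with logits halved are all assumptions made for the example.

```python
import math
import random


def log_prob(x, W, b):
    """Unnormalized log-probability f(x) = x^T W x + b^T x for binary x."""
    n = len(x)
    quad = sum(x[i] * W[i][j] * x[j] for i in range(n) for j in range(n))
    return quad + sum(b[i] * x[i] for i in range(n))


def grad_log_prob(x, W, b):
    """Gradient of f, treating the binary vector x as if it were continuous."""
    n = len(x)
    return [sum((W[i][j] + W[j][i]) * x[j] for j in range(n)) + b[i]
            for i in range(n)]


def flip_distribution(x, W, b):
    """Proposal over which bit to flip, from a first-order estimate of f's change."""
    g = grad_log_prob(x, W, b)
    # Flipping bit i changes x_i by (1 - 2 * x_i), so the estimated change in f
    # is (1 - 2 * x_i) * g_i. Halving the logits is an assumed tempering choice.
    logits = [(1 - 2 * xi) * gi / 2.0 for xi, gi in zip(x, g)]
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def mh_step(x, W, b, rng):
    """One Metropolis-Hastings step with the gradient-informed flip proposal."""
    q_forward = flip_distribution(x, W, b)
    i = rng.choices(range(len(x)), weights=q_forward)[0]
    x_new = list(x)
    x_new[i] = 1 - x_new[i]
    q_reverse = flip_distribution(x_new, W, b)
    log_accept = (log_prob(x_new, W, b) - log_prob(x, W, b)
                  + math.log(q_reverse[i]) - math.log(q_forward[i]))
    if rng.random() < math.exp(min(0.0, log_accept)):
        return x_new, True
    return list(x), False


# Hypothetical 3-variable model with symmetric, zero-diagonal couplings.
W = [[0.0, 0.5, 0.0], [0.5, 0.0, -0.3], [0.0, -0.3, 0.0]]
b = [0.1, -0.2, 0.3]
rng = random.Random(0)
state = [0, 1, 0]
for _ in range(100):
    state, accepted = mh_step(state, W, b, rng)
print(state)
```

The acceptance test includes the ratio of reverse to forward proposal probabilities, which is what keeps the chain targeting the model's distribution even though the proposal is biased towards promising flips.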

Apollo: Transferable Architecture Exploration

Feb 02, 2021

Amir Yazdanbakhsh, Christof Angermueller, Berkin Akin, Yanqi Zhou, Albin Jones, Milad Hashemi, Kevin Swersky, Satrajit Chatterjee, Ravi Narayanaswami, James Laudon

Abstract: The looming end of Moore's Law and ascending use of deep learning drive the design of custom accelerators that are optimized for specific neural architectures. Architecture exploration for such accelerators forms a challenging constrained optimization problem over a complex, high-dimensional, and structured input space with a costly-to-evaluate objective function. Existing approaches for accelerator design are sample-inefficient and do not transfer knowledge between related optimization tasks with different design constraints, such as area and/or latency budget, or neural architecture configurations. In this work, we propose a transferable architecture exploration framework, dubbed Apollo, that leverages recent advances in black-box function optimization for sample-efficient accelerator design. We use this framework to optimize accelerator configurations of a diverse set of neural architectures with alternative design constraints. We show that our framework finds high-reward design configurations (up to 24.6% speedup) more sample-efficiently than a baseline black-box optimization approach. We further show that by transferring knowledge between target architectures with different design constraints, Apollo is able to find optimal configurations faster and often with better objective value (up to 25% improvements). This encouraging outcome portrays a promising path forward to facilitate generating higher quality accelerators.

* 10 pages, 5 figures, Accepted to Workshop on ML for Systems at the 34th Conference on Neural Information Processing Systems (NeurIPS 2020)

No MCMC for me: Amortized sampling for fast and stable training of energy-based models

Oct 14, 2020

Will Grathwohl, Jacob Kelly, Milad Hashemi, Mohammad Norouzi, Kevin Swersky, David Duvenaud

Abstract: Energy-Based Models (EBMs) present a flexible and appealing way to represent uncertainty. Despite recent advances, training EBMs on high-dimensional data remains a challenging problem as the state-of-the-art approaches are costly, unstable, and require considerable tuning and domain expertise to apply successfully. In this work, we present a simple method for training EBMs at scale which uses an entropy-regularized generator to amortize the MCMC sampling typically used in EBM training. We improve upon prior MCMC-based entropy regularization methods with a fast variational approximation. We demonstrate the effectiveness of our approach by using it to train tractable likelihood models. Next, we apply our estimator to the recently proposed Joint Energy Model (JEM), where we match the original performance with faster and more stable training. This allows us to extend JEM models to semi-supervised classification on tabular data from a variety of continuous domains.
