Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kunle Olukotun

Stanford University

FuseFlow: A Fusion-Centric Compilation Framework for Sparse Deep Learning on Streaming Dataflow

Nov 06, 2025

Rubens Lacouture, Nathan Zhang, Ritvik Sharma, Marco Siracusa, Fredrik Kjolstad, Kunle Olukotun, Olivia Hsu

Abstract:As deep learning models scale, sparse computation and specialized dataflow hardware have emerged as powerful solutions to address efficiency. We propose FuseFlow, a compiler that converts sparse machine learning models written in PyTorch to fused sparse dataflow graphs for reconfigurable dataflow architectures (RDAs). FuseFlow is the first compiler to support general cross-expression fusion of sparse operations. In addition to fusion across kernels (expressions), FuseFlow also supports optimizations like parallelization, dataflow ordering, and sparsity blocking. It targets a cycle-accurate dataflow simulator for microarchitectural analysis of fusion strategies. We use FuseFlow for design-space exploration across four real-world machine learning applications with sparsity, showing that full fusion (entire cross-expression fusion across all computation in an end-to-end model) is not always optimal for sparse models-fusion granularity depends on the model itself. FuseFlow also provides a heuristic to identify and prune suboptimal configurations. Using Fuseflow, we achieve performance improvements, including a ~2.7x speedup over an unfused baseline for GPT-3 with BigBird block-sparse attention.

Via

Access Paper or Ask Questions

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

Oct 06, 2025

Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li(+3 more)

Abstract:Large language model (LLM) applications such as agents and domain-specific reasoning increasingly rely on context adaptation -- modifying inputs with instructions, strategies, or evidence, rather than weight updates. Prior approaches improve usability but often suffer from brevity bias, which drops domain insights for concise summaries, and from context collapse, where iterative rewriting erodes details over time. Building on the adaptive memory introduced by Dynamic Cheatsheet, we introduce ACE (Agentic Context Engineering), a framework that treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation. ACE prevents collapse with structured, incremental updates that preserve detailed knowledge and scale with long-context models. Across agent and domain-specific benchmarks, ACE optimizes contexts both offline (e.g., system prompts) and online (e.g., agent memory), consistently outperforming strong baselines: +10.6% on agents and +8.6% on finance, while significantly reducing adaptation latency and rollout cost. Notably, ACE could adapt effectively without labeled supervision and instead by leveraging natural execution feedback. On the AppWorld leaderboard, ACE matches the top-ranked production-level agent on the overall average and surpasses it on the harder test-challenge split, despite using a smaller open-source model. These results show that comprehensive, evolving contexts enable scalable, efficient, and self-improving LLM systems with low overhead.

Via

Access Paper or Ask Questions

Cost-Efficient Serving of LLM Agents via Test-Time Plan Caching

Jun 17, 2025

Qizheng Zhang, Michael Wornow, Kunle Olukotun

Abstract:LLM-based agentic applications have shown increasingly remarkable capabilities in complex workflows but incur substantial costs due to extensive planning and reasoning requirements. Existing LLM caching techniques (like context caching and semantic caching), primarily designed for serving chatbots, are insufficient for agentic applications where outputs depend on external data or environmental contexts. We propose agentic plan caching, a novel approach that extracts, stores, adapts, and reuses structured plan templates from planning stages of agentic applications across semantically similar tasks to reduce the cost of serving. Unlike traditional semantic caching, our system extracts plan templates from completed agent executions at test-time, employs keyword extraction to match new requests against cached plans, and utilizes lightweight models to adapt these templates to task-specific plans with contexts. Evaluation across multiple real-world agentic applications shows that our system can reduce costs by 46.62% on average while maintaining performance, offering a more efficient solution for serving LLM-based agents that complements existing LLM serving infrastructures.

* 23 pages

Via

Access Paper or Ask Questions

LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits

Feb 12, 2025

Zikai Zhou, Qizheng Zhang, Hermann Kumbong, Kunle Olukotun

Figure 1 for LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits

Figure 2 for LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits

Figure 3 for LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits

Figure 4 for LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits

Abstract:Fine-tuning large language models (LLMs) is increasingly costly as models scale to hundreds of billions of parameters, and even parameter-efficient fine-tuning (PEFT) methods like LoRA remain resource-intensive. We introduce LowRA, the first framework to enable LoRA fine-tuning below 2 bits per parameter with minimal performance loss. LowRA optimizes fine-grained quantization - mapping, threshold selection, and precision assignment - while leveraging efficient CUDA kernels for scalable deployment. Extensive evaluations across 4 LLMs and 4 datasets show that LowRA achieves a superior performance-precision trade-off above 2 bits and remains accurate down to 1.15 bits, reducing memory usage by up to 50%. Our results highlight the potential of ultra-low-bit LoRA fine-tuning for resource-constrained environments.

Via

Access Paper or Ask Questions

Adaptive Self-improvement LLM Agentic System for ML Library Development

Feb 04, 2025

Genghan Zhang, Weixin Liang, Olivia Hsu, Kunle Olukotun

Abstract:ML libraries, often written in architecture-specific programming languages (ASPLs) that target domain-specific architectures, are key to efficient ML systems. However, writing these high-performance ML libraries is challenging because it requires expert knowledge of ML algorithms and the ASPL. Large language models (LLMs), on the other hand, have shown general coding capabilities. However, challenges remain when using LLMs for generating ML libraries using ASPLs because 1) this task is complicated even for experienced human programmers and 2) there are limited code examples because of the esoteric and evolving nature of ASPLs. Therefore, LLMs need complex reasoning with limited data in order to complete this task. To address these challenges, we introduce an adaptive self-improvement agentic system. In order to evaluate the effectiveness of our system, we construct a benchmark of a typical ML library and generate ASPL code with both open and closed-source LLMs on this benchmark. Our results show improvements of up to $3.9\times$ over a baseline single LLM.

Via

Access Paper or Ask Questions

SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts

May 13, 2024

Raghu Prabhakar, Ram Sivaramakrishnan, Darshan Gandhi, Yun Du, Mingran Wang, Xiangyu Song, Kejie Zhang, Tianren Gao, Angela Wang, Karen Li(+20 more)

Figure 1 for SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts

Figure 2 for SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts

Figure 3 for SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts

Figure 4 for SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts

Abstract:Monolithic large language models (LLMs) like GPT-4 have paved the way for modern generative AI applications. Training, serving, and maintaining monolithic LLMs at scale, however, remains prohibitively expensive and challenging. The disproportionate increase in compute-to-memory ratio of modern AI accelerators have created a memory wall, necessitating new methods to deploy AI. Composition of Experts (CoE) is an alternative modular approach that lowers the cost and complexity of training and serving. However, this approach presents two key challenges when using conventional hardware: (1) without fused operations, smaller models have lower operational intensity, which makes high utilization more challenging to achieve; and (2) hosting a large number of models can be either prohibitively expensive or slow when dynamically switching between them. In this paper, we describe how combining CoE, streaming dataflow, and a three-tier memory system scales the AI memory wall. We describe Samba-CoE, a CoE system with 150 experts and a trillion total parameters. We deploy Samba-CoE on the SambaNova SN40L Reconfigurable Dataflow Unit (RDU) - a commercial dataflow accelerator architecture that has been co-designed for enterprise inference and training applications. The chip introduces a new three-tier memory system with on-chip distributed SRAM, on-package HBM, and off-package DDR DRAM. A dedicated inter-RDU network enables scaling up and out over multiple sockets. We demonstrate speedups ranging from 2x to 13x on various benchmarks running on eight RDU sockets compared with an unfused baseline. We show that for CoE inference deployments, the 8-socket RDU Node reduces machine footprint by up to 19x, speeds up model switching time by 15x to 31x, and achieves an overall speedup of 3.7x over a DGX H100 and 6.6x over a DGX A100.

Via

Access Paper or Ask Questions

BaCO: A Fast and Portable Bayesian Compiler Optimization Framework

Dec 01, 2022

Erik Hellsten, Artur Souza, Johannes Lenfers, Rubens Lacouture, Olivia Hsu, Adel Ejjeh, Fredrik Kjolstad, Michel Steuwer, Kunle Olukotun, Luigi Nardi

Figure 1 for BaCO: A Fast and Portable Bayesian Compiler Optimization Framework

Figure 2 for BaCO: A Fast and Portable Bayesian Compiler Optimization Framework

Figure 3 for BaCO: A Fast and Portable Bayesian Compiler Optimization Framework

Figure 4 for BaCO: A Fast and Portable Bayesian Compiler Optimization Framework

Abstract:We introduce the Bayesian Compiler Optimization framework (BaCO), a general purpose autotuner for modern compilers targeting CPUs, GPUs, and FPGAs. BaCO provides the flexibility needed to handle the requirements of modern autotuning tasks. Particularly, it deals with permutation, ordered, and continuous parameter types along with both known and unknown parameter constraints. To reason about these parameter types and efficiently deliver high-quality code, BaCO uses Bayesian optimization algorithms specialized towards the autotuning domain. We demonstrate BaCO's effectiveness on three modern compiler systems: TACO, RISE & ELEVATE, and HPVM2FPGA for CPUs, GPUs, and FPGAs respectively. For these domains, BaCO outperforms current state-of-the-art autotuners by delivering on average 1.39x-1.89x faster code with a tiny search budget, and BaCO is able to reach expert-level performance 2.89x-8.77x faster.

Via

Access Paper or Ask Questions

Efficient Memory Partitioning in Software Defined Hardware

Feb 02, 2022

Matthew Feldman, Tian Zhao, Kunle Olukotun

Figure 1 for Efficient Memory Partitioning in Software Defined Hardware

Figure 2 for Efficient Memory Partitioning in Software Defined Hardware

Figure 3 for Efficient Memory Partitioning in Software Defined Hardware

Figure 4 for Efficient Memory Partitioning in Software Defined Hardware

Abstract:As programmers turn to software-defined hardware (SDH) to maintain a high level of productivity while programming hardware to run complex algorithms, heavy-lifting must be done by the compiler to automatically partition on-chip arrays. In this paper, we introduce an automatic memory partitioning system that can quickly compute more efficient partitioning schemes than prior systems. Our system employs a variety of resource-saving optimizations and an ML cost model to select the best partitioning scheme from an array of candidates. We compared our system against various state-of-the-art SDH compilers and FPGAs on a variety of benchmarks and found that our system generates solutions that, on average, consume 40.3% fewer logic resources, 78.3% fewer FFs, 54.9% fewer Block RAMs (BRAMs), and 100% fewer DSPs.

Via

Access Paper or Ask Questions

Prior-guided Bayesian Optimization

Jun 25, 2020

Artur Souza, Luigi Nardi, Leonardo B. Oliveira, Kunle Olukotun, Marius Lindauer, Frank Hutter

Figure 1 for Prior-guided Bayesian Optimization

Figure 2 for Prior-guided Bayesian Optimization

Figure 3 for Prior-guided Bayesian Optimization

Figure 4 for Prior-guided Bayesian Optimization

Abstract:While Bayesian Optimization (BO) is a very popular method for optimizing expensive black-box functions, it fails to leverage the experience of domain experts. This causes BO to waste function evaluations on commonly known bad regions of design choices, e.g., hyperparameters of a machine learning algorithm. To address this issue, we introduce Prior-guided Bayesian Optimization (PrBO). PrBO allows users to inject their knowledge into the optimization process in the form of priors about which parts of the input space will yield the best performance, rather than BO's standard priors over functions which are much less intuitive for users. PrBO then combines these priors with BO's standard probabilistic model to yield a posterior. We show that PrBO is more sample efficient than state-of-the-art methods without user priors and 10,000$\times$ faster than random search, on a common suite of benchmarks and a real-world hardware design application. We also show that PrBO converges faster even if the user priors are not entirely accurate and that it robustly recovers from misleading priors.

Via

Access Paper or Ask Questions

Taurus: An Intelligent Data Plane

Feb 12, 2020

Tushar Swamy, Alexander Rucker, Muhammad Shahbaz, Kunle Olukotun

Figure 1 for Taurus: An Intelligent Data Plane

Figure 2 for Taurus: An Intelligent Data Plane

Figure 3 for Taurus: An Intelligent Data Plane

Figure 4 for Taurus: An Intelligent Data Plane

Abstract:Emerging applications -- cloud computing, the internet of things, and augmented/virtual reality -- need responsive, available, secure, ubiquitous, and scalable datacenter networks. Network management currently uses simple, per-packet, data-plane heuristics (e.g., ECMP and sketches) under an intelligent, millisecond-latency control plane that runs data-driven performance and security policies. However, to meet users' quality-of-service expectations in a modern data center, networks must operate intelligently at line rate. In this paper, we present Taurus, an intelligent data plane capable of machine-learning inference at line rate. Taurus adds custom hardware based on a map-reduce abstraction to programmable network devices, such as switches and NICs; this new hardware uses pipelined and SIMD parallelism for fast inference. Our evaluation of a Taurus-enabled switch ASIC -- supporting several real-world benchmarks -- shows that Taurus operates three orders of magnitude faster than a server-based control plane, while increasing area by 24% and latency, on average, by 178 ns. On the long road to self-driving networks, Taurus is the equivalent of adaptive cruise control: deterministic rules steer flows, while machine learning tunes performance and heightens security.

* 12 pages, 13 figures, and 6 tables

Via

Access Paper or Ask Questions