Luis Ceze

University of Washington

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

Jan 02, 2025

Palu: Compressing KV-Cache with Low-Rank Projection

Jul 30, 2024

Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Nov 07, 2023

Punica: Multi-Tenant LoRA Serving

Oct 28, 2023

SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning

Jul 11, 2022

Characterizing and Taming Resolution in Convolutional Neural Networks

Oct 28, 2021

Cloud Collectives: Towards Cloud-aware Collectives for ML Workloads with Rank Reordering

May 28, 2021

Accelerating SpMM Kernel with Cache-First Edge Sampling for Graph Neural Networks

Apr 23, 2021

Automated Backend-Aware Post-Training Quantization

Mar 27, 2021

Learning to Optimize Tensor Programs

Oct 27, 2018