Zhihao Jia

SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference

Nov 07, 2024

MagicPIG: LSH Sampling for Efficient LLM Generation

Oct 21, 2024

TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention

Oct 07, 2024

GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism

Jun 24, 2024

SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices

Jun 04, 2024

Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs

Jun 03, 2024

A Multi-Level Superoptimizer for Tensor Programs

May 09, 2024

FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning

Feb 29, 2024

Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding

Feb 29, 2024

Accelerating Retrieval-Augmented Language Model Serving with Speculation

Jan 25, 2024