Guangxuan Xiao

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Oct 14, 2024

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Jun 16, 2024

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

May 07, 2024

Retrieval Head Mechanistically Explains Long-Context Factuality

Apr 24, 2024

BitDelta: Your Fine-Tune May Only Be Worth One Bit

Feb 28, 2024

InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory

Feb 07, 2024

Efficient Streaming Language Models with Attention Sinks

Sep 29, 2023

FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention

May 21, 2023

Sparse and Local Networks for Hypergraph Reasoning

Mar 09, 2023

Offsite-Tuning: Transfer Learning without Full Model

Feb 09, 2023