Jiaming Tang

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Oct 14, 2024
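As the title indicates, DuoAttention splits attention heads into retrieval heads, which keep the full KV cache, and streaming heads, which attend only to a few initial "attention sink" tokens plus a recent window. Below is a minimal sketch of that head split; the boolean head classification, the sink/window sizes, and the function name are illustrative assumptions, not the paper's implementation (which learns the classification and actually evicts KV entries for streaming heads rather than merely masking them, which is where the memory savings come from).

```python
import torch

def duo_style_attention(q, k, v, is_retrieval_head, sink=4, window=64):
    """Sketch: retrieval heads see the full causal context; streaming heads
    see only `sink` initial tokens plus the last `window` positions.
    q, k, v: [num_heads, seq_len, head_dim];
    is_retrieval_head: bool tensor [num_heads] (assumed given here)."""
    num_heads, seq_len, head_dim = q.shape
    scores = torch.einsum("hqd,hkd->hqk", q, k) / head_dim ** 0.5

    pos_q = torch.arange(seq_len).unsqueeze(1)  # query positions
    pos_k = torch.arange(seq_len).unsqueeze(0)  # key positions
    causal = pos_k <= pos_q
    streaming = causal & ((pos_k < sink) | (pos_q - pos_k < window))

    # Pick the visibility pattern per head, then do standard attention.
    mask = torch.where(is_retrieval_head.view(-1, 1, 1), causal, streaming)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```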

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Jun 16, 2024
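Quest's title points at query-aware sparsity: rather than attending over the whole KV cache, each query estimates which cache pages could matter and computes attention only there. The sketch below shows the page-selection criterion as I understand it, using per-page min/max key summaries to upper-bound the attention score; the function name and the default page_size/top_k values are illustrative assumptions, not the paper's settings.

```python
import torch

def quest_style_page_selection(q, keys, page_size=16, top_k=4):
    """Sketch of query-aware page selection. q: [head_dim] query vector;
    keys: [seq_len, head_dim] cached keys. Returns indices of the pages
    with the highest score upper bound; attention would then run only
    over keys in those pages."""
    num_pages = keys.shape[0] // page_size  # assume seq_len % page_size == 0
    pages = keys[: num_pages * page_size].view(num_pages, page_size, -1)

    # Per-page, per-channel key summaries.
    kmin = pages.min(dim=1).values  # [num_pages, head_dim]
    kmax = pages.max(dim=1).values

    # Channel i of q·k is at most max(q_i * kmin_i, q_i * kmax_i), so the
    # sum over channels upper-bounds any attention score inside the page.
    bounds = torch.maximum(q * kmin, q * kmax).sum(dim=-1)
    return bounds.topk(min(top_k, num_pages)).indices
```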

DCRMTA: Unbiased Causal Representation for Multi-touch Attribution

Feb 05, 2024

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Jun 01, 2023
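AWQ's central observation is that a small fraction of weight channels, those fed by large-magnitude activations, dominate quantization error, and that scaling those channels up before quantization protects them. Below is a hedged, standalone sketch of that activation-aware scale search; fake_quantize, the grid size, and the scale normalization are simplifications of the published method, and in practice the inverse scale is folded into the preceding operator rather than applied at runtime.

```python
import torch

def fake_quantize(w, n_bits=4):
    # Per-output-channel asymmetric round-to-nearest quantization.
    wmax = w.amax(dim=1, keepdim=True)
    wmin = w.amin(dim=1, keepdim=True)
    step = (wmax - wmin).clamp(min=1e-5) / (2 ** n_bits - 1)
    return ((w - wmin) / step).round() * step + wmin

def awq_style_scale_search(w, x, n_bits=4, grid=20):
    """Sketch: search a per-input-channel scale s = mean(|x|)**alpha that
    minimizes output error after quantization. w: [out_features,
    in_features] weights; x: [tokens, in_features] calibration activations."""
    act_mag = x.abs().mean(dim=0)   # per-input-channel activation magnitude
    ref = x @ w.t()                 # full-precision reference output
    best_err, best_scale = float("inf"), torch.ones_like(act_mag)
    for i in range(grid + 1):
        alpha = i / grid
        s = act_mag.clamp(min=1e-5) ** alpha
        s = s / (s.max() * s.min()).sqrt()  # keep scales in a sane range
        wq = fake_quantize(w * s, n_bits) / s
        err = (x @ wq.t() - ref).pow(2).mean()
        if err < best_err:
            best_err, best_scale = err, s
    return best_scale
```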