
Sehoon Kim

Squeezed Attention: Accelerating Long Context Length LLM Inference

Nov 14, 2024

Efficient and Scalable Estimation of Tool Representations in Vector Space

Sep 02, 2024

TinyAgent: Function Calling at the Edge

Sep 01, 2024

Characterizing Prompt Compression Methods for Long Context Inference

Jul 11, 2024

LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement

Mar 22, 2024

AI and Memory Wall

Mar 21, 2024

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

Feb 07, 2024

Learned Best-Effort LLM Serving

Jan 15, 2024

An LLM Compiler for Parallel Function Calling

Dec 07, 2023

SPEED: Speculative Pipelined Execution for Efficient Decoding

Oct 18, 2023