Picture for Ramachandran Ramjee

Ramachandran Ramjee

Microsoft

POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference

Add code
Oct 23, 2024
Viaarxiv icon

ASTRA: Accurate and Scalable ANNS-based Training of Extreme Classifiers

Add code
Sep 30, 2024
Viaarxiv icon

Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations

Add code
Sep 25, 2024
Figure 1 for Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations
Figure 2 for Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations
Figure 3 for Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations
Figure 4 for Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations
Viaarxiv icon

Accuracy is Not All You Need

Add code
Jul 12, 2024
Figure 1 for Accuracy is Not All You Need
Figure 2 for Accuracy is Not All You Need
Figure 3 for Accuracy is Not All You Need
Figure 4 for Accuracy is Not All You Need
Viaarxiv icon

Metron: Holistic Performance Evaluation Framework for LLM Inference Systems

Add code
Jul 09, 2024
Figure 1 for Metron: Holistic Performance Evaluation Framework for LLM Inference Systems
Figure 2 for Metron: Holistic Performance Evaluation Framework for LLM Inference Systems
Figure 3 for Metron: Holistic Performance Evaluation Framework for LLM Inference Systems
Figure 4 for Metron: Holistic Performance Evaluation Framework for LLM Inference Systems
Viaarxiv icon

Vidur: A Large-Scale Simulation Framework For LLM Inference

Add code
May 08, 2024
Figure 1 for Vidur: A Large-Scale Simulation Framework For LLM Inference
Figure 2 for Vidur: A Large-Scale Simulation Framework For LLM Inference
Figure 3 for Vidur: A Large-Scale Simulation Framework For LLM Inference
Figure 4 for Vidur: A Large-Scale Simulation Framework For LLM Inference
Viaarxiv icon

vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

Add code
May 07, 2024
Viaarxiv icon

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

Add code
Mar 04, 2024
Figure 1 for Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Figure 2 for Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Figure 3 for Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Figure 4 for Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Viaarxiv icon

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

Add code
Aug 31, 2023
Figure 1 for SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
Figure 2 for SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
Figure 3 for SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
Figure 4 for SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
Viaarxiv icon

NGAME: Negative Mining-aware Mini-batching for Extreme Classification

Add code
Jul 10, 2022
Figure 1 for NGAME: Negative Mining-aware Mini-batching for Extreme Classification
Figure 2 for NGAME: Negative Mining-aware Mini-batching for Extreme Classification
Figure 3 for NGAME: Negative Mining-aware Mini-batching for Extreme Classification
Figure 4 for NGAME: Negative Mining-aware Mini-batching for Extreme Classification
Viaarxiv icon