Zhen Zheng

MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design

Dec 19, 2024

BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

Nov 29, 2024

FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design

Jan 25, 2024

ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks

Dec 18, 2023

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Sep 19, 2023

Auto-Parallelizing Large Models with Rhino: A Systematic Approach on Production AI Platform

Feb 16, 2023

FusionStitching: Boosting Memory Intensive Computations for Deep Learning Workloads

Sep 23, 2020

Auto-MAP: A DQN Framework for Exploring Distributed Execution Plans for DNN Workloads

Jul 08, 2020