Hui-Ling Zhen

PreMoe: Lightening MoEs on Constrained Memory by Expert Pruning and Retrieval
May 23, 2025

TrimR: Verifier-based Training-Free Thinking Compression for Efficient Test-Time Scaling
May 22, 2025

Harnessing On-Device Large Language Model: Empirical Results and Implications for AI PC
May 22, 2025

Harnessing Large Language Models Locally: Empirical Results and Implications for AI PC
May 21, 2025

Unlocking Efficient Long-to-Short LLM Reasoning with Model Merging
Mar 26, 2025

PASER: Post-Training Data Selection for Efficient Pruned Large Language Model Recovery
Feb 18, 2025

Certifying Language Model Robustness with Fuzzed Randomized Smoothing: An Efficient Defense Against Backdoor Attacks
Feb 09, 2025

KVTuner: Sensitivity-Aware Layer-wise Mixed Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference
Feb 06, 2025

CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference
Feb 06, 2025

MixPE: Quantization and Hardware Co-design for Efficient LLM Inference
Nov 25, 2024