
Anshumali Shrivastava

Rice University

Scout Before You Attend: Sketch-and-Walk Sparse Attention for Efficient LLM Inference

Feb 07, 2026

SOCKET: SOft Collision Kernel EsTimator for Sparse Attention

Feb 06, 2026

70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float

Apr 15, 2025

I3S: Importance Sampling Subspace Selection for Low-Rank Optimization in LLM Pretraining

Feb 09, 2025

SpaLLM: Unified Compressive Adaptation of Large Language Models with Sketching

Oct 08, 2024

LeanQuant: Accurate Large Language Model Quantization with Loss-Error-Aware Grid

Jul 14, 2024

IDentity with Locality: An ideal hash for gene sequence search

Jun 21, 2024

KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization

May 07, 2024

NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention

Mar 02, 2024

Wisdom of Committee: Distilling from Foundation Model to Specialized Application Model

Feb 27, 2024