Title: AERO: Softmax-Only LLMs for Efficient Private Inference

Oct 16, 2024

Nandan Kumar Jha, Brandon Reagen

Abstract: The pervasiveness of proprietary language models has raised privacy concerns for users' sensitive data, emphasizing the need for private inference (PI), where inference is performed directly on encrypted inputs. However, current PI methods face prohibitively high communication and latency overheads, primarily due to nonlinear operations. In this paper, we present a comprehensive analysis to understand the role of nonlinearities in transformer-based decoder-only language models. We introduce AERO, a four-step architectural optimization framework that refines the existing LLM architecture for efficient PI by systematically removing nonlinearities such as LayerNorm and GELU and reducing FLOP counts. For the first time, we propose a Softmax-only architecture with significantly fewer FLOPs tailored for efficient PI. Furthermore, we devise a novel entropy regularization technique to improve the performance of Softmax-only models. AERO achieves up to 4.23$\times$ communication and 1.94$\times$ latency reduction. We validate the effectiveness of AERO by benchmarking it against the state of the art.

* 35 pages, 21 figures, and 9 tables. arXiv admin note: text overlap with arXiv:2410.09637
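
The abstract does not spell out the form of the entropy regularizer, so the following is only a minimal sketch of the quantity such a technique would plausibly act on: the Shannon entropy of each softmax attention row. The function names (`softmax`, `attention_entropy`, `entropy_penalty`) and the squared-gap penalty with a `target` value are hypothetical choices for illustration, not the authors' method.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_entropy(scores):
    """Shannon entropy (in nats) of one softmax attention row."""
    probs = softmax(scores)
    # Skip exact zeros so that 0 * log(0) is treated as 0.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_penalty(rows, target):
    """Hypothetical penalty (an assumption, not from the paper):
    mean squared gap between each row's entropy and a target entropy."""
    gaps = [(attention_entropy(r) - target) ** 2 for r in rows]
    return sum(gaps) / len(gaps)

# Uniform scores give the maximum entropy log(n); a sharply peaked row gives ~0.
uniform_row = [0.0, 0.0, 0.0, 0.0]
peaked_row = [100.0, 0.0, 0.0, 0.0]
penalty = entropy_penalty([uniform_row, peaked_row], target=math.log(4) / 2)
```

In a training setup, a term like `penalty` would typically be scaled by a coefficient and added to the language-modeling loss; how AERO actually weights or thresholds entropy is described in the paper itself, not in this abstract.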
