Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Boris Nazarov

Rethinking Data: Towards Better Performing Domain-Specific Small Language Models

Mar 03, 2025

Boris Nazarov, Darya Frolova, Yackov Lubarsky, Alexei Gaissinski, Pavel Kisilev

Abstract:Fine-tuning of Large Language Models (LLMs) for downstream tasks, performed on domain-specific data has shown significant promise. However, commercial use of such LLMs is limited by the high computational cost required for their deployment at scale. On the other hand, small Language Models (LMs) are much more cost effective but have subpar performance in a similar setup. This paper presents our approach to finetuning a small LM, that reaches high accuracy in multiple choice question answering task. We achieve this by improving data quality at each stage of the LM training pipeline. In particular, we start with data structuring resulting in extraction of compact, semantically meaningful text chunks used by a retriever. This allows more efficient knowledge digestion by the LM. Further, we improve the retrieved context by training a lightweight Chunk Re-Ranker (CRR) that generates more accurate relative relevance chunk scores. Finally, we improve the model generalization ability by merging the models fine-tuned with different parameters on different data subsets. We present detailed procedure descriptions, and corresponding experimental findings that show the improvements of each one of the proposed techniques.

* IEEE Global Communications Conference, GLOBECOM 2024 Workshop IMMLLM6G

Via

Access Paper or Ask Questions

Attention Condensation via Sparsity Induced Regularized Training

Mar 03, 2025

Eli Sason, Darya Frolova, Boris Nazarov, Felix Goldberd

Abstract:As the context window expands, self-attention increasingly dominates the transformer's inference time. Therefore, accelerating attention computation while minimizing performance degradation is essential for the efficient deployment of Large Language Models (LLMs). In this study we extend a theoretical framework of attention sparsity in LLMs. A customized loss function is designed to enforce the sparsity by restricting the number of top elements in the attention matrix. We perform an initial set of evaluations with GPT-2 to show the effectiveness of our sparsification approach. The attention matrices of the models trained with the proposed loss are both sparse and effective in capturing relevant input dependencies. We now continue working to demonstrate the value of our approach on larger models and different architectures.

Via

Access Paper or Ask Questions