Title: Evaluating Large Language Models for Generalization and Robustness via Data Compression

Feb 04, 2024

Yucheng Li, Yunhao Guo, Frank Guerin, Chenghua Lin

Abstract: Existing methods for evaluating large language models face challenges such as data contamination, sensitivity to prompts, and the high cost of benchmark creation. To address these challenges, we propose an evaluation approach based on lossless data compression that tests how models' predictive abilities generalize after their training cutoff. Specifically, we collect comprehensive test data spanning 83 months from 2017 to 2023 and split the data into training and testing periods according to each model's training data cutoff. We measure: 1) the compression performance on the testing period as a measure of generalization on unseen data; and 2) the performance gap between the training and testing periods as a measure of robustness. Our experiments test 14 representative large language models of various sizes on sources including Wikipedia, news articles, code, arXiv papers, and multi-modal data. We find that the compression rate of many models degrades significantly after their cutoff date, but models such as Mistral and Llama-2 demonstrate a good balance between performance and robustness. Results also suggest that models struggle to generalize on news and code data, but work especially well on arXiv papers. We also find that context size and tokenization implementation have a big impact on overall compression performance.
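The two measurements described in the abstract can be illustrated with a small sketch. The abstract does not give the exact formulas, so this is an assumption-laden illustration rather than the authors' implementation: it assumes the compressed size is the ideal arithmetic-coding length (the negative sum of base-2 log-probabilities the model assigns to each token), that the compression rate is compressed size divided by raw size, and that the robustness gap is the difference between the mean rates of the two periods. The function names and the `YYYY-MM` month keys are hypothetical.

```python
import math
from statistics import mean


def compressed_bits(token_log_probs):
    """Ideal code length in bits, given natural-log token probabilities.

    Assumption: an arithmetic coder driven by the model approaches
    -sum(log2 p(token)) bits for the sequence.
    """
    return -sum(token_log_probs) / math.log(2)


def compression_rate(token_log_probs, raw_num_bytes):
    """Compressed size divided by raw size; lower means better compression."""
    return compressed_bits(token_log_probs) / (8 * raw_num_bytes)


def generalization_and_gap(monthly_rates, cutoff):
    """Split per-month rates at the model's cutoff and summarize them.

    monthly_rates: dict mapping 'YYYY-MM' to a compression rate (hypothetical layout).
    cutoff: last month, as 'YYYY-MM', covered by the model's training data.
    Returns (mean rate on the testing period, testing mean minus training mean).
    """
    # 'YYYY-MM' strings sort chronologically, so plain string comparison works.
    train = [rate for month, rate in monthly_rates.items() if month <= cutoff]
    test = [rate for month, rate in monthly_rates.items() if month > cutoff]
    test_rate = mean(test)
    return test_rate, test_rate - mean(train)
```

Under these assumptions, a lower testing-period rate indicates better generalization, and a gap close to zero indicates that performance holds up after the cutoff.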
