Sam Havens

FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents

Apr 17, 2025

Nandan Thakur, Jimmy Lin, Sam Havens, Michael Carbin, Omar Khattab, Andrew Drozdov

Abstract: We introduce FreshStack, a reusable framework for automatically building information retrieval (IR) evaluation benchmarks from community-asked questions and answers. FreshStack conducts the following steps: (1) automatic corpus collection from code and technical documentation, (2) nugget generation from community-asked questions and answers, and (3) nugget-level support, retrieving documents using a fusion of retrieval techniques and hybrid architectures. We use FreshStack to build five datasets on fast-growing, recent, and niche topics to ensure the tasks are sufficiently challenging. On FreshStack, existing retrieval models, when applied out-of-the-box, significantly underperform oracle approaches on all five topics, denoting plenty of headroom to improve IR quality. In addition, we identify cases where rerankers do not clearly improve first-stage retrieval accuracy (two out of five topics). We hope that FreshStack will facilitate future work toward constructing realistic, scalable, and uncontaminated IR and RAG evaluation benchmarks. FreshStack datasets are available at: https://fresh-stack.github.io.
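
The abstract mentions retrieving documents with "a fusion of retrieval techniques" but does not say which fusion rule is used. As a rough illustration only, the sketch below uses reciprocal rank fusion (RRF), a common way to merge ranked lists. The function name, the constant `k=60`, and the document ids are assumptions made for this example and are not taken from FreshStack.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) per list it appears in;
    k=60 is a conventional default, assumed here for illustration.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical runs from a lexical and a dense retriever.
bm25_run = ["doc_a", "doc_b", "doc_c"]
dense_run = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_run, dense_run])
print(fused)
```

Documents ranked highly by both retrievers rise to the top, which is the usual motivation for fusing lexical and dense runs.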

Long Context RAG Performance of Large Language Models

Nov 05, 2024

Quinn Leng, Jacob Portes, Sam Havens, Matei Zaharia, Michael Carbin

Abstract: Retrieval Augmented Generation (RAG) has emerged as a crucial technique for enhancing the accuracy of Large Language Models (LLMs) by incorporating external information. With the advent of LLMs that support increasingly longer context lengths, there is a growing interest in understanding how these models perform in RAG scenarios. Can these new long-context models improve RAG performance? This paper presents a comprehensive study of the impact of increased context length on RAG performance across 20 popular open-source and commercial LLMs. We ran RAG workflows while varying the total context length from 2,000 to 128,000 tokens (and 2 million tokens when possible) on three domain-specific datasets, and report key insights on the benefits and limitations of long context in RAG applications. Our findings reveal that while retrieving more documents can improve performance, only a handful of the most recent state-of-the-art LLMs can maintain consistent accuracy at long context above 64k tokens. We also identify distinct failure modes in long-context scenarios, suggesting areas for future research.

* 2024 NeurIPS workshop on Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning
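
Varying the total context length in a RAG workflow amounts to deciding how many retrieved chunks fit in a given token budget. The sketch below shows one simple way to do that. It is not the paper's pipeline: `pack_context` is a hypothetical helper, and whitespace splitting is only a stand-in for a real tokenizer.

```python
def pack_context(chunks, budget, count_tokens=lambda text: len(text.split())):
    """Greedily keep retrieved chunks, in rank order, until the budget is hit.

    `count_tokens` defaults to whitespace splitting, an assumption made
    so the example stays self-contained; swap in a model tokenizer in practice.
    """
    packed, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # stop at the first chunk that would overflow the budget
        packed.append(chunk)
        used += cost
    return packed, used


# Hypothetical retrieved chunks, best match first.
retrieved = ["a b c", "d e", "f g h i"]
small, small_used = pack_context(retrieved, budget=4)
large, large_used = pack_context(retrieved, budget=5)
print(small, small_used)
print(large, large_used)
```

Running the same question with increasing budgets gives the kind of context-length sweep the abstract describes.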

LoRA Learns Less and Forgets Less

May 15, 2024

Dan Biderman, Jose Gonzalez Ortiz, Jacob Portes, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle (+2 more)

Abstract: Low-Rank Adaptation (LoRA) is a widely used parameter-efficient finetuning method for large language models. LoRA saves memory by training only low-rank perturbations to selected weight matrices. In this work, we compare the performance of LoRA and full finetuning on two target domains, programming and mathematics. We consider both the instruction finetuning ($\approx$100K prompt-response pairs) and continued pretraining ($\approx$10B unstructured tokens) data regimes. Our results show that, in most settings, LoRA substantially underperforms full finetuning. Nevertheless, LoRA exhibits a desirable form of regularization: it better maintains the base model's performance on tasks outside the target domain. We show that LoRA provides stronger regularization compared to common techniques such as weight decay and dropout; it also helps maintain more diverse generations. We show that full finetuning learns perturbations with a rank that is 10-100X greater than typical LoRA configurations, possibly explaining some of the reported gaps. We conclude by proposing best practices for finetuning with LoRA.
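
To make "training only low-rank perturbations" concrete, the sketch below computes the effective weight W + (alpha / r) · B · A used in the standard LoRA formulation, and compares trainable parameter counts. It uses plain Python lists to stay self-contained. The helper names and the 4096 × 4096, rank-16 configuration are illustrative assumptions, not settings reported in the abstract.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, b, a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A.

    W stays frozen during finetuning; only B (d_out x r) and A (r x d_in)
    would be trained.
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_trainable_params(d_out, d_in, r):
    """Parameters in B and A together."""
    return r * (d_out + d_in)


# Tiny rank-1 example on a 2 x 2 identity weight.
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]   # 2 x 1
a = [[3.0, 4.0]]     # 1 x 2
w_eff = lora_effective_weight(w, b, a, alpha=1.0)
print(w_eff)

# Illustrative sizes: a full 4096 x 4096 matrix versus a rank-16 adapter.
full_params = 4096 * 4096
adapter_params = lora_trainable_params(4096, 4096, r=16)
print(full_params // adapter_params)
```

The update B · A can never exceed rank r, which is the constraint the abstract contrasts with the much higher-rank perturbations learned by full finetuning.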

MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining

Jan 16, 2024

Jacob Portes, Alex Trott, Sam Havens, Daniel King, Abhinav Venigalla, Moin Nadeem, Nikhil Sardana, Daya Khudia, Jonathan Frankle

Abstract: Although BERT-style encoder models are heavily used in NLP research, many researchers do not pretrain their own BERTs from scratch due to the high cost of training. In the past half-decade since BERT first rose to prominence, many advances have been made with other transformer architectures and training configurations that have yet to be systematically incorporated into BERT. Here, we introduce MosaicBERT, a BERT-style encoder architecture and training recipe that is empirically optimized for fast pretraining. This efficient architecture incorporates FlashAttention, Attention with Linear Biases (ALiBi), Gated Linear Units (GLU), a module to dynamically remove padded tokens, and low-precision LayerNorm into the classic transformer encoder block. The training recipe includes a 30% masking ratio for the Masked Language Modeling (MLM) objective, bfloat16 precision, and vocabulary size optimized for GPU throughput, in addition to best practices from RoBERTa and other encoder models. When pretrained from scratch on the C4 dataset, this base model achieves a downstream average GLUE (dev) score of 79.6 in 1.13 hours on 8 A100 80 GB GPUs at a cost of roughly $20. We plot extensive accuracy vs. pretraining speed Pareto curves and show that MosaicBERT base and large are consistently Pareto optimal when compared to a competitive BERT base and large. This empirical speedup in pretraining enables researchers and engineers to pretrain custom BERT-style models at low cost instead of finetuning existing generic models. We open source our model weights and code.

* NeurIPS 2023
* 10 pages, 4 figures in main text. 25 pages total
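
Of the components listed in the abstract, ALiBi is easy to sketch: instead of position embeddings, each attention head adds a distance-proportional penalty to its attention scores. The code below is a minimal illustration, not MosaicBERT's implementation. It assumes a power-of-two head count for the slope schedule and a symmetric |i - j| distance, which is one natural choice for a bidirectional encoder. Both function names are hypothetical.

```python
def alibi_slopes(num_heads):
    """Geometric slope schedule 2^(-8 * (h + 1) / num_heads).

    Assumes num_heads is a power of two, as in the simplest ALiBi setup.
    """
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Bias added to one head's attention scores: -slope * |i - j|.

    The symmetric distance is an assumption for bidirectional attention.
    """
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


slopes = alibi_slopes(8)
bias = alibi_bias(3, slopes[0])
print(slopes[0], slopes[-1])
print(bias)
```

Tokens farther apart receive a larger negative bias, so each head prefers nearby context at a rate set by its slope.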

LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms

Nov 22, 2023

Aditi Jha, Sam Havens, Jeremy Dohmann, Alex Trott, Jacob Portes

Abstract: Large Language Models are traditionally finetuned on large instruction datasets. However, recent studies suggest that small, high-quality datasets can suffice for general-purpose instruction following. This lack of consensus surrounding finetuning best practices is in part due to rapidly diverging approaches to LLM evaluation. In this study, we ask whether a small number of diverse finetuning samples can improve performance on both traditional perplexity-based NLP benchmarks and open-ended, model-based evaluation. We finetune open-source MPT-7B and MPT-30B models on instruction finetuning datasets of various sizes ranging from 1k to 60k samples. We find that subsets of 1k-6k instruction finetuning samples are sufficient to achieve good performance on both (1) traditional NLP benchmarks and (2) model-based evaluation. Finally, we show that mixing textbook-style and open-ended QA finetuning datasets optimizes performance on both evaluation paradigms.

* 36 pages, 12 figures, NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following
