Emma Strubell

Gradient Localization Improves Lifelong Pretraining of Language Models
Nov 07, 2024

Collage: Decomposable Rapid Prototyping for Information Extraction on Scientific PDFs
Oct 30, 2024

Scalable Data Ablation Approximations for Language Models through Modular Training and Merging
Oct 21, 2024

Stereotype or Personalization? User Identity Biases Chatbot Recommendations
Oct 08, 2024

What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions
May 22, 2024

Carbon Connect: An Ecosystem for Sustainable Computing
May 22, 2024

Source-Aware Training Enables Knowledge Attribution in Language Models
Apr 11, 2024

OLMo: Accelerating the Science of Language Models
Feb 07, 2024

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Jan 31, 2024

AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters
Jan 16, 2024