Picture for Eric Wallace

Eric Wallace

Tony

OpenAI o1 System Card

Add code
Dec 21, 2024
Viaarxiv icon

Deliberative Alignment: Reasoning Enables Safer Language Models

Add code
Dec 20, 2024
Viaarxiv icon

Predicting Emergent Capabilities by Finetuning

Add code
Nov 25, 2024
Figure 1 for Predicting Emergent Capabilities by Finetuning
Figure 2 for Predicting Emergent Capabilities by Finetuning
Figure 3 for Predicting Emergent Capabilities by Finetuning
Figure 4 for Predicting Emergent Capabilities by Finetuning
Viaarxiv icon

GPT-4o System Card

Add code
Oct 25, 2024
Viaarxiv icon

Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation

Add code
Jun 28, 2024
Figure 1 for Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Figure 2 for Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Figure 3 for Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Figure 4 for Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Viaarxiv icon

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Add code
Apr 19, 2024
Viaarxiv icon

Unfamiliar Finetuning Examples Control How Language Models Hallucinate

Add code
Mar 08, 2024
Figure 1 for Unfamiliar Finetuning Examples Control How Language Models Hallucinate
Figure 2 for Unfamiliar Finetuning Examples Control How Language Models Hallucinate
Figure 3 for Unfamiliar Finetuning Examples Control How Language Models Hallucinate
Figure 4 for Unfamiliar Finetuning Examples Control How Language Models Hallucinate
Viaarxiv icon

What Evidence Do Language Models Find Convincing?

Add code
Feb 19, 2024
Viaarxiv icon

Scalable Extraction of Training Data from (Production) Language Models

Add code
Nov 28, 2023
Figure 1 for Scalable Extraction of Training Data from (Production) Language Models
Figure 2 for Scalable Extraction of Training Data from (Production) Language Models
Figure 3 for Scalable Extraction of Training Data from (Production) Language Models
Figure 4 for Scalable Extraction of Training Data from (Production) Language Models
Viaarxiv icon

Privacy Side Channels in Machine Learning Systems

Add code
Sep 11, 2023
Viaarxiv icon