Jacob Hilton

Shammie

Obfuscated Activations Bypass LLM Latent-Space Defenses

Dec 12, 2024
Via arXiv

Estimating the Probabilities of Rare Outputs in Language Models

Oct 17, 2024
Via arXiv

Towards a Law of Iterated Expectations for Heuristic Estimators

Oct 02, 2024
Via arXiv

Backdoor defense, learnability and obfuscation

Sep 04, 2024
Via arXiv

Scaling laws for single-agent reinforcement learning

Jan 31, 2023
Via arXiv

Scaling Laws for Reward Model Overoptimization

Oct 19, 2022
Via arXiv

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Jun 10, 2022
Via arXiv

Teaching Models to Express Their Uncertainty in Words

May 28, 2022
Via arXiv

Training language models to follow instructions with human feedback

Mar 04, 2022
Via arXiv

WebGPT: Browser-assisted question-answering with human feedback

Dec 17, 2021
Via arXiv