He He

Jailbreak Strength and Model Similarity Predict Transferability

Jun 15, 2025

Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors

Jun 12, 2025

Unsupervised Elicitation of Language Models

Jun 11, 2025

Beyond Memorization: Mapping the Originality-Quality Frontier of Language Models

Apr 13, 2025

Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification

Apr 07, 2025

Transformers Struggle to Learn to Search

Dec 06, 2024

Beyond the Binary: Capturing Diverse Preferences With Reward Regularization

Dec 05, 2024

Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats

Nov 26, 2024

Spontaneous Reward Hacking in Iterative Self-Refinement

Jul 05, 2024

LLMs Are Prone to Fallacies in Causal Inference

Jun 18, 2024