Stanislav Fort

Direct Ascent Synthesis: Revealing Hidden Generative Capabilities in Discriminative Models

Feb 11, 2025
Via arXiv

A Note on Implementation Errors in Recent Adaptive Attacks Against Multi-Resolution Self-Ensembles

Jan 24, 2025
Via arXiv

Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness

Aug 08, 2024
Via arXiv

Scaling Laws for Adversarial Attacks on Language Model Activations

Dec 05, 2023
Via arXiv

Multi-attacks: Many images $+$ the same adversarial attack $\to$ many target labels

Aug 04, 2023
Via arXiv

Constitutional AI: Harmlessness from AI Feedback

Dec 15, 2022
Via arXiv

Measuring Progress on Scalable Oversight for Large Language Models

Nov 11, 2022
Via arXiv

What does a deep neural network confidently perceive? The effective dimension of high certainty class manifolds and their low confidence boundaries

Oct 11, 2022
Via arXiv

Language Models (Mostly) Know What They Know

Jul 16, 2022
Via arXiv

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Apr 12, 2022
Via arXiv