Stanislav Fort

Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness

Aug 08, 2024

Scaling Laws for Adversarial Attacks on Language Model Activations

Dec 05, 2023

Multi-attacks: Many images + the same adversarial attack → many target labels

Aug 04, 2023

Constitutional AI: Harmlessness from AI Feedback

Dec 15, 2022

Measuring Progress on Scalable Oversight for Large Language Models

Nov 11, 2022

What does a deep neural network confidently perceive? The effective dimension of high certainty class manifolds and their low confidence boundaries

Oct 11, 2022

Language Models (Mostly) Know What They Know

Jul 16, 2022

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Apr 12, 2022

Adversarial vulnerability of powerful near out-of-distribution detection

Jan 18, 2022

How many degrees of freedom do we need to train deep networks: a loss landscape perspective

Jul 13, 2021