
Taesung Lee

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Jan 31, 2025

Towards Generating Informative Textual Description for Neurons in Language Models

Jan 30, 2024

URET: Universal Robustness Evaluation Toolkit (for Evasion)

Aug 03, 2023

Matching Pairs: Attributing Fine-Tuned Models to their Pre-Trained Large Language Models

Jun 15, 2023

Robustness of Explanation Methods for NLP Models

Jun 24, 2022

Adaptive Verifiable Training Using Pairwise Class Similarity

Dec 14, 2020

A new measure for overfitting and its implications for backdooring of deep learning

Jun 18, 2020

Detecting Backdoor Attacks on Deep Neural Networks by Activation Clustering

Nov 09, 2018

Defending Against Model Stealing Attacks Using Deceptive Perturbations

Sep 19, 2018