Picture for Kavel Rao

Kavel Rao

To Err is AI : A Case Study Informing LLM Flaw Reporting Practices

Add code
Oct 15, 2024
Viaarxiv icon

WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models

Add code
Jun 26, 2024
Figure 1 for WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
Figure 2 for WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
Figure 3 for WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
Figure 4 for WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
Viaarxiv icon

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Add code
Jun 26, 2024
Figure 1 for WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Figure 2 for WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Figure 3 for WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Figure 4 for WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Viaarxiv icon

What Makes it Ok to Set a Fire? Iterative Self-distillation of Contexts and Rationales for Disambiguating Defeasible Social and Moral Situations

Add code
Nov 01, 2023
Viaarxiv icon

Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties

Add code
Sep 02, 2023
Viaarxiv icon