Nouha Dziri

SafetyAnalyst: Interpretable, transparent, and steerable LLM safety moderation

Oct 22, 2024

To Err is AI: A Case Study Informing LLM Flaw Reporting Practices

Oct 15, 2024

Steering Masked Discrete Diffusion Models via Discrete Denoising Posterior Prediction

Oct 10, 2024

AI as Humanity's Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text

Oct 05, 2024

Rel-A.I.: An Interaction-Centered Approach To Measuring Human-LM Reliance

Jul 10, 2024

WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models

Jun 26, 2024

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Jun 26, 2024

WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

Jun 07, 2024

CULTURE-GEN: Revealing Global Cultural Perception in Language Models through Natural Language Prompting

Apr 16, 2024

RewardBench: Evaluating Reward Models for Language Modeling

Mar 20, 2024