Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Albert No

R-TOFU: Unlearning in Large Reasoning Models

May 21, 2025

Sangyeon Yoon, Wonje Jeung, Albert No

Abstract:Large Reasoning Models (LRMs) embed private or copyrighted information not only in their final answers but also throughout multi-step chain-of-thought (CoT) traces, making reliable unlearning far more demanding than in standard LLMs. We introduce Reasoning-TOFU (R-TOFU), the first benchmark tailored to this setting. R-TOFU augments existing unlearning tasks with realistic CoT annotations and provides step-wise metrics that expose residual knowledge invisible to answer-level checks. Using R-TOFU, we carry out a comprehensive comparison of gradient-based and preference-optimization baselines and show that conventional answer-only objectives leave substantial forget traces in reasoning. We further propose Reasoned IDK, a preference-optimization variant that preserves coherent yet inconclusive reasoning, achieving a stronger balance between forgetting efficacy and model utility than earlier refusal styles. Finally, we identify a failure mode: decoding variants such as ZeroThink and LessThink can still reveal forgotten content despite seemingly successful unlearning, emphasizing the need to evaluate models under diverse decoding settings. Together, the benchmark, analysis, and new baseline establish a systematic foundation for studying and improving unlearning in LRMs while preserving their reasoning capabilities.

* 19 pages

Via

Access Paper or Ask Questions

DUSK: Do Not Unlearn Shared Knowledge

May 21, 2025

Wonje Jeung, Sangyeon Yoon, Hyesoo Hong, Soeun Kim, Seungju Han, Youngjae Yu, Albert No

Abstract:Large language models (LLMs) are increasingly deployed in real-world applications, raising concerns about the unauthorized use of copyrighted or sensitive data. Machine unlearning aims to remove such 'forget' data while preserving utility and information from the 'retain' set. However, existing evaluations typically assume that forget and retain sets are fully disjoint, overlooking realistic scenarios where they share overlapping content. For instance, a news article may need to be unlearned, even though the same event, such as an earthquake in Japan, is also described factually on Wikipedia. Effective unlearning should remove the specific phrasing of the news article while preserving publicly supported facts. In this paper, we introduce DUSK, a benchmark designed to evaluate unlearning methods under realistic data overlap. DUSK constructs document sets that describe the same factual content in different styles, with some shared information appearing across all sets and other content remaining unique to each. When one set is designated for unlearning, an ideal method should remove its unique content while preserving shared facts. We define seven evaluation metrics to assess whether unlearning methods can achieve this selective removal. Our evaluation of nine recent unlearning methods reveals a key limitation: while most can remove surface-level text, they often fail to erase deeper, context-specific knowledge without damaging shared content. We release DUSK as a public benchmark to support the development of more precise and reliable unlearning techniques for real-world applications.

* 21 pages

Via

Access Paper or Ask Questions

SEPS: A Separability Measure for Robust Unlearning in LLMs

May 20, 2025

Wonje Jeung, Sangyeon Yoon, Albert No

Abstract:Machine unlearning aims to selectively remove targeted knowledge from Large Language Models (LLMs), ensuring they forget specified content while retaining essential information. Existing unlearning metrics assess whether a model correctly answers retain queries and rejects forget queries, but they fail to capture real-world scenarios where forget queries rarely appear in isolation. In fact, forget and retain queries often coexist within the same prompt, making mixed-query evaluation crucial. We introduce SEPS, an evaluation framework that explicitly measures a model's ability to both forget and retain information within a single prompt. Through extensive experiments across three benchmarks, we identify two key failure modes in existing unlearning methods: (1) untargeted unlearning indiscriminately erases both forget and retain content once a forget query appears, and (2) targeted unlearning overfits to single-query scenarios, leading to catastrophic failures when handling multiple queries. To address these issues, we propose Mixed Prompt (MP) unlearning, a strategy that integrates both forget and retain queries into a unified training objective. Our approach significantly improves unlearning effectiveness, demonstrating robustness even in complex settings with up to eight mixed forget and retain queries in a single prompt.

* 32 pages

Via

Access Paper or Ask Questions

SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment

May 20, 2025

Wonje Jeung, Sangyeon Yoon, Minsuk Kahng, Albert No

Abstract:Large Reasoning Models (LRMs) have become powerful tools for complex problem solving, but their structured reasoning pathways can lead to unsafe outputs when exposed to harmful prompts. Existing safety alignment methods reduce harmful outputs but can degrade reasoning depth, leading to significant trade-offs in complex, multi-step tasks, and remain vulnerable to sophisticated jailbreak attacks. To address this, we introduce SAFEPATH, a lightweight alignment method that fine-tunes LRMs to emit a short, 8-token Safety Primer at the start of their reasoning, in response to harmful prompts, while leaving the rest of the reasoning process unsupervised. Empirical results across multiple benchmarks indicate that SAFEPATH effectively reduces harmful outputs while maintaining reasoning performance. Specifically, SAFEPATH reduces harmful responses by up to 90.0% and blocks 83.3% of jailbreak attempts in the DeepSeek-R1-Distill-Llama-8B model, while requiring 295.9x less compute than Direct Refusal and 314.1x less than SafeChain. We further introduce a zero-shot variant that requires no fine-tuning. In addition, we provide a comprehensive analysis of how existing methods in LLMs generalize, or fail, when applied to reasoning-centric models, revealing critical gaps and new directions for safer AI.

* 22 pages

Via

Access Paper or Ask Questions

Understanding Memorization in Generative Models via Sharpness in Probability Landscapes

Dec 05, 2024

Dongjae Jeon, Dueun Kim, Albert No

Abstract:In this paper, we introduce a geometric framework to analyze memorization in diffusion models using the eigenvalues of the Hessian of the log probability density. We propose that memorization arises from isolated points in the learned probability distribution, characterized by sharpness in the probability landscape, as indicated by large negative eigenvalues of the Hessian. Through experiments on various datasets, we demonstrate that these eigenvalues effectively detect and quantify memorization. Our approach provides a clear understanding of memorization in diffusion models and lays the groundwork for developing strategies to ensure secure and reliable generative models

Via

Access Paper or Ask Questions

Adversarial Sample-Based Approach for Tighter Privacy Auditing in Final Model-Only Scenarios

Dec 02, 2024

Sangyeon Yoon, Wonje Jeung, Albert No

Figure 1 for Adversarial Sample-Based Approach for Tighter Privacy Auditing in Final Model-Only Scenarios

Figure 2 for Adversarial Sample-Based Approach for Tighter Privacy Auditing in Final Model-Only Scenarios

Figure 3 for Adversarial Sample-Based Approach for Tighter Privacy Auditing in Final Model-Only Scenarios

Figure 4 for Adversarial Sample-Based Approach for Tighter Privacy Auditing in Final Model-Only Scenarios

Abstract:Auditing Differentially Private Stochastic Gradient Descent (DP-SGD) in the final model setting is challenging and often results in empirical lower bounds that are significantly looser than theoretical privacy guarantees. We introduce a novel auditing method that achieves tighter empirical lower bounds without additional assumptions by crafting worst-case adversarial samples through loss-based input-space auditing. Our approach surpasses traditional canary-based heuristics and is effective in both white-box and black-box scenarios. Specifically, with a theoretical privacy budget of $\varepsilon = 10.0$, our method achieves empirical lower bounds of $6.68$ in white-box settings and $4.51$ in black-box settings, compared to the baseline of $4.11$ for MNIST. Moreover, we demonstrate that significant privacy auditing results can be achieved using in-distribution (ID) samples as canaries, obtaining an empirical lower bound of $4.33$ where traditional methods produce near-zero leakage detection. Our work offers a practical framework for reliable and accurate privacy auditing in differentially private machine learning.

* 13 pages, 8 figures, NeurIPS (SFLLM Workshop)

Via

Access Paper or Ask Questions

Differentially Private Adaptation of Diffusion Models via Noisy Aggregated Embeddings

Nov 22, 2024

Pura Peetathawatchai, Wei-Ning Chen, Berivan Isik, Sanmi Koyejo, Albert No

Figure 1 for Differentially Private Adaptation of Diffusion Models via Noisy Aggregated Embeddings

Figure 2 for Differentially Private Adaptation of Diffusion Models via Noisy Aggregated Embeddings

Figure 3 for Differentially Private Adaptation of Diffusion Models via Noisy Aggregated Embeddings

Figure 4 for Differentially Private Adaptation of Diffusion Models via Noisy Aggregated Embeddings

Abstract:We introduce novel methods for adapting diffusion models under differential privacy (DP) constraints, enabling privacy-preserving style and content transfer without fine-tuning. Traditional approaches to private adaptation, such as DP-SGD, incur significant computational overhead and degrade model performance when applied to large, complex models. Our approach instead leverages embedding-based techniques: Universal Guidance and Textual Inversion (TI), adapted with differentially private mechanisms. We apply these methods to Stable Diffusion for style adaptation using two private datasets: a collection of artworks by a single artist and pictograms from the Paris 2024 Olympics. Experimental results show that the TI-based adaptation achieves superior fidelity in style transfer, even under strong privacy guarantees, while both methods maintain high privacy resilience by employing calibrated noise and subsampling strategies. Our findings demonstrate a feasible and efficient pathway for privacy-preserving diffusion model adaptation, balancing data protection with the fidelity of generated images, and offer insights into embedding-driven methods for DP in generative AI applications.

Via

Access Paper or Ask Questions

An Information Theoretic Metric for Evaluating Unlearning Models

May 28, 2024

Dongjae Jeon, Wonje Jeung, Taeheon Kim, Albert No, Jonghyun Choi

Figure 1 for An Information Theoretic Metric for Evaluating Unlearning Models

Figure 2 for An Information Theoretic Metric for Evaluating Unlearning Models

Figure 3 for An Information Theoretic Metric for Evaluating Unlearning Models

Figure 4 for An Information Theoretic Metric for Evaluating Unlearning Models

Abstract:Machine unlearning (MU) addresses privacy concerns by removing information of `forgetting data' samples from trained models. Typically, evaluating MU methods involves comparing unlearned models to those retrained from scratch without forgetting data, using metrics such as membership inference attacks (MIA) and accuracy measurements. These evaluations implicitly assume that if the output logits of the unlearned and retrained models are similar, the unlearned model has successfully forgotten the data. Here, we challenge if this assumption is valid. In particular, we conduct a simple experiment of training only the last layer of a given original model using a novel masked-distillation technique while keeping the rest fixed. Surprisingly, simply altering the last layer yields favorable outcomes in the existing evaluation metrics, while the model does not successfully unlearn the samples or classes. For better evaluating the MU methods, we propose a metric that quantifies the residual information about forgetting data samples in intermediate features using mutual information, called information difference index or IDI for short. The IDI provides a comprehensive evaluation of MU methods by efficiently analyzing the internal structure of DNNs. Our metric is scalable to large datasets and adaptable to various model architectures. Additionally, we present COLapse-and-Align (COLA), a simple contrastive-based method that effectively unlearns intermediate features.

Via

Access Paper or Ask Questions

Simple Drop-in LoRA Conditioning on Attention Layers Will Improve Your Diffusion Model

May 07, 2024

Joo Young Choi, Jaesung R. Park, Inkyu Park, Jaewoong Cho, Albert No, Ernest K. Ryu

Figure 1 for Simple Drop-in LoRA Conditioning on Attention Layers Will Improve Your Diffusion Model

Figure 2 for Simple Drop-in LoRA Conditioning on Attention Layers Will Improve Your Diffusion Model

Figure 3 for Simple Drop-in LoRA Conditioning on Attention Layers Will Improve Your Diffusion Model

Figure 4 for Simple Drop-in LoRA Conditioning on Attention Layers Will Improve Your Diffusion Model

Abstract:Current state-of-the-art diffusion models employ U-Net architectures containing convolutional and (qkv) self-attention layers. The U-Net processes images while being conditioned on the time embedding input for each sampling step and the class or caption embedding input corresponding to the desired conditional generation. Such conditioning involves scale-and-shift operations to the convolutional layers but does not directly affect the attention layers. While these standard architectural choices are certainly effective, not conditioning the attention layers feels arbitrary and potentially suboptimal. In this work, we show that simply adding LoRA conditioning to the attention layers without changing or tuning the other parts of the U-Net architecture improves the image generation quality. For example, a drop-in addition of LoRA conditioning to EDM diffusion model yields FID scores of 1.91/1.75 for unconditional and class-conditional CIFAR-10 generation, improving upon the baseline of 1.97/1.79.

Via

Access Paper or Ask Questions

Improved Communication-Privacy Trade-offs in $L_2$ Mean Estimation under Streaming Differential Privacy

May 02, 2024

Wei-Ning Chen, Berivan Isik, Peter Kairouz, Albert No, Sewoong Oh, Zheng Xu

Figure 1 for Improved Communication-Privacy Trade-offs in $L_2$ Mean Estimation under Streaming Differential Privacy

Figure 2 for Improved Communication-Privacy Trade-offs in $L_2$ Mean Estimation under Streaming Differential Privacy

Figure 3 for Improved Communication-Privacy Trade-offs in $L_2$ Mean Estimation under Streaming Differential Privacy

Figure 4 for Improved Communication-Privacy Trade-offs in $L_2$ Mean Estimation under Streaming Differential Privacy

Abstract:We study $L_2$ mean estimation under central differential privacy and communication constraints, and address two key challenges: firstly, existing mean estimation schemes that simultaneously handle both constraints are usually optimized for $L_\infty$ geometry and rely on random rotation or Kashin's representation to adapt to $L_2$ geometry, resulting in suboptimal leading constants in mean square errors (MSEs); secondly, schemes achieving order-optimal communication-privacy trade-offs do not extend seamlessly to streaming differential privacy (DP) settings (e.g., tree aggregation or matrix factorization), rendering them incompatible with DP-FTRL type optimizers. In this work, we tackle these issues by introducing a novel privacy accounting method for the sparsified Gaussian mechanism that incorporates the randomness inherent in sparsification into the DP noise. Unlike previous approaches, our accounting algorithm directly operates in $L_2$ geometry, yielding MSEs that fast converge to those of the uncompressed Gaussian mechanism. Additionally, we extend the sparsification scheme to the matrix factorization framework under streaming DP and provide a precise accountant tailored for DP-FTRL type optimizers. Empirically, our method demonstrates at least a 100x improvement of compression for DP-SGD across various FL tasks.

Via

Access Paper or Ask Questions