Anupam Datta

De-amplifying Bias from Differential Privacy in Language Model Fine-tuning

Feb 07, 2024

Sanjari Srivastava, Piotr Mardziel, Zhikhun Zhang, Archana Ahlawat, Anupam Datta, John C Mitchell

Abstract: Fairness and privacy are two important values machine learning (ML) practitioners often seek to operationalize in models. Fairness aims to reduce model bias for social/demographic sub-groups. Privacy via differential privacy (DP) mechanisms, on the other hand, limits the impact of any individual's training data on the resulting model. The trade-offs between privacy and fairness goals of trustworthy ML pose a challenge to those wishing to address both. We show that DP amplifies gender, racial, and religious bias when fine-tuning large language models (LLMs), producing models more biased than ones fine-tuned without DP. We find the cause of the amplification to be a disparity in convergence of gradients across sub-groups. Through the case of binary gender bias, we demonstrate that Counterfactual Data Augmentation (CDA), a known method for addressing bias, also mitigates bias amplification by DP. As a consequence, DP and CDA together can be used to fine-tune models while maintaining both fairness and privacy.
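
As a rough illustration of the Counterfactual Data Augmentation idea named in this abstract, the sketch below adds a gender-swapped copy of each sentence to a corpus. It is a minimal sketch and not the authors' implementation: the word-pair list `PAIRS` and the helpers `counterfactual` and `augment` are hypothetical names chosen for this example, and a realistic pipeline would need a curated pair list and handling for ambiguous words such as "her".

```python
import re

# Hypothetical, deliberately tiny and unambiguous swap list (an assumption for
# this sketch); real CDA relies on much larger curated word-pair lists.
PAIRS = [("he", "she"), ("man", "woman"), ("father", "mother"),
         ("son", "daughter"), ("king", "queen")]

# Build a bidirectional lookup so each word maps to its counterpart.
SWAP = {}
for a, b in PAIRS:
    SWAP[a] = b
    SWAP[b] = a

_WORD = re.compile(r"\b\w+\b")


def counterfactual(sentence):
    """Return the sentence with every listed gendered word swapped."""
    def repl(match):
        word = match.group(0)
        swapped = SWAP.get(word.lower())
        if swapped is None:
            return word  # not a listed gendered word: leave unchanged
        # Keep a leading capital so sentence-initial words still look right.
        return swapped.capitalize() if word[0].isupper() else swapped
    return _WORD.sub(repl, sentence)


def augment(corpus):
    """Return the corpus with a counterfactual copy after each changed sentence."""
    out = []
    for sentence in corpus:
        out.append(sentence)
        swapped = counterfactual(sentence)
        if swapped != sentence:
            out.append(swapped)
    return out


if __name__ == "__main__":
    for line in augment(["He is a father.", "The sky is blue."]):
        print(line)
```

Sentences without any listed word are kept once, so the augmented corpus only grows where a swap actually changes the text.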

Is Certifying $\ell_p$ Robustness Still Worthwhile?

Oct 13, 2023

Ravi Mangal, Klas Leino, Zifan Wang, Kai Hu, Weicheng Yu, Corina Pasareanu, Anupam Datta, Matt Fredrikson

Abstract: Over the years, researchers have developed myriad attacks that exploit the ubiquity of adversarial examples, as well as defenses that aim to guard against the security vulnerabilities posed by such attacks. Of particular interest to this paper are defenses that provide provable guarantees against the class of $\ell_p$-bounded attacks. Certified defenses have made significant progress, taking robustness certification from toy models and datasets to large-scale problems like ImageNet classification. While this is undoubtedly an interesting academic problem, as the field has matured, its impact in practice remains unclear; we therefore find it useful to revisit the motivation for continuing this line of research. There are three layers to this inquiry, which we address in this paper: (1) why do we care about robustness research? (2) why do we care about the $\ell_p$-bounded threat model? And (3) why do we care about certification as opposed to empirical defenses? In brief, we take the position that local robustness certification indeed confers practical value to the field of machine learning. We focus especially on the latter two questions. With respect to the first of the two, we argue that the $\ell_p$-bounded threat model acts as a minimal requirement for safe application of models in security-critical domains, while at the same time, evidence has mounted suggesting that local robustness may lead to downstream external benefits not immediately related to robustness. As for the second, we argue that (i) certification provides a resolution to the cat-and-mouse game of adversarial attacks; and furthermore, that (ii) perhaps contrary to popular belief, there may not exist a fundamental trade-off between accuracy, robustness, and certifiability, while moreover, certified training techniques constitute a particularly promising way for learning robust models.

Identifying and Mitigating the Security Risks of Generative AI

Aug 28, 2023

Clark Barrett, Brad Boyd, Ellie Burzstein, Nicholas Carlini, Brad Chen, Jihye Choi, Amrita Roy Chowdhury, Mihai Christodorescu, Anupam Datta, Soheil Feizi(+13 more)

Abstract: Every major technical invention resurfaces the dual-use dilemma -- the new technology has the potential to be used for good as well as for harm. Generative AI (GenAI) techniques, such as large language models (LLMs) and diffusion models, have shown remarkable capabilities (e.g., in-context learning, code-completion, and text-to-image generation and editing). However, GenAI can be used just as well by attackers to generate new attacks and increase the velocity and efficacy of existing attacks. This paper reports the findings of a workshop held at Google (co-organized by Stanford University and the University of Wisconsin-Madison) on the dual-use dilemma posed by GenAI. This paper is not meant to be comprehensive, but is rather an attempt to synthesize some of the interesting findings from the workshop. We discuss short-term and long-term goals for the community on this topic. We hope this paper provides both a launching point for a discussion on this important topic as well as interesting problems that the research community can work to address.

Order-sensitive Shapley Values for Evaluating Conceptual Soundness of NLP Models

Jun 01, 2022

Kaiji Lu, Anupam Datta

Abstract: Previous works show that deep NLP models are not always conceptually sound: they do not always learn the correct linguistic concepts. Specifically, they can be insensitive to word order. In order to systematically evaluate models for their conceptual soundness with respect to word order, we introduce a new explanation method for sequential data: Order-sensitive Shapley Values (OSV). We conduct an extensive empirical evaluation to validate the method and surface how well various deep NLP models learn word order. Using synthetic data, we first show that OSV is more faithful in explaining model behavior than gradient-based methods. Second, applying OSV to the HANS dataset, we discover that the BERT-based NLI model uses only word occurrences and not word order. Although simple data augmentation improves accuracy on HANS, OSV shows that the augmented model does not fundamentally improve the model's learning of order. Third, we discover that not all sentiment analysis models learn negation properly: some fail to capture the correct syntax of the negation construct. Finally, we show that pretrained language models such as BERT may rely on the absolute positions of subject words to learn long-range Subject-Verb Agreement. With each NLP task, we also demonstrate how OSV can be leveraged to generate adversarial examples.
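
For background only, the sketch below computes ordinary (order-insensitive) Shapley values by averaging marginal contributions over all player orderings. It does not implement OSV, whose definition is not given in this abstract. The scoring function `toy_score` is a hypothetical bag-of-words stand-in for a model and is an assumption of this example; because it sees only which words are present, the attribution it yields cannot reflect where the words sit in a sentence.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings.

    `value` maps a frozenset of players to a number. The cost grows
    factorially, so this is only suitable for a handful of players.
    """
    totals = {p: 0.0 for p in players}
    n_orderings = 0
    for ordering in permutations(players):
        n_orderings += 1
        coalition = set()
        for p in ordering:
            before = value(frozenset(coalition))
            coalition.add(p)
            after = value(frozenset(coalition))
            totals[p] += after - before
    return {p: total / n_orderings for p, total in totals.items()}


def toy_score(words):
    """Hypothetical sentiment score that depends only on which words are present."""
    score = 0.0
    if "good" in words:
        score += 1.0
    if "not" in words and "good" in words:
        score -= 2.0  # negation flips the sentiment of "good"
    return score


if __name__ == "__main__":
    print(shapley_values(["not", "good", "movie"], toy_score))
```

The values sum to the score of the full word set, which is the efficiency property of Shapley values.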

Faithful Explanations for Deep Graph Models

May 24, 2022

Zifan Wang, Yuhang Yao, Chaoran Zhang, Han Zhang, Youjie Kang, Carlee Joe-Wong, Matt Fredrikson, Anupam Datta

Abstract: This paper studies faithful explanations for Graph Neural Networks (GNNs). First, we provide a new and general method for formally characterizing the faithfulness of explanations for GNNs. It applies to existing explanation methods, including feature attributions and subgraph explanations. Second, our analytical and empirical results demonstrate that feature attribution methods cannot capture the nonlinear effect of edge features, while existing subgraph explanation methods are not faithful. Third, we introduce k-hop Explanation with a Convolutional Core (KEC), a new explanation method that provably maximizes faithfulness to the original GNN by leveraging information about the graph structure in its adjacency matrix and its k-th power. Lastly, our empirical results over both synthetic and real-world datasets for classification and anomaly detection tasks with GNNs demonstrate the effectiveness of our approach.

Consistent Counterfactuals for Deep Models

Oct 06, 2021

Emily Black, Zifan Wang, Matt Fredrikson, Anupam Datta

Abstract: Counterfactual examples are one of the most commonly cited methods for explaining the predictions of machine learning models in key areas such as finance and medical diagnosis. Counterfactuals are often discussed under the assumption that the model on which they will be used is static, but in deployment models may be periodically retrained or fine-tuned. This paper studies the consistency of model prediction on counterfactual examples in deep networks under small changes to initial training conditions, such as weight initialization and leave-one-out variations in data, as often occur during model deployment. We demonstrate experimentally that counterfactual examples for deep models are often inconsistent across such small changes, and that increasing the cost of the counterfactual, a stability-enhancing mitigation suggested by prior work in the context of simpler models, is not a reliable heuristic in deep networks. Rather, our analysis shows that a model's local Lipschitz continuity around the counterfactual is key to its consistency across related models. To this end, we propose Stable Neighbor Search as a way to generate more consistent counterfactual explanations, and illustrate the effectiveness of this approach on several benchmark datasets.

Boundary Attributions Provide Normal (Vector) Explanations

Mar 23, 2021

Zifan Wang, Matt Fredrikson, Anupam Datta

Abstract: Recent work on explaining Deep Neural Networks (DNNs) focuses on attributing the model's output scores to input features. However, when it comes to classification problems, a more fundamental question is how much each feature contributes to the model's decision to classify an input instance into a specific class. Our first contribution is Boundary Attribution (BA), a new explanation method to address this question. BA leverages an understanding of the geometry of activation regions. Specifically, it involves computing (and aggregating) normal vectors of the local decision boundaries for the target input. Our second contribution is a set of analytical results connecting the adversarial robustness of the network and the quality of gradient-based explanations. Specifically, we prove two theorems for ReLU networks: BA of randomized smoothed networks or robustly trained networks is much closer to non-boundary attribution methods than that in standard networks. These analytical results encourage users to improve model robustness for high-quality explanations. Finally, we evaluate the proposed methods on ImageNet and show that BAs produce more concentrated and sharper visualizations compared with non-boundary ones. We further demonstrate that our method also helps to reduce the sensitivity of attributions to the baseline input if one is required.

Abstracting Influence Paths for Explaining (Contextualization of) BERT Models

Nov 02, 2020

Kaiji Lu, Zifan Wang, Piotr Mardziel, Anupam Datta

Abstract: While "attention is all you need" may be proving true, we do not yet know why: attention-based models such as BERT are superior, but how they contextualize information even for simple grammatical rules such as subject-verb number agreement (SVA) is uncertain. We introduce multi-partite patterns, abstractions of sets of paths through a neural network model. Patterns quantify and localize the effect of an input concept (e.g., a subject's number) on an output concept (e.g., the corresponding verb's number) to paths passing through a sequence of model components, thus surfacing how BERT contextualizes information. We describe guided pattern refinement, an efficient search procedure for finding patterns representative of concept-critical paths. We discover that patterns generate succinct and meaningful explanations for BERT, highlighted by "copy" and "transfer" operations implemented by skip connections and attention heads, respectively. We also show how pattern visualizations help us understand how BERT contextualizes various grammatical concepts, such as SVA across clauses, and why it makes errors in some cases while succeeding in others.

Towards Behavior-Level Explanation for Deep Reinforcement Learning

Sep 17, 2020

Xuan Chen, Zifan Wang, Yucai Fan, Bonan Jin, Piotr Mardziel, Carlee Joe-Wong, Anupam Datta

Abstract: While Deep Neural Networks (DNNs) are becoming the state-of-the-art for many tasks including reinforcement learning (RL), they are especially resistant to human scrutiny and understanding. Input attributions have been a foundational building block for DNN explainability but face new challenges when applied to deep RL. We address the challenges with two novel techniques. We define a class of behaviour-level attributions for explaining agent behaviour beyond input importance and interpret existing attribution methods on the behaviour level. We then introduce $\lambda$-alignment, a metric for evaluating the performance of behaviour-level attribution methods in terms of whether they are indicative of the agent actions they are meant to explain. Our experiments on Atari games suggest that perturbation-based attribution methods are significantly more suitable to deep RL than alternatives from the perspective of this metric. We argue that our methods demonstrate the minimal set of considerations for adapting general DNN explanation technology to the unique aspects of reinforcement learning and hope the outlined direction can serve as a basis for future research on understanding Deep RL using attribution.

Fairness Under Feature Exemptions: Counterfactual and Observational Measures

Jun 14, 2020

Sanghamitra Dutta, Praveen Venkatesh, Piotr Mardziel, Anupam Datta, Pulkit Grover

Abstract: With the growing use of AI in highly consequential domains, the quantification and removal of bias with respect to protected attributes, such as gender, race, etc., is becoming increasingly important. While quantifying bias is essential, sometimes the needs of a business (e.g., hiring) may require the use of certain features that are critical in a way that any bias that can be explained by them might need to be exempted. For example, a standardized test score may be a critical feature that should be weighed strongly in hiring even if biased, whereas other features, such as zip code, may be used only to the extent that they do not discriminate. In this work, we propose a novel information-theoretic decomposition of the total bias (in a counterfactual sense) into a non-exempt component that quantifies the part of the bias that cannot be accounted for by the critical features, and an exempt component which quantifies the remaining bias. This decomposition allows one to check if the bias arose purely due to the critical features (inspired by the business necessity defense of disparate impact law) and also enables selective removal of the non-exempt component if desired. We arrive at this decomposition through examples that lead to a set of desirable properties (axioms) that any measure of non-exempt bias should satisfy. We demonstrate that our proposed counterfactual measure satisfies all of them. Our quantification bridges ideas of causality, Simpson's paradox, and a body of work from information theory called Partial Information Decomposition. We also obtain an impossibility result showing that no observational measure of non-exempt bias can satisfy all of the desirable properties, which leads us to relax our goals and examine observational measures that satisfy only some of these properties. We then perform case studies to show how one can train models while reducing non-exempt bias.

* Journal version (Shorter version was accepted at AAAI 2020 as an oral presentation)
