Abstract:Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful responses. Defending against such attacks remains challenging due to the opacity of jailbreaking mechanisms and the high computational cost of training LLMs robustly. We demonstrate that adversarial attacks share a universal mechanism for circumventing LLM safeguards: they ablate a dimension in the residual stream embedding space called the refusal feature. We further show that the operation of refusal feature ablation (RFA) approximates the worst-case perturbation for offsetting model safety. Based on these findings, we propose Refusal Feature Adversarial Training (ReFAT), a novel algorithm that efficiently performs LLM adversarial training by simulating the effect of input-level attacks via RFA. Experimental results show that ReFAT significantly improves the robustness of three popular LLMs against a wide range of adversarial attacks, with considerably less computational overhead compared to existing adversarial training methods.
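At its core, refusal feature ablation removes the component of each residual-stream activation that lies along a single refusal direction. Below is a minimal PyTorch sketch of that projection step; the function name and the random refusal_dir are illustrative assumptions, not the paper's code.

```python
import torch

def ablate_refusal_feature(hidden_states: torch.Tensor,
                           refusal_dir: torch.Tensor) -> torch.Tensor:
    """Remove the component of each residual-stream vector that lies along
    the (unit-norm) refusal direction, i.e. h <- h - (h . r) r."""
    r = refusal_dir / refusal_dir.norm()
    projection = (hidden_states @ r).unsqueeze(-1) * r
    return hidden_states - projection

# Toy usage: batch of 2 sequences, 5 tokens, hidden size 16.
h = torch.randn(2, 5, 16)
r = torch.randn(16)
h_ablated = ablate_refusal_feature(h, r)
# Components along the refusal direction are now numerically zero.
print((h_ablated @ (r / r.norm())).abs().max())
```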
Abstract:We present the LM Transparency Tool (LM-TT), an open-source interactive toolkit for analyzing the internal workings of Transformer-based language models. Unlike existing tools that focus on isolated parts of the decision-making process, our framework is designed to make the entire prediction process transparent and allows tracing model behavior from the top-layer representation back to very fine-grained parts of the model. Specifically, it (1) shows the important parts of the whole input-to-output information flow, (2) attributes any changes made by a model block to individual attention heads and feed-forward neurons, and (3) allows interpreting the functions of those heads or neurons. A crucial part of this pipeline is showing the importance of specific model components at each step. As a result, we can look at the roles of model components only in cases where they are important for a prediction. Since knowing which components should be inspected is key for analyzing large models, where the number of these components is extremely high, we believe our tool will greatly support the interpretability community both in research settings and in practical applications.
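To illustrate what attributing a block's residual update to individual attention heads can look like, here is a toy decomposition in PyTorch; the shapes and the scoring rule are our own illustrative assumptions, not LM-TT's actual implementation.

```python
import torch

# Toy decomposition of an attention block's residual update into per-head
# contributions (illustrative assumption, not LM-TT's implementation).
num_heads, d_model = 8, 64
per_head_updates = torch.randn(num_heads, d_model)  # each head's write to the residual stream
block_update = per_head_updates.sum(dim=0)          # the block's total residual update

# Score each head by how much of the total update it accounts for
# (projection of the head's write onto the normalized block update).
direction = block_update / block_update.norm()
head_importance = per_head_updates @ direction
for h, score in enumerate(head_importance.tolist()):
    print(f"head {h}: contribution {score:+.3f}")
```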
Abstract:Generative language models define distributions over sequences of tokens that can represent essentially any combination of data modalities (e.g., any permutation of image tokens from VQ-VAEs, speech tokens from HuBERT, BPE tokens for language or code, and so on). To better understand the scaling properties of such mixed-modal models, we conducted over 250 experiments using seven different modalities and model sizes ranging from 8 million to 30 billion parameters, trained on 5-100 billion tokens. We report new mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them. Specifically, we explicitly model the synergy and competition due to data and model size as an additive term on top of previous uni-modal scaling laws. We also identify four empirical phenomena observed during training, including emergent coordinate-ascent-style training that naturally alternates between modalities, guidelines for selecting critical hyper-parameters, and connections between mixed-modal competition and training stability. Finally, we test our scaling law by training a 30B speech-text model, which significantly outperforms the corresponding unimodal models. Overall, our research provides valuable insights into the design and training of mixed-modal generative models, an important new class of unified models with unique distributional properties.
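One schematic way to read the "additive term" claim, in our own notation rather than the paper's exact parameterization, is a Chinchilla-style uni-modal law plus an interaction term:

```latex
% Schematic only: our notation, not necessarily the paper's exact parameterization.
% Chinchilla-style uni-modal law for modality i:
\[
  L_i(N, D_i) \;=\; E_i \;+\; \frac{A_i}{N^{\alpha_i}} \;+\; \frac{B_i}{D_i^{\beta_i}}
\]
% Mixed-modal loss for modalities i and j adds an interaction term that captures
% synergy (when negative) or competition (when positive) as a function of model
% size N and per-modality data D_i, D_j:
\[
  L_{i,j}(N, D_i, D_j) \;=\; \text{(uni-modal terms as above)} \;+\; C_{i,j}(N, D_i, D_j)
\]
```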
Abstract:We discover a robust self-supervised strategy tailored to molecular representations for generative masked language models through a series of in-depth ablations. Using this pre-training strategy, we train BARTSmiles, a BART-like model trained with an order of magnitude more compute than previous self-supervised molecular representations. In-depth evaluations show that BARTSmiles consistently outperforms other self-supervised representations across classification, regression, and generation tasks, setting a new state of the art on 11 tasks. We then quantitatively show that, when applied to the molecular domain, the BART objective learns representations that implicitly encode our downstream tasks of interest. For example, by selecting seven neurons from a frozen BARTSmiles model, we can obtain a model whose performance is within two percentage points of the fully fine-tuned model on the ClinTox task. Lastly, we show that standard attribution interpretability methods, when applied to BARTSmiles, highlight certain substructures that chemists use to explain specific properties of molecules. The code and the pretrained model are publicly available.
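A minimal sketch of the seven-neuron probing idea, using synthetic data and a scikit-learn feature selector as a stand-in for the paper's actual recipe; the selection method and shapes are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, f_classif

# Toy stand-ins for frozen, pooled molecular representations and binary labels;
# this is an illustrative assumption, not the paper's code or data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1024))           # 500 molecules x 1024 hidden units
y = (X[:, 3] + X[:, 17] > 0).astype(int)   # synthetic labels for the demo

# Pick the 7 most predictive neurons, then fit a tiny probe on just those.
selector = SelectKBest(f_classif, k=7).fit(X, y)
probe = LogisticRegression(max_iter=1000).fit(selector.transform(X), y)
print("train accuracy of the 7-neuron probe:", probe.score(selector.transform(X), y))
```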
Abstract:Transfer learning from pretrained language models has recently become the dominant approach for solving many NLP tasks. While fine-tuning large language models usually gives the best performance, in many applications it is preferable to tune a much smaller set of parameters, so that the majority of parameters can be shared across multiple tasks. The main approach is to train one or more task-specific layers on top of the language model. In this paper we present an alternative approach based on adversarial reprogramming, which extends earlier work on automatic prompt generation. It learns task-specific word embeddings that, when concatenated to the input text, instruct the language model to solve the specified task. We show that this approach outperforms other methods with a similar number of trainable parameters on the SST-2 and MNLI datasets. On SST-2, the performance of our model is comparable to the fully fine-tuned baseline, while on MNLI it is the best among the methods that do not modify the parameters of the body of the language model.
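A minimal PyTorch sketch of the underlying mechanism, learnable prompt embeddings concatenated in front of a frozen model's input embeddings; the class name and hyper-parameters are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class PromptReprogramming(nn.Module):
    """Illustrative sketch: learn k task-specific embeddings that are
    concatenated in front of the frozen LM's input embeddings."""
    def __init__(self, embed_dim: int, num_prompt_tokens: int = 10):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(num_prompt_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim) from the frozen language model.
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# Only the prompt embeddings (and a small task head) would be trained;
# the language model's own parameters stay frozen.
x = torch.randn(4, 32, 768)
print(PromptReprogramming(768)(x).shape)  # torch.Size([4, 42, 768])
```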
Abstract:This paper describes our submission to the CoNLL 2018 UD Shared Task. We extended an LSTM-based neural network designed for sequence tagging to additionally generate character-level sequences. The network was jointly trained to produce lemmas, part-of-speech tags, and morphological features. Sentence segmentation, tokenization, and dependency parsing were handled by the UDPipe 1.2 baseline. The results demonstrate the viability of the proposed multitask architecture, although its performance still remains far from the state of the art.
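A minimal PyTorch sketch of such a joint architecture, with a shared BiLSTM encoder, a tag classification head, and a character-level decoder for lemmas; all names and sizes are illustrative assumptions, not the submission's code.

```python
import torch
import torch.nn as nn

class JointTaggerLemmatizer(nn.Module):
    """Illustrative sketch: shared BiLSTM encoder, POS-tag head,
    and a character-level decoder for lemma generation."""
    def __init__(self, vocab=1000, chars=100, tags=17, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.tag_head = nn.Linear(2 * dim, tags)            # sequence tagging
        self.char_decoder = nn.LSTM(2 * dim, dim, batch_first=True)
        self.char_head = nn.Linear(dim, chars)               # lemma characters

    def forward(self, words, max_lemma_len=8):
        enc, _ = self.encoder(self.embed(words))             # (B, T, 2*dim)
        tag_logits = self.tag_head(enc)
        # Feed each word's encoding at every decoding step (no attention, for brevity).
        dec_in = enc.unsqueeze(2).expand(-1, -1, max_lemma_len, -1)
        B, T, L, H = dec_in.shape
        dec_out, _ = self.char_decoder(dec_in.reshape(B * T, L, H))
        char_logits = self.char_head(dec_out).reshape(B, T, L, -1)
        return tag_logits, char_logits

model = JointTaggerLemmatizer()
tags, chars = model(torch.randint(0, 1000, (2, 6)))
print(tags.shape, chars.shape)  # (2, 6, 17) and (2, 6, 8, 100)
```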
Abstract:CleverHans is a software library that provides standardized reference implementations of adversarial example construction techniques and adversarial training. The library may be used to develop more robust machine learning models and to provide standardized benchmarks of model performance in the adversarial setting. Benchmarks constructed without a standardized implementation of adversarial example construction are not comparable to each other, because a good result may indicate a robust model or it may merely indicate a weak implementation of the adversarial example construction procedure. This technical report is structured as follows. Section 1 provides an overview of adversarial examples in machine learning and of the CleverHans software. Section 2 presents the core functionalities of the library, namely the attacks based on adversarial examples and the defenses that improve the robustness of machine learning models to these attacks. Section 3 describes how to report benchmark results using the library. Section 4 describes the versioning system.
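For context, here is a plain-PyTorch sketch of the fast gradient sign method, the kind of attack whose reference implementation CleverHans standardizes; this is our own illustration, not the library's API.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.03):
    """Fast gradient sign method: perturb the input by eps in the direction
    of the sign of the loss gradient (illustration, not the CleverHans API)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()

# Toy usage with a throwaway linear "model" on flattened 8x8 inputs.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64, 10))
x = torch.rand(4, 1, 8, 8)
y = torch.randint(0, 10, (4,))
x_adv = fgsm(model, x, y)
print((x_adv - x).abs().max())  # perturbation bounded by eps
```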
Abstract:We have tried to reproduce the results of the paper "Natural Language Inference over Interaction Space", submitted to the ICLR 2018 conference, as part of the ICLR 2018 Reproducibility Challenge. Initially, we were not aware that the code was available, so we started to implement the network from scratch. We evaluated our version of the model on the Stanford NLI dataset and reached 86.38% accuracy on the test set, while the paper claims 88.0% accuracy. The main difference, as we understand it, comes from the optimizers and the way model selection is performed.