Abhinav Rao

Jailbreak Paradox: The Achilles' Heel of LLMs

Jun 18, 2024

Abhinav Rao, Monojit Choudhury, Somak Aditya

Abstract: We introduce two paradoxes concerning the jailbreaking of foundation models: first, it is impossible to construct a perfect jailbreak classifier, and second, a weaker model cannot consistently detect whether a stronger (in a Pareto-dominant sense) model is jailbroken. We provide formal proofs for these paradoxes and a short case study on Llama and GPT-4o to demonstrate them. We discuss the broader theoretical and practical repercussions of these results.

NORMAD: A Benchmark for Measuring the Cultural Adaptability of Large Language Models

Apr 18, 2024

Abhinav Rao, Akhila Yerukola, Vishwa Shah, Katharina Reinecke, Maarten Sap

Abstract: The integration of Large Language Models (LLMs) into various global cultures fundamentally presents a cultural challenge: LLMs must navigate interactions, respect social norms, and avoid transgressing cultural boundaries. However, it is still unclear if LLMs can adapt their outputs to diverse cultural norms. Our study focuses on this aspect. We introduce NormAd, a novel dataset, which includes 2.6k stories that represent social and cultural norms from 75 countries, to assess the ability of LLMs to adapt to different granular levels of socio-cultural contexts such as the country of origin, its associated cultural values, and prevalent social norms. Our study reveals that LLMs struggle with cultural reasoning across all contextual granularities, showing stronger adaptability to English-centric cultures over those from the Global South. Even with explicit social norms, the top-performing model, Mistral-7b-Instruct, achieves only 81.8% accuracy, lagging behind the 95.6% achieved by humans. Evaluation on NormAd further reveals that LLMs struggle to adapt to stories involving gift-giving across cultures. Due to inherent agreement or sycophancy biases, LLMs find it considerably easier to assess the social acceptability of stories that adhere to cultural norms than those that deviate from them. Our benchmark measures the cultural adaptability (or lack thereof) of LLMs, emphasizing the potential to make these technologies more equitable and useful for global audiences.

* Preprint. In Review

Ethical Reasoning over Moral Alignment: A Case and Framework for In-Context Ethical Policies in LLMs

Oct 11, 2023

Abhinav Rao, Aditi Khandelwal, Kumar Tanmay, Utkarsh Agarwal, Monojit Choudhury

Abstract: In this position paper, we argue that instead of morally aligning LLMs to a specific set of ethical principles, we should infuse generic ethical reasoning capabilities into them so that they can handle value pluralism at a global scale. When provided with an ethical policy, an LLM should be capable of making decisions that are ethically consistent with the policy. We develop a framework that integrates moral dilemmas with moral principles pertaining to different formalisms of normative ethics, and at different levels of abstraction. Initial experiments with GPT-x models show that while GPT-4 is a nearly perfect ethical reasoner, the models still have a bias towards the moral values of Western and English-speaking societies.

Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks

May 24, 2023

Abhinav Rao, Sachin Vashistha, Atharva Naik, Somak Aditya, Monojit Choudhury

Abstract: Recent explorations with commercial Large Language Models (LLMs) have shown that non-expert users can jailbreak LLMs by simply manipulating the prompts, resulting in degenerate output behavior, privacy and security breaches, offensive outputs, and violations of content regulator policies. Few formal studies have been carried out to formalize and analyze these attacks and their mitigations. We bridge this gap by proposing a formalism and a taxonomy of known (and possible) jailbreaks. We perform a survey of existing jailbreak methods and their effectiveness on open-source and commercial LLMs (such as GPT 3.5, OPT, BLOOM, and FLAN-T5-xxl). We further propose a limited set of prompt guards and discuss their effectiveness against known attack types.

Punctuation Restoration for Singaporean Spoken Languages: English, Malay, and Mandarin

Dec 10, 2022

Abhinav Rao, Ho Thi-Nga, Chng Eng-Siong

Abstract: This paper presents work on restoring punctuation for ASR transcripts generated by multilingual ASR systems. The focus languages are English, Mandarin, and Malay, which are three of the most popular languages in Singapore. To the best of our knowledge, this is the first system that can tackle punctuation restoration for these three languages simultaneously. Traditional approaches usually treat the task as a sequential labeling task; however, this work adopts a slot-filling approach that predicts the presence and type of punctuation marks at each word boundary. The approach is similar to the Masked-Language Model approach employed during the pre-training stages of BERT, but instead of predicting the masked word, our model predicts masked punctuation. Additionally, we find that using Jieba instead of only using the built-in SentencePiece tokenizer of XLM-R can significantly improve the performance of punctuating Mandarin transcripts. Experimental results on English and Mandarin IWSLT2022 datasets and Malay News show that the proposed approach achieved state-of-the-art results for Mandarin with 73.8% F1-score while maintaining a reasonable F1-score for English and Malay, i.e. 74.7% and 78% respectively. Our source code that allows reproducing the results and building a simple web-based application for demonstration purposes is available on GitHub.
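
The slot-filling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the label set, the `MASK` placeholder string, and the helper names `build_slots` and `fill_slots` are assumptions made for illustration, and the model that would predict a label for each slot is omitted.

```python
from typing import Dict, List

# Placeholder slot token (an assumption; a real system would use its tokenizer's mask token).
MASK = "<mask>"

# Hypothetical label set: one label per word boundary, "O" meaning no punctuation.
PUNCT: Dict[str, str] = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}


def build_slots(words: List[str]) -> List[str]:
    """Insert one masked slot after every word, i.e. at each word boundary."""
    slotted: List[str] = []
    for word in words:
        slotted.extend([word, MASK])
    return slotted


def fill_slots(words: List[str], labels: List[str]) -> str:
    """Turn per-boundary labels (one per word) into punctuated text."""
    if len(words) != len(labels):
        raise ValueError("expected exactly one label per word boundary")
    return " ".join(word + PUNCT[label] for word, label in zip(words, labels))


words = ["hello", "how", "are", "you"]
model_input = build_slots(words)
# The labels below stand in for the predictions a trained model would produce.
punctuated = fill_slots(words, ["COMMA", "O", "O", "QUESTION"])
```

In this framing the model classifies each slot rather than each word, which is what distinguishes it from the sequence-labeling formulation mentioned in the abstract.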
