Qiwei Peng

Concept Space Alignment in Multilingual LLMs

Oct 01, 2024

Qiwei Peng, Anders Søgaard

Abstract: Multilingual large language models (LLMs) seem to generalize somewhat across languages. We hypothesize this is a result of implicit vector space alignment. Evaluating such alignment, we see that larger models exhibit very high-quality linear alignments between corresponding concepts in different languages. Our experiments show that multilingual LLMs suffer from two familiar weaknesses: generalization works best for languages with similar typology, and for abstract concepts. For some models, e.g., the Llama-2 family of models, prompt-based embeddings align better than word embeddings, but the projections are less linear, an observation that holds across almost all model families, indicating that some of the implicitly learned alignments are broken somewhat by prompt-based methods.

* EMNLP 2024
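
To make the idea of a "linear alignment between corresponding concepts" concrete, here is a minimal sketch, not the authors' evaluation code. It assumes concept vectors for two languages are stacked row by row in matrices `X` and `Y`, fits a least-squares linear map on some pairs, and checks nearest-neighbour retrieval on held-out pairs. The data is synthetic and all names (`fit_linear_map`, `retrieval_precision_at_1`) are hypothetical.

```python
import numpy as np

def fit_linear_map(X, Y):
    """Least-squares linear map W minimizing ||X W - Y||_F."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def retrieval_precision_at_1(X, Y, W):
    """Share of source vectors whose nearest target (by cosine) is the paired one."""
    P = X @ W
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    T = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    nearest = (P @ T.T).argmax(axis=1)
    return float((nearest == np.arange(len(X))).mean())

# Synthetic stand-in for concept embeddings in two languages (assumption:
# the target space is a noisy linear image of the source space).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
R = rng.normal(size=(16, 16))
Y = X @ R + 0.01 * rng.normal(size=(200, 16))

W = fit_linear_map(X[:150], Y[:150])                      # fit on 150 pairs
score = retrieval_precision_at_1(X[150:], Y[150:], W)     # test on 50 held-out pairs
```

On real embeddings the relation is only approximately linear, so held-out retrieval accuracy is one possible proxy for how good the linear alignment is.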

Tokenization Falling Short: The Curse of Tokenization

Jun 17, 2024

Yekun Chai, Yewei Fang, Qiwei Peng, Xuhong Li

Abstract: Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary, a process inherently sensitive to typographical errors and length variations, and largely oblivious to the internal structure of tokens, issues we term the curse of tokenization. In this study, we delve into these drawbacks and demonstrate that large language models (LLMs) remain susceptible to these problems. This study systematically investigates these challenges and their impact on LLMs through three critical research questions: (1) complex problem solving, (2) token structure probing, and (3) resilience to typographical variation. Our findings reveal that scaling model parameters can mitigate the issue of tokenization; however, LLMs still suffer from biases induced by typos and other text format variations. Our experiments show that subword regularization such as BPE-dropout can mitigate this issue. We will release our code and data to facilitate further research.
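
As a rough illustration of the subword regularization mentioned in the abstract, the following is a simplified sketch of BPE-dropout on a toy merge table, not the paper's code. During encoding, each applicable merge is skipped with probability `dropout`, so the same word can receive different segmentations. The merge list and the function name `bpe_encode` are assumptions made for this example.

```python
import random

def bpe_encode(word, merges, dropout=0.0, rng=None):
    """Greedy BPE encoding; each candidate merge is dropped with prob. `dropout`."""
    rng = rng or random.Random()
    ranks = {pair: i for i, pair in enumerate(merges)}  # lower rank = higher priority
    tokens = list(word)
    while len(tokens) > 1:
        # Collect merges applicable to the current segmentation, after dropout.
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break
        _, i = min(candidates)                      # apply the highest-priority merge
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens

toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]  # hypothetical merge table

deterministic = bpe_encode("lower", toy_merges)               # standard BPE
characters = bpe_encode("lower", toy_merges, dropout=1.0)     # every merge dropped
sampled = bpe_encode("lower", toy_merges, dropout=0.3, rng=random.Random(0))
```

With `dropout=0.0` this reduces to ordinary greedy BPE; intermediate values yield a mix of coarse and fine segmentations of the same string.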

FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture

Jun 16, 2024

Wenyan Li, Xinyu Zhang, Jiaang Li, Qiwei Peng, Raphael Tang, Li Zhou, Weijia Zhang, Guimin Hu, Yifei Yuan, Anders Søgaard(+2 more)

Abstract: Food is a rich and varied dimension of cultural heritage, crucial to both individuals and social groups. To bridge the gap in the literature on the often-overlooked regional diversity in this domain, we introduce FoodieQA, a manually curated, fine-grained image-text dataset capturing the intricate features of food cultures across various regions in China. We evaluate vision-language models (VLMs) and large language models (LLMs) on newly collected, unseen food images and corresponding questions. FoodieQA comprises three multiple-choice question-answering tasks where models need to answer questions based on multiple images, a single image, and text-only descriptions, respectively. While LLMs excel at text-based question answering, surpassing human accuracy, the open-source VLMs still fall short by 41% on multi-image and 21% on single-image VQA tasks, although closed-weights models perform closer to human levels (within 10%). Our findings highlight that understanding food and its cultural implications remains a challenging and under-explored direction.

HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization

Feb 26, 2024

Qiwei Peng, Yekun Chai, Xuhong Li

Abstract: Large language models (LLMs) have made significant progress in generating code from textual prompts. However, existing benchmarks have mainly concentrated on translating English prompts to multilingual code or have been constrained to very limited natural languages (NLs). These benchmarks have overlooked the vast landscape of massively multilingual NL to multilingual code, leaving a critical gap in the evaluation of multilingual LLMs. In response, we introduce HumanEval-XL, a massively multilingual code generation benchmark specifically crafted to address this deficiency. HumanEval-XL establishes connections between 23 NLs and 12 programming languages (PLs), and comprises a collection of 22,080 prompts with an average of 8.33 test cases. By ensuring parallel data across multiple NLs and PLs, HumanEval-XL offers a comprehensive evaluation platform for multilingual LLMs, allowing the assessment of the understanding of different NLs. Our work serves as a pioneering step towards filling the void in evaluating NL generalization in the area of multilingual code generation. We make our evaluation code and data publicly available at https://github.com/FloatAI/HumanEval-XL.

* LREC-COLING 2024
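
The figures in the abstract can be cross-checked with a little arithmetic. The sketch below assumes the 22,080 prompts are spread evenly over every NL-PL pair; the abstract describes the data as parallel but does not state a per-pair count, so the resulting number is an inference.

```python
num_nls = 23            # natural languages, from the abstract
num_pls = 12            # programming languages, from the abstract
total_prompts = 22_080  # total prompts, from the abstract

pairs = num_nls * num_pls                       # number of NL-PL combinations
prompts_per_pair, remainder = divmod(total_prompts, pairs)
```

The division is exact, which is consistent with a fully parallel benchmark in which each NL-PL pair covers the same set of problems.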

Towards Structure-aware Paraphrase Identification with Phrase Alignment Using Sentence Encoders

Oct 11, 2022

Qiwei Peng, David Weir, Julie Weeds

Abstract: Previous works have demonstrated the effectiveness of utilising pre-trained sentence encoders based on their sentence representations for meaning comparison tasks. Though such representations are shown to capture hidden syntax structures, the direct similarity comparison between them exhibits weak sensitivity to word order and structural differences in given sentences. A single similarity score further makes the comparison process hard to interpret. Therefore, we here propose to combine sentence encoders with an alignment component by representing each sentence as a list of predicate-argument spans (where their span representations are derived from sentence encoders), and decomposing the sentence-level meaning comparison into the alignment between their spans for paraphrase identification tasks. Empirical results show that the alignment component brings in both improved performance and interpretability for various sentence encoders. After closer investigation, the proposed approach indicates increased sensitivity to structural difference and enhanced ability to distinguish non-paraphrases with high lexical overlap.

* COLING 2022 Oral
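
The abstract does not spell out the alignment procedure, so the following is only a hypothetical sketch of the general idea: compare two sentences span by span instead of with one sentence-level score. Each row of `A` and `B` stands in for a span embedding; the greedy best-match rule and the name `align_spans` are assumptions for illustration.

```python
import numpy as np

def align_spans(A, B):
    """Match each span in A to its most similar span in B by cosine similarity."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    sim = A @ B.T                                   # pairwise cosine similarities
    best = sim.argmax(axis=1)                       # best partner for each span in A
    pairs = [(i, int(j), float(sim[i, j])) for i, j in enumerate(best)]
    score = float(np.mean([s for _, _, s in pairs]))
    return pairs, score

# Toy span embeddings: B contains the same three spans as A in a different order.
A = np.eye(3)
B = A[[2, 0, 1]]
pairs, score = align_spans(A, B)
```

Because the output lists which span matched which, a low overall score can be traced back to the specific spans that failed to align, which is the kind of interpretability the abstract refers to.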

Representing Syntax and Composition with Geometric Transformations

Jun 03, 2021

Lorenzo Bertolini, Julie Weeds, David Weir, Qiwei Peng

Abstract: The exploitation of syntactic graphs (SyGs) as a word's context has been shown to be beneficial for distributional semantic models (DSMs), both at the level of individual word representations and in deriving phrasal representations via composition. However, notwithstanding the potential performance benefit, the syntactically-aware DSMs proposed to date have huge numbers of parameters (compared to conventional DSMs) and suffer from data sparsity. Furthermore, the encoding of the SyG links (i.e., the syntactic relations) has been largely limited to linear maps. The knowledge graph literature, on the other hand, has proposed light-weight models employing different geometric transformations (GTs) to encode edges in a knowledge graph (KG). Our work explores the possibility of adopting this family of models to encode SyGs. Furthermore, we investigate which GT better encodes syntactic relations, so that these representations can be used to enhance phrase-level composition via syntactic contextualisation.

* to appear in Findings of ACL 2021
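
For readers unfamiliar with geometric transformations as edge encoders, here is a small sketch of two well-known examples from the knowledge graph literature: translation (as in TransE) and element-wise complex rotation (as in RotatE). Whether the paper evaluates exactly these two is not stated in the abstract; the vectors and function names are illustrative assumptions.

```python
import numpy as np

def translate(head, rel):
    """Translation: the relation is a vector added to the head embedding."""
    return head + rel

def rotate(head, phase):
    """Rotation: each complex coordinate of the head is rotated by an angle."""
    return head * np.exp(1j * phase)

def score(pred, tail):
    """Negative distance between transformed head and tail (higher is better)."""
    return -float(np.linalg.norm(pred - tail))

rng = np.random.default_rng(0)
d = 4
head = rng.normal(size=d)
rel = rng.normal(size=d)
tail = head + rel                                  # a tail that fits the translation exactly
translation_score = score(translate(head, rel), tail)

chead = rng.normal(size=d) + 1j * rng.normal(size=d)
phase = rng.uniform(0, 2 * np.pi, size=d)
rotated = rotate(chead, phase)
```

Both transformations use one parameter per dimension for each relation, whereas a full linear map needs a d × d matrix, which is one reason such models are considered light-weight.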
