Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Guchun Zhang

Text-to-Code Generation with Modality-relative Pre-training

Feb 12, 2024

Fenia Christopoulou, Guchun Zhang, Gerasimos Lampouras

Figure 1 for Text-to-Code Generation with Modality-relative Pre-training

Figure 2 for Text-to-Code Generation with Modality-relative Pre-training

Figure 3 for Text-to-Code Generation with Modality-relative Pre-training

Figure 4 for Text-to-Code Generation with Modality-relative Pre-training

Abstract:Large pre-trained language models have recently been expanded and applied to programming language tasks with great success, often through further pre-training of a strictly-natural language model--where training sequences typically contain both natural and (linearised) programming language. Such approaches effectively map both modalities of the sequence into the same embedding space. However, programming language keywords (e.g. "while") often have very strictly defined semantics. As such, transfer learning from their natural language usage may not necessarily be beneficial to their code application and vise versa. Assuming an already pre-trained language model, in this work we investigate how sequence tokens can be adapted and represented differently, depending on which modality they belong to, and to the ultimate benefit of the downstream task. We experiment with separating embedding spaces between modalities during further model pre-training with modality-relative training objectives. We focus on text-to-code generation and observe consistent improvements across two backbone models and two test sets, measuring pass@$k$ and a novel incremental variation.

* Accepted at EACL 2024. 15 pages, 5 figures, 6 tables

Via

Access Paper or Ask Questions

A Systematic Study of Performance Disparities in Multilingual Task-Oriented Dialogue Systems

Oct 19, 2023

Songbo Hu, Han Zhou, Moy Yuan, Milan Gritta, Guchun Zhang, Ignacio Iacobacci, Anna Korhonen, Ivan Vulić

Figure 1 for A Systematic Study of Performance Disparities in Multilingual Task-Oriented Dialogue Systems

Figure 2 for A Systematic Study of Performance Disparities in Multilingual Task-Oriented Dialogue Systems

Figure 3 for A Systematic Study of Performance Disparities in Multilingual Task-Oriented Dialogue Systems

Figure 4 for A Systematic Study of Performance Disparities in Multilingual Task-Oriented Dialogue Systems

Abstract:Achieving robust language technologies that can perform well across the world's many languages is a central goal of multilingual NLP. In this work, we take stock of and empirically analyse task performance disparities that exist between multilingual task-oriented dialogue (ToD) systems. We first define new quantitative measures of absolute and relative equivalence in system performance, capturing disparities across languages and within individual languages. Through a series of controlled experiments, we demonstrate that performance disparities depend on a number of factors: the nature of the ToD task at hand, the underlying pretrained language model, the target language, and the amount of ToD annotated data. We empirically prove the existence of the adaptation and intrinsic biases in current ToD systems: e.g., ToD systems trained for Arabic or Turkish using annotated ToD data fully parallel to English ToD data still exhibit diminished ToD task performance. Beyond providing a series of insights into the performance disparities of ToD systems in different languages, our analyses offer practical tips on how to approach ToD data collection and system development for new languages.

* Accepted to EMNLP 2023

Via

Access Paper or Ask Questions

Multi3WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems

Jul 26, 2023

Songbo Hu, Han Zhou, Mete Hergul, Milan Gritta, Guchun Zhang, Ignacio Iacobacci, Ivan Vulić, Anna Korhonen

Abstract:Creating high-quality annotated data for task-oriented dialog (ToD) is known to be notoriously difficult, and the challenges are amplified when the goal is to create equitable, culturally adapted, and large-scale ToD datasets for multiple languages. Therefore, the current datasets are still very scarce and suffer from limitations such as translation-based non-native dialogs with translation artefacts, small scale, or lack of cultural adaptation, among others. In this work, we first take stock of the current landscape of multilingual ToD datasets, offering a systematic overview of their properties and limitations. Aiming to reduce all the detected limitations, we then introduce Multi3WOZ, a novel multilingual, multi-domain, multi-parallel ToD dataset. It is large-scale and offers culturally adapted dialogs in 4 languages to enable training and evaluation of multilingual and cross-lingual ToD systems. We describe a complex bottom-up data collection process that yielded the final dataset, and offer the first sets of baseline scores across different ToD-related tasks for future reference, also highlighting its challenging nature.

* A pre-MIT Press publication version for TACL

Via

Access Paper or Ask Questions

PanGu-Coder: Program Synthesis with Function-Level Language Modeling

Jul 22, 2022

Fenia Christopoulou, Gerasimos Lampouras, Milan Gritta, Guchun Zhang, Yinpeng Guo, Zhongqi Li, Qi Zhang, Meng Xiao, Bo Shen, Lin Li(+12 more)

Figure 1 for PanGu-Coder: Program Synthesis with Function-Level Language Modeling

Figure 2 for PanGu-Coder: Program Synthesis with Function-Level Language Modeling

Figure 3 for PanGu-Coder: Program Synthesis with Function-Level Language Modeling

Figure 4 for PanGu-Coder: Program Synthesis with Function-Level Language Modeling

Abstract:We present PanGu-Coder, a pretrained decoder-only language model adopting the PanGu-Alpha architecture for text-to-code generation, i.e. the synthesis of programming language solutions given a natural language problem description. We train PanGu-Coder using a two-stage strategy: the first stage employs Causal Language Modelling (CLM) to pre-train on raw programming language data, while the second stage uses a combination of Causal Language Modelling and Masked Language Modelling (MLM) training objectives that focus on the downstream task of text-to-code generation and train on loosely curated pairs of natural language program definitions and code functions. Finally, we discuss PanGu-Coder-FT, which is fine-tuned on a combination of competitive programming problems and code with continuous integration tests. We evaluate PanGu-Coder with a focus on whether it generates functionally correct programs and demonstrate that it achieves equivalent or better performance than similarly sized models, such as CodeX, while attending a smaller context window and training on less data.

* 27 pages

Via

Access Paper or Ask Questions