Dung Nguyen Manh

The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

May 09, 2023

Dung Nguyen Manh, Nam Le Hai, Anh T. V. Dau, Anh Minh Nguyen, Khanh Nghiem, Jin Guo, Nghi D. Q. Bui

Abstract: We present The Vault, an open-source, large-scale code-text dataset designed to enhance the training of code-focused large language models (LLMs). Existing open-source datasets for training code-based LLMs often face challenges in terms of size, quality (due to noisy signals), and format (only containing code function and text explanation pairings). The Vault overcomes these limitations by providing 40 million code-text pairs across 10 popular programming languages, thorough cleaning for 10+ prevalent issues, and various levels of code-text pairings, including class, function, and line levels. Researchers and practitioners can utilize The Vault for training diverse code-focused LLMs or incorporate the provided data cleaning methods and scripts to improve their datasets. By employing The Vault as the training dataset for code-centric LLMs, we anticipate significant advancements in code understanding and generation tasks, fostering progress in both artificial intelligence research and software development practices.

ViWOZ: A Multi-Domain Task-Oriented Dialogue Systems Dataset For Low-resource Language

Mar 15, 2022

Phi Nguyen Van, Tung Cao Hoang, Dung Nguyen Manh, Quan Nguyen Minh, Long Tran Quoc

Abstract: Most current task-oriented dialogue (ToD) systems, despite having interesting results, are designed for a handful of languages like Chinese and English. Therefore, their performance in low-resource languages is still a significant problem due to the absence of a standard dataset and evaluation policy. To address this problem, we propose ViWOZ, a fully annotated Vietnamese task-oriented dialogue dataset. ViWOZ is the first multi-turn, multi-domain task-oriented dataset in Vietnamese, a low-resource language. The dataset consists of a total of 5,000 dialogues, including 60,946 fully annotated utterances. Furthermore, we provide a comprehensive benchmark of both modular and end-to-end models in low-resource language scenarios. With those characteristics, the ViWOZ dataset enables future studies on creating a multilingual task-oriented dialogue system.
