Title: M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection

May 24, 2023

Yuxia Wang, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shelmanov, Akim Tsvigun, Chenxi Whitehouse, Osama Mohammed Afzal, Tarek Mahmoud, Alham Fikri Aji (+1 more)

Abstract: Large language models (LLMs) have demonstrated a remarkable capability to generate fluent responses to a wide variety of user queries, but this has also raised concerns about the potential misuse of such texts in journalistic, educational, and academic contexts. In this work, we aim to develop automatic systems to identify machine-generated text and to detect potential misuse. We first introduce a large-scale benchmark, M4, which is a multi-generator, multi-domain, and multi-lingual corpus for machine-generated text detection. Using the dataset, we experiment with a number of methods and show that it is challenging for detectors to generalize to unseen examples if they come from different domains or are generated by different large language models. In such cases, detectors tend to misclassify machine-generated text as human-written. These results show that the problem is far from solved and that there is considerable room for improvement. We believe that our dataset M4, which covers different generators, domains, and languages, will enable future research towards more robust approaches for this pressing societal problem. The M4 dataset is available at https://github.com/mbzuai-nlp/M4.

* 11 pages
