Title: CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training

Oct 14, 2021

Patrick Huber, Armen Aghajanyan, Barlas Oğuz, Dmytro Okhonko, Wen-tau Yih, Sonal Gupta, Xilun Chen

Abstract: With the rise of large-scale pre-trained language models, open-domain question answering (ODQA) has become an important research topic in NLP. Based on the popular pre-training and fine-tuning approach, we posit that an additional in-domain pre-training stage using a large-scale, natural, and diverse question-answering (QA) dataset can be beneficial for ODQA. Consequently, we propose a novel QA dataset based on the Common Crawl project. Using the readily available schema.org annotation, we extract around 130 million multilingual question-answer pairs, including about 60 million English data points. With this previously unseen number of natural QA pairs, we pre-train popular language models to show the potential of large-scale in-domain pre-training for the task of question answering. In our experiments, we find that pre-training question-answering models on our Common Crawl Question Answering dataset (CCQA) achieves promising results in zero-shot, low-resource, and fine-tuned settings across multiple tasks, models, and benchmarks.
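To make the extraction idea in the abstract concrete, here is a minimal sketch of pulling question-answer pairs out of a single HTML page that carries schema.org `Question` and `Answer` microdata. It is an illustration only, not the authors' pipeline: the class `SchemaOrgQAParser`, the function `extract_qa_pairs`, and the choice to read the question from `itemprop="name"` and the answer from `itemprop="text"` are assumptions made for this example, and the sketch ignores details such as JSON-LD markup, question bodies, and nested questions.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class SchemaOrgQAParser(HTMLParser):
    """Hypothetical helper: collects (question, answer) pairs from microdata."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pairs = []       # finished (question, answer) tuples
        self._stack = []      # (tag, role) for every open element
        self._current = None  # question currently being assembled

    def _enclosing_item(self):
        # Nearest open ancestor that is a Question or Answer item.
        for _, role in reversed(self._stack):
            if role in ("question", "answer"):
                return role
        return None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr = dict(attrs)
        itemtype = (attr.get("itemtype") or "").rstrip("/")
        itemprops = (attr.get("itemprop") or "").split()
        role = None
        if itemtype.endswith("schema.org/Question") and self._current is None:
            role = "question"
            self._current = {"question": [], "answers": []}
        elif itemtype.endswith("schema.org/Answer") and self._current is not None:
            role = "answer"
            self._current["answers"].append([])
        elif "name" in itemprops and self._enclosing_item() == "question":
            role = "q_name"   # assumption: question text lives in itemprop="name"
        elif "text" in itemprops and self._enclosing_item() == "answer":
            role = "a_text"   # assumption: answer text lives in itemprop="text"
        self._stack.append((tag, role))

    def handle_data(self, data):
        # Route text to the innermost element that captures text, if any.
        for _, role in reversed(self._stack):
            if role == "q_name":
                self._current["question"].append(data)
                return
            if role == "a_text":
                self._current["answers"][-1].append(data)
                return

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Find the nearest open element with this tag name; ignore stray closes.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                break
        else:
            return
        closed = self._stack[i:]
        del self._stack[i:]
        if any(role == "question" for _, role in closed):
            self._finish_question()

    def _finish_question(self):
        # Normalize whitespace and emit one pair per non-empty answer.
        question = " ".join("".join(self._current["question"]).split())
        for parts in self._current["answers"]:
            answer = " ".join("".join(parts).split())
            if question and answer:
                self.pairs.append((question, answer))
        self._current = None


def extract_qa_pairs(html):
    """Return a list of (question, answer) tuples found in one HTML string."""
    parser = SchemaOrgQAParser()
    parser.feed(html)
    parser.close()
    return parser.pairs
```

Calling `extract_qa_pairs(page_html)` on each crawled page would yield one tuple per answer, so a question with several answers contributes several pairs. A web-scale run would additionally need language identification, deduplication, and filtering, none of which this sketch attempts.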