Title: Better Synthetic Data by Retrieving and Transforming Existing Datasets

Apr 26, 2024

Saumya Gandhi, Ritu Gala, Vijay Viswanathan, Tongshuang Wu, Graham Neubig

Abstract: Despite recent advances in large language models, building dependable and deployable NLP models typically requires abundant, high-quality training data. However, task-specific data is not available for many use cases, and manually curating task-specific data is labor-intensive. Recent work has studied prompt-driven synthetic data generation using large language models, but these generated datasets tend to lack complexity and diversity. To address these limitations, we introduce a method, DataTune, to make better use of existing, publicly available datasets to improve automatic dataset generation. DataTune performs dataset transformation, enabling the repurposing of publicly available datasets into a format that is directly aligned with the specific requirements of target tasks. On a diverse set of language-based tasks from the BIG-Bench benchmark, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49% and improves over existing methods that use synthetic or retrieved training data by 34%. We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks. We integrate DataTune into an open-source repository to make this method accessible to the community: https://github.com/neulab/prompt2model.
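The retrieve-then-transform idea in the abstract can be illustrated with a toy sketch. This is not the DataTune implementation and does not use the prompt2model API: the abstract does not say how retrieval or transformation is carried out, so both steps below are stand-ins. Retrieval is assumed to be word overlap between descriptions, the transformation is assumed to be a hand-written per-row function, and every name (`Dataset`, `retrieve`, `transform`, `to_task_format`) and every data row is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Dataset:
    """Hypothetical stand-in for a publicly available dataset."""
    name: str
    description: str
    rows: list


def _tokens(text: str) -> set:
    return set(text.lower().split())


def retrieve(task_description: str, candidates: list) -> Dataset:
    """Assumed retrieval step: pick the dataset whose description has the
    highest Jaccard word overlap with the task description."""
    task = _tokens(task_description)

    def score(ds: Dataset) -> float:
        desc = _tokens(ds.description)
        union = task | desc
        return len(task & desc) / len(union) if union else 0.0

    return max(candidates, key=score)


def transform(dataset: Dataset,
              row_transform: Callable[[dict], Optional[dict]]) -> list:
    """Assumed transformation step: map each source row into the target
    task's format, dropping rows the transform rejects (returns None)."""
    out = []
    for row in dataset.rows:
        new_row = row_transform(row)
        if new_row is not None:
            out.append(new_row)
    return out


def to_task_format(row: dict) -> Optional[dict]:
    """Hand-written example transform: build an input/output pair and
    skip rows that have no answer."""
    if not row.get("answer"):
        return None
    return {
        "input": f"Passage: {row['context']}\nQuestion: {row['question']}",
        "output": row["answer"],
    }


# Made-up candidate datasets, for illustration only.
candidates = [
    Dataset(
        name="toy_sentiment",
        description="movie reviews labeled with positive or negative sentiment",
        rows=[{"text": "Great film", "label": "positive"}],
    ),
    Dataset(
        name="toy_reading_comprehension",
        description="passages paired with a question and an answer",
        rows=[
            {"context": "Paris is the capital of France.",
             "question": "What is the capital of France?",
             "answer": "Paris"},
            {"context": "", "question": "Unanswerable?", "answer": ""},
        ],
    ),
]

chosen = retrieve("answer a question about a short passage", candidates)
examples = transform(chosen, to_task_format)
```

The point of the sketch is the division of labor: one step selects an existing dataset relevant to the task, and a second step rewrites its rows into the target format while filtering out unusable ones. The resulting `examples` list is what a finetuning step would consume.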

* PDF fixed in v3
