Title: Rethinking Data Synthesis: A Teacher Model Training Recipe with Interpretation

Oct 27, 2024

Yifang Chen, David Zhu

Abstract: Recent advances in large language model (LLM) training have highlighted the need for diverse, high-quality instruction data. Many recent works explore synthetic data generation using LLMs. However, they primarily focus on prompt engineering with standard supervised instruction-finetuned models, which have a fundamental limitation: these models are optimized for general question answering and problem solving rather than data generation. We propose a paradigm shift named NOMAD by investigating how to train models specifically for data generation, demonstrating that this task differs significantly from training a classical LM. We identify two key factors: no-prompt-masked training and proper training set size selection. Our method, NOMAD, shows substantial improvements over baselines, achieving >4% gains on TriviaQA and >2% on GSM8K with limited training data. Finally, we offer new insights by interpreting synthetic data through the lenses of "relevance" and "novelty".
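
As a rough illustration of the first factor, the sketch below contrasts prompt-masked and no-prompt-masked label construction for a single training example. It reflects one reading of the abstract, not the authors' implementation: the function name `build_labels`, the toy token IDs, and the use of -100 as the ignored label (the default `ignore_index` in common cross-entropy implementations) are all assumptions made for illustration.

```python
# Assumption: -100 marks positions that the loss function should skip.
IGNORE_INDEX = -100


def build_labels(prompt_ids, response_ids, mask_prompt):
    """Return (input_ids, labels) for one prompt/response training example.

    mask_prompt=True  -> loss is computed on response tokens only
                         (the usual instruction-finetuning setup).
    mask_prompt=False -> loss is computed on every token, so the model is
                         also trained to produce the prompt itself.
    """
    input_ids = list(prompt_ids) + list(response_ids)
    if mask_prompt:
        labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    else:
        labels = list(input_ids)
    return input_ids, labels


# Hypothetical token IDs, used only to show the two label layouts.
prompt = [11, 12, 13]
response = [21, 22]

_, masked_labels = build_labels(prompt, response, mask_prompt=True)
_, unmasked_labels = build_labels(prompt, response, mask_prompt=False)

print(masked_labels)
print(unmasked_labels)
```

In the no-prompt-masked variant the prompt positions keep their token IDs as labels, which is what would let a teacher model learn to generate new instructions as well as answers.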
