[221] Scaling Synthetic Data Creation with 1,000,000,000 Personas

TL;DR

I read this because... : data synthesis
task : data synthesis, augmentation
problem : more diverse data synthesis
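idea : infer personas from corpus text (corpus-to-persona), then prompt an LLM with each persona to synthesize instruction data (persona-to-instruction) or a personalized text corpus (persona-to-text)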
input/output : corpus -> persona -> instruction data or personalized corpus
architecture : Qwen2-7B
objective : cross-entropy (CE) loss
baseline : SOTA LLMs
data : 200K persona hub, 150K problems (proposed)
evaluation : held-out test set, MATH
result : the fine-tuned Qwen2-7B performs strongly on MATH
contribution :
etc. :
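
sketch : a minimal Python sketch of the corpus -> persona -> instruction flow from the idea and input/output lines above. The prompt wording, the function names, and the `generate` callable (any text-in, text-out LLM wrapper) are my own assumptions, not the paper's templates or code. The exact-match persona dedup is also my simplification, added only to keep the personas distinct.

```python
from typing import Callable, Dict, Iterable, List

# Illustrative prompt wording only; these are NOT the paper's exact templates.
PERSONA_PROMPT = (
    "Who is likely to read, write, or like the following text?\n\n"
    "{text}\n\nPersona:"
)
TASK_PROMPT = (
    "Create a {task} that the following persona might pose.\n\n"
    "Persona: {persona}\n\n{task}:"
)


def corpus_to_persona(text: str, generate: Callable[[str], str]) -> str:
    """Ask the LLM for a short persona description inferred from one text."""
    return generate(PERSONA_PROMPT.format(text=text)).strip()


def persona_to_instruction(
    persona: str, generate: Callable[[str], str], task: str = "math problem"
) -> str:
    """Ask the LLM to write one instruction from the persona's perspective."""
    return generate(TASK_PROMPT.format(task=task, persona=persona)).strip()


def synthesize(
    corpus: Iterable[str],
    generate: Callable[[str], str],
    task: str = "math problem",
) -> List[Dict[str, str]]:
    """Run corpus -> persona -> instruction and collect one row per new persona."""
    seen = set()
    rows = []
    for text in corpus:
        persona = corpus_to_persona(text, generate)
        if persona in seen:
            # Exact-match dedup only (assumed simplification).
            continue
        seen.add(persona)
        rows.append(
            {
                "persona": persona,
                "instruction": persona_to_instruction(persona, generate, task),
            }
        )
    return rows
```

Passing `generate` in as a plain callable keeps the sketch independent of any particular model API: swap in a real client for actual synthesis, or a stub function for testing. The resulting rows could then serve as fine-tuning data under the CE objective noted above.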