arXiv

Principled Synthetic Data Enables the First Scaling Laws for LLMs in Recommendation

June 2, 2026 · Benyu Zhang, Qiang Zhang, Jianpeng Cheng, Hong-You Chen, Qifei Wang, Wei Sun, Shen Li, Jia Li, Jiahao Wu, Qunshu Zhang, Neeraj Bhatia, Xiangjun Fan, Hong Yan

Title: Structured Synthetic Data Facilitates Initial Scaling Laws for LLMs in Recommendation Systems

Abstract:

While Large Language Models (LLMs) offer significant potential for recommender systems, their advancement has been hindered by a lack of predictable scaling laws, which are essential tools for directing research efforts and allocating resources. We propose that this stagnation stems from the noise, bias, and incompleteness of the raw user interaction data used in previous continual pre-training (CPT) efforts. To address this, we present a new, multi-layered framework for generating high-quality synthetic data. This approach bypasses these data issues by constructing a curated, pedagogical curriculum for the LLM.

We provide compelling evidence of our curriculum’s effectiveness, demonstrating that standard sequential models trained on this principled synthetic data vastly outperform those trained on real-world data in downstream ranking tasks. Specifically, the SasRec model achieved a +130% improvement in recall@100, highlighting the data’s superior capacity for teaching generalizable user preference patterns. Leveraging these results, we empirically establish, for the first time, robust power-law scaling for an LLM continually pre-trained on our high-quality, recommendation-specific synthetic data. Our experiments show consistent and predictable reductions in perplexity across various synthetic data modalities. These discoveries lay the groundwork for reliably scaling LLM capabilities within the recommendation sector, marking a pivotal shift in research focus from addressing data deficiencies to utilizing high-quality, structured information.
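As a rough illustration of the two quantities named above, the following minimal sketch shows how a power-law trend might be fitted to perplexity measurements and how recall@k is commonly computed. It is not the authors' code. The function names `fit_power_law` and `recall_at_k`, the pure power-law form y = a · x^(-b) with no irreducible-loss term, and all numeric values are assumptions made for illustration; the numbers are generated from a known curve and are not results from the paper.

```python
import numpy as np


def fit_power_law(x, y):
    """Fit y ~ a * x**(-b) by least squares in log-log space.

    Returns (a, b). Assumes a pure power law with no constant offset.
    """
    slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(np.exp(intercept)), float(-slope)


def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of relevant items that appear in the top-k of a ranking."""
    relevant = set(relevant_items)
    if not relevant:
        return 0.0
    top_k = set(ranked_items[:k])
    return len(top_k & relevant) / len(relevant)


# Illustrative values only: perplexity generated from a known power law.
tokens = np.array([1e6, 1e7, 1e8, 1e9])
perplexity = 50.0 * tokens ** -0.1

a, b = fit_power_law(tokens, perplexity)
print(f"fitted a={a:.3f}, b={b:.3f}")

# One of the three relevant items appears in the top 2 of this toy ranking.
print(recall_at_k([3, 1, 2, 5], [1, 5, 9], k=2))
```

A straight line in log-log space is the usual quick check for power-law behavior; whether a constant offset is also needed depends on the measurements and cannot be determined from the abstract alone.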

Source: arXiv · Generated at: 2026-06-02 00:00:00 UTC