arXiv

Synthetic Data from Cross-Domain Events for Large-Scale Recommendation Systems

June 2, 2026 · Xiangyu Wang, Yawen He, Shivendra Pratap Singh, Han Huang, Mengtong Hu, Sharath Ciddu, Yi-Hsuan Hsieh, Erik Groving, Yi Ding, Jieming Di, Tony Wang, Min Yun, Xiaoyu Chen, Ling Leng, Rob Malkin

Title: Leveraging Synthetic Data Derived from Cross-Domain Events in Large-Scale Recommendation Systems

Abstract:

Large-scale recommendation systems operate across many domains, but they are often hindered by data sparsity and by noise in implicit feedback signals. Conventional methods typically address these issues with model-specific knowledge distillation that transfers information from source domains to a target domain. Inspired by the success of synthetic data generation for large language models (LLMs), we present SCALR (Synthetic Cross-domain Augmentation and Learning for Recommendation), a framework that creates synthetic user-item interaction events for a target recommendation domain from events observed in a source domain.

SCALR breaks cross-domain learning into two modular stages. In the first stage, observed user events from source domains are translated into synthetic target-domain events by treating event generation as a probability estimation problem: estimating the likelihood that a user will interact with an item in the target domain, given their historical interactions in the source domain. In the second stage, downstream models use these synthetic events as cross-domain learning objectives. This approach augments the target domain's training data independently of any specific model architecture. The method showed statistically significant improvements in online A/B tests on an industrial recommendation platform. To the best of our knowledge, this study is among the first to explicitly frame cross-domain event transfer as a synthetic data generation task for recommendation systems.
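
As a rough illustration of the two stages, the sketch below estimates the probability of a target-domain interaction given source-domain history and then uses the resulting synthetic events to augment a target-domain training set. The abstract does not say how SCALR estimates this probability or how downstream models consume the synthetic events, so everything here is an assumption made for illustration: the co-occurrence estimator, the averaging rule, the threshold, the sample weights, and all function and item names.

```python
from collections import Counter, defaultdict


def fit_transfer_probs(source_events, target_events):
    """Estimate P(target item | source item) from users seen in both domains.

    Assumption: a simple co-occurrence estimator stands in for whatever
    probability model the paper actually uses.
    """
    src_by_user = defaultdict(set)
    tgt_by_user = defaultdict(set)
    for user, item in source_events:
        src_by_user[user].add(item)
    for user, item in target_events:
        tgt_by_user[user].add(item)

    src_count = Counter()   # overlapping users who touched source item s
    pair_count = Counter()  # overlapping users who touched both s and t
    for user in src_by_user.keys() & tgt_by_user.keys():
        for s in src_by_user[user]:
            src_count[s] += 1
            for t in tgt_by_user[user]:
                pair_count[(s, t)] += 1
    return {(s, t): c / src_count[s] for (s, t), c in pair_count.items()}


def generate_synthetic_events(source_history, probs, threshold=0.5):
    """Stage 1 (hypothetical): emit (user, target_item, score) triples."""
    target_items = sorted({t for (_, t) in probs})
    synthetic = []
    for user, items in sorted(source_history.items()):
        if not items:
            continue
        for t in target_items:
            # Average the per-source-item conditionals over the user's history.
            score = sum(probs.get((s, t), 0.0) for s in items) / len(items)
            if score >= threshold:
                synthetic.append((user, t, score))
    return synthetic


def augment_training_set(target_events, synthetic_events):
    """Stage 2 (hypothetical): merge observed and synthetic events.

    Observed events get weight 1.0; synthetic events carry their estimated
    probability as a sample weight. Any downstream model can consume the rows.
    """
    rows = [(user, item, 1.0) for user, item in target_events]
    rows += [(user, item, weight) for user, item, weight in synthetic_events]
    return rows


# Toy data: u1 and u2 appear in both domains; u3 and u4 only in the source.
source_events = [("u1", "s_a"), ("u2", "s_a"), ("u2", "s_b"),
                 ("u3", "s_a"), ("u4", "s_b")]
target_events = [("u1", "t_x"), ("u2", "t_x"), ("u2", "t_y")]

probs = fit_transfer_probs(source_events, target_events)
cold_users = {"u3": {"s_a"}, "u4": {"s_b"}}
synthetic = generate_synthetic_events(cold_users, probs, threshold=0.75)
training_rows = augment_training_set(target_events, synthetic)
print(training_rows)
```

Because the second stage only appends weighted rows to the training set, the downstream model needs no architectural change, which mirrors the model-independence the abstract describes. In practice the estimator would likely be a learned model rather than raw co-occurrence counts.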
