Taiji: Pareto Optimal Policy Optimization with Semantics-IDs Trade-off for Industrial LLM-Enhanced Recommendation
The integration of large language models (LLMs) into recommender systems has emerged as a dominant trend within the industry. Nevertheless, aligning the semantic space of LLMs with the identifier (ID) space of recommenders through post-training techniques, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), continues to present significant hurdles. Current LLM4Rec approaches are primarily constrained by two critical issues: first, the challenge of quantifying and enhancing the quality of chain-of-thought (CoT) reasoning during SFT in open-domain recommendation contexts; and second, the failure to adequately balance the trade-off between LLM semantic rewards and recommendation preference rewards during RL alignment.
Addressing these obstacles, we introduce Taiji, an innovative LLM-as-Enhancer framework tailored for industrial-scale recommender systems. To surmount the SFT bottleneck, our method employs reverse-engineered reasoning alongside open-ended rejection sampling to synthesize high-quality, domain-specific CoT data. To address the complexities of RL alignment, we propose Pareto Optimal Policy Optimization (POPO). This mechanism dynamically calibrates cross-domain reward weights, theoretically securing an optimal equilibrium between the LLM’s semantic world knowledge and the collaborative ID features that reflect real-time user preferences.
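The abstract does not spell out how POPO computes its reward weights, so the sketch below is not the authors' algorithm. It only illustrates one well-known way to weight two competing objectives dynamically: the closed-form min-norm rule from multiple-gradient descent. All names here (`min_norm_weights`, `combine`, `g_sem`, `g_pref`) are hypothetical, and the two vectors stand in for policy gradients of a semantic reward and a preference reward.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def min_norm_weights(
    g_sem: Sequence[float], g_pref: Sequence[float]
) -> Tuple[float, float]:
    """Weights (w_sem, w_pref) on the simplex that minimize
    ||w_sem * g_sem + w_pref * g_pref||^2.

    Illustrative assumption: this is the two-objective min-norm rule,
    not the weighting scheme defined in the Taiji paper.
    """
    diff = [s - p for s, p in zip(g_sem, g_pref)]
    denom = dot(diff, diff)
    if denom == 0.0:
        # Identical gradients: every weighting gives the same direction.
        return 0.5, 0.5
    # Stationary point of the quadratic in alpha, then clipped to [0, 1].
    alpha = dot([p - s for s, p in zip(g_sem, g_pref)], g_pref) / denom
    alpha = min(1.0, max(0.0, alpha))
    return alpha, 1.0 - alpha


def combine(g_sem: Sequence[float], g_pref: Sequence[float]) -> List[float]:
    """Weighted update direction built from the two gradients."""
    w_sem, w_pref = min_norm_weights(g_sem, g_pref)
    return [w_sem * s + w_pref * p for s, p in zip(g_sem, g_pref)]
```

With orthogonal gradients of equal norm the rule splits the weight evenly; when both gradients point the same way it puts all weight on the shorter one. If the resulting combined vector is zero, no direction improves both objectives at once, which is the first-order condition for a Pareto-stationary point.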
The efficacy of Taiji is substantiated by comprehensive offline evaluations and online A/B testing. Since its deployment on Kuaishou’s advertising platform in May 2026, Taiji has been processing requests for more than 400 million users on a daily basis. The system has delivered substantial commercial returns, confirming its robust scalability in web-scale operational environments.
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC



