arXiv

LLMs Need Encoders for Semantic IDs Too

June 2, 2026 · Xiangyi Chen, Zelun Wang, Xinyi Li, Yi-Ping Hsu, Jaewon Yang, Jiajing Xu

arXiv:2606.00324v1
Announcement Type: Cross

Abstract:

In multimodal large language models (LLMs), dedicated encoders are essential for connecting non-textual modalities (such as vision encoders for images or depth models for audio codec tokens), since raw token embeddings fail to capture modality-specific structure. This paper posits that Semantic IDs (SIDs), the hierarchical codes used in generative recommendation systems, are another such modality: the meaning of a token at a given SID level depends on the prefix that precedes it. However, current approaches typically add SID tokens directly to the vocabulary and rely on training alone to learn these context-dependent meanings from scratch.

To address this, we introduce PrefixMem, a lightweight SID encoder that utilizes prefix n-gram memory tables. This architecture delivers structured, prefix-conditioned representations to the LLM at SID token positions. Similar to vision encoders in multimodal architectures, PrefixMem can undergo independent pre-training before being integrated into any LLM for joint fine-tuning.
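The abstract does not spell out PrefixMem's internals, so the following is only a minimal sketch of what a prefix n-gram memory lookup might look like. It assumes one row per (level, n-gram) pair, randomly initialised in place of trained weights, with the rows for all n-gram orders ending at a position summed into that position's vector. The class name `PrefixNgramMemory`, its parameters, and the summation rule are hypothetical and are not taken from the paper.

```python
import random


class PrefixNgramMemory:
    """Hypothetical sketch of a prefix n-gram memory (not the paper's code)."""

    def __init__(self, dim=8, max_order=3, seed=0):
        self.dim = dim              # size of each memory vector
        self.max_order = max_order  # longest n-gram that is looked up
        self._rng = random.Random(seed)
        self._rows = {}             # (level, n-gram) -> vector

    def _row(self, level, ngram):
        # Rows are created on first use; in a trained encoder they would
        # be learned parameters rather than random draws (assumption).
        key = (level, ngram)
        if key not in self._rows:
            self._rows[key] = [self._rng.gauss(0.0, 0.02) for _ in range(self.dim)]
        return self._rows[key]

    def encode(self, sid):
        """Return one vector per SID level, conditioned on the tokens before it."""
        out = []
        for level in range(len(sid)):
            vec = [0.0] * self.dim
            # Sum the rows of every n-gram that ends at this level.
            for n in range(1, min(self.max_order, level + 1) + 1):
                ngram = tuple(sid[level - n + 1 : level + 1])
                vec = [v + r for v, r in zip(vec, self._row(level, ngram))]
            out.append(vec)
        return out


mem = PrefixNgramMemory()
a = mem.encode([3, 7, 5])  # three-level SID
b = mem.encode([9, 7, 5])  # same last two tokens, different first token
c = mem.encode([3, 7, 8])  # same first two tokens, different last token
```

In this sketch `a[2]` and `b[2]` differ even though both SIDs end in token 5, because the full-prefix 3-gram differs, while `a[:2]` and `c[:2]` are identical because the two SIDs share their first two tokens. How such vectors would be combined with the LLM's own token embeddings at SID positions (added, concatenated, or substituted) is left open here.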

Our evaluation, conducted on large-scale data from Pinterest across various LLM families, demonstrates that PrefixMem improves deepest-level SID accuracy by as much as 46% (relative) and full-SID retrieval recall by up to 22% (relative) under matched training compute. The encoder’s advantages are particularly pronounced in challenging cases where greedy decoding falls short, yielding relative accuracy improvements of up to 77%. These findings confirm that, much like other non-language modalities, SID tokens benefit significantly from a dedicated encoder.
