LLMs Need Encoders for Semantic IDs Too
arXiv:2606.00324v1
Announcement Type: Cross
Abstract:
In multimodal large language models (LLMs), dedicated encoders are essential for connecting non-textual modalities (such as vision encoders for images or depth models for audio codec tokens), since raw token embeddings fail to capture modality-specific structures. This paper posits that Semantic IDs (SIDs), which serve as hierarchical codes in generative recommendation systems, represent another distinct modality. In this context, the meaning of a SID level token is contingent upon its prefix context. However, current approaches typically incorporate SID tokens directly into the vocabulary, relying solely on training to infer these context-dependent meanings from the ground up.
To address this, we introduce PrefixMem, a lightweight SID encoder that utilizes prefix n-gram memory tables. This architecture delivers structured, prefix-conditioned representations to the LLM at SID token positions. Similar to vision encoders in multimodal architectures, PrefixMem can undergo independent pre-training before being integrated into any LLM for joint fine-tuning.
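The abstract does not spell out how the memory tables are built, so the following is only a minimal sketch of the general idea of a prefix-conditioned lookup, not the authors' implementation. Everything in it is an assumption made for illustration: the class name `PrefixNgramMemory`, the parameters `dim` and `max_n`, the use of an exact-match dictionary in place of a fixed-size table, the random initialization standing in for learned parameters, and the choice to sum the rows for all n-gram orders.

```python
import random


class PrefixNgramMemory:
    """Hypothetical prefix n-gram lookup; names and design are assumptions."""

    def __init__(self, dim=8, max_n=3, seed=0):
        self.dim = dim        # size of each output vector (assumed)
        self.max_n = max_n    # longest n-gram order to look up (assumed)
        self._rng = random.Random(seed)
        self._rows = {}       # (level, n-gram) -> vector; stands in for learned tables

    def _row(self, level, ngram):
        # Lazily create a row the first time an n-gram is seen.
        # In a trained model these rows would be learned, not random.
        key = (level, ngram)
        if key not in self._rows:
            self._rows[key] = [self._rng.gauss(0.0, 0.02) for _ in range(self.dim)]
        return self._rows[key]

    def encode(self, sid):
        """Return one vector per SID level, conditioned on the prefix so far."""
        out = []
        for level in range(len(sid)):
            vec = [0.0] * self.dim
            # Sum the rows for every n-gram that ends at this level.
            for n in range(1, min(self.max_n, level + 1) + 1):
                ngram = tuple(sid[level - n + 1 : level + 1])
                row = self._row(level, ngram)
                vec = [v + r for v, r in zip(vec, row)]
            out.append(vec)
        return out


mem = PrefixNgramMemory()
a = mem.encode([5, 7, 9])
b = mem.encode([6, 7, 9])
# Token 7 at level 1 receives a different vector under prefix 5 than under prefix 6.
print(a[1] != b[1])
```

In this sketch, two SIDs that share a prefix get identical vectors up to the point where they diverge, while the same token id under different prefixes gets different vectors. That is the prefix-dependent behavior the abstract says plain vocabulary embeddings must otherwise learn from scratch.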
Our evaluation, conducted on large-scale data from Pinterest across various LLM families, demonstrates that PrefixMem enhances deepest-level SID accuracy by as much as 46% (relative) and boosts full-SID retrieval recall by up to 22% (relative) while maintaining matched training compute. The encoder's advantages are particularly pronounced in challenging cases where greedy decoding falls short, yielding relative accuracy improvements of up to 77%. These findings confirm that, much like other non-language modalities, SID tokens derive significant benefit from the use of a dedicated encoder.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




