From Tokens to Concepts: Leveraging SAE for SPLADE
Title: Advancing from Token-Based to Concept-Based Retrieval via SAE in SPLADE
Abstract:
Learned sparse information retrieval (IR) models such as SPLADE are known for striking a strong balance between efficiency and effectiveness. However, their dependence on the backbone vocabulary can limit performance because of polysemy and synonymy, and it complicates multilingual and multimodal applications. To address these limitations, we propose replacing the backbone vocabulary with a latent space of semantic concepts learned by Sparse Auto-Encoders (SAE). We study how well the two approaches fit together, describe several training strategies, and compare our SAE-SPLADE model against conventional SPLADE models. Our experiments show that SAE-SPLADE achieves retrieval performance comparable to SPLADE on both in-domain and out-of-domain tasks while improving efficiency.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
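
As a rough illustration of the idea in the abstract, the sketch below applies SPLADE-style pooling (a maximum over tokens of log(1 + activation)) to SAE latent activations instead of vocabulary logits. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`sae_encode`, `splade_pool`, `score`), the ReLU SAE encoder, the toy weights, and the choice to feed SAE activations directly into the pooling step are all hypothetical, since the abstract does not specify these details.

```python
import math

def sae_encode(hidden, w_enc, b_enc):
    """Assumed SAE encoder: ReLU(W_enc @ h + b_enc) for one token's hidden state."""
    return [
        max(0.0, sum(w * h for w, h in zip(row, hidden)) + b)
        for row, b in zip(w_enc, b_enc)
    ]

def splade_pool(token_activations):
    """SPLADE-style pooling: per concept, max over tokens of log(1 + activation).

    Returns a sparse {concept_id: weight} mapping with zero weights dropped.
    """
    n_concepts = len(token_activations[0])
    pooled = [
        max(math.log1p(tok[j]) for tok in token_activations)
        for j in range(n_concepts)
    ]
    return {j: v for j, v in enumerate(pooled) if v > 0.0}

def score(query_rep, doc_rep):
    """Dot product between two sparse concept representations."""
    return sum(v * doc_rep[j] for j, v in query_rep.items() if j in doc_rep)

# Toy setup (hypothetical values): 2-dimensional hidden states, 3 SAE concepts.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]

doc_hidden = [[1.0, 0.0], [0.0, 2.0]]   # two document tokens
query_hidden = [[0.0, 1.0]]             # one query token

doc_rep = splade_pool([sae_encode(h, w_enc, b_enc) for h in doc_hidden])
query_rep = splade_pool([sae_encode(h, w_enc, b_enc) for h in query_hidden])
relevance = score(query_rep, doc_rep)
```

In this toy example the document activates concepts 0 and 1 and the query activates only concept 1, so the score comes entirely from the shared concept. The representation is indexed by concept identifiers rather than vocabulary tokens, which is the substitution the abstract describes.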





