DSL-Topic: Improving Topic Modeling by Distilling Soft Labels from Language Models
Title: Enhancing Topic Modeling Through Language Model Distillation of Soft Labels
Abstract: Conventional neural topic models generally rely on optimizing the reconstruction of Bag-of-Words (BoW) representations, a process that frequently neglects contextual nuances and faces challenges related to data sparsity. To address these limitations, this study presents a new training framework for topic models known as Distilling Soft Labels (DSL) from Language Models (LMs). By projecting next-token probabilities, conditioned on a specific prompt, onto a predefined vocabulary, the method generates contextually rich reconstruction signals. The topic models are then trained to reconstruct these soft labels using hidden states derived from the LM. This approach yields topics that better reflect the corpus’s thematic structure. Comprehensive experiments show that DSL significantly improves both topic coherence and assignment accuracy compared to current baseline methods. Furthermore, we propose a retrieval-based evaluation metric, which indicates that our method substantially outperforms existing techniques at locating semantically related documents, underscoring its value for retrieval-oriented applications.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
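
The soft-label idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each topic-model vocabulary word maps to exactly one LM token id, that the projection is a gather followed by renormalization, and that the reconstruction objective is a cross-entropy. Random logits stand in for a real LM's next-token output, and a fixed topic mixture stands in for an encoder over LM hidden states. All function and variable names are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the LM vocabulary.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def soft_label(next_token_logits, vocab_token_ids):
    # Project LM next-token probabilities onto the topic-model vocabulary
    # (assumption: one LM token id per vocabulary word), then renormalize.
    probs = softmax(next_token_logits)
    projected = probs[vocab_token_ids]
    return projected / projected.sum()

def reconstruction_loss(target, topic_mix, topic_word):
    # Cross-entropy between the soft label and the topic model's
    # reconstructed word distribution (assumed loss, for illustration).
    recon = topic_mix @ topic_word
    return float(-np.sum(target * np.log(recon + 1e-12)))

rng = np.random.default_rng(0)
logits = rng.normal(size=50)                    # stand-in for LM next-token logits
vocab_ids = np.array([3, 7, 11, 19, 42])        # hypothetical word-to-token mapping
target = soft_label(logits, vocab_ids)          # soft label over 5 vocabulary words

topic_word = rng.dirichlet(np.ones(5), size=3)  # 3 topics over 5 words
topic_mix = np.array([0.6, 0.3, 0.1])           # stand-in for an encoder's output
loss = reconstruction_loss(target, topic_mix, topic_word)
```

In a real pipeline the logits would come from an LM conditioned on the document and a prompt, and the topic mixture would be predicted from the LM's hidden states rather than fixed by hand.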





