arXiv

NILC: Discovering New Intents with LLM-assisted Clustering

June 2, 2026 · Hongtao Wang, Renchi Yang, Wenqing Lin

Abstract:

New Intent Discovery (NID) is a critical capability for practical dialogue systems, enabling the recognition of both established and novel intents from unlabeled user inputs. Current approaches to NID predominantly rely on a two-stage cascaded architecture. The first stage encodes utterances into informative text embeddings; the second stage groups similar embeddings into clusters, each representing an intent, typically using algorithms such as K-Means. However, this sequential pipeline has no feedback loop between the two stages, which prevents them from refining each other. Additionally, clustering based solely on embeddings tends to miss subtle textual nuances, resulting in suboptimal accuracy.
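
To make the two-stage cascade concrete, the following is a minimal sketch rather than any system's actual implementation. It assumes a toy bag-of-words `embed` function in place of a trained encoder and a plain K-Means with farthest-first initialization; the function names, the example utterances, and the initialization choice are all illustrative assumptions. Note that the clustering stage only ever sees the vectors, never the text, which is the one-way flow criticized above.

```python
import math
from collections import Counter


def embed(utterances):
    """Stage 1 stand-in: bag-of-words count vectors (assumption; real systems use a trained encoder)."""
    vocab = sorted({w for u in utterances for w in u.lower().split()})
    vectors = []
    for u in utterances:
        counts = Counter(u.lower().split())
        vectors.append([float(counts[w]) for w in vocab])
    return vectors


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(vectors, k, iters=20):
    """Stage 2: plain K-Means with deterministic farthest-first initialization (assumption)."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Pick the point farthest from every centroid chosen so far.
        centroids.append(max(vectors, key=lambda v: min(dist(v, c) for c in centroids)))
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid per vector.
        labels = [min(range(k), key=lambda j: dist(v, centroids[j])) for v in vectors]
        # Update step: recompute each non-empty cluster's centroid.
        for j in range(k):
            members = [v for v, l in zip(vectors, labels) if l == j]
            if members:
                centroids[j] = mean(members)
    return labels


utterances = [
    "check my account balance",
    "show my account balance",
    "book a flight to paris",
    "book a flight to rome",
]
labels = kmeans(embed(utterances), k=2)
```

In this toy run the two banking utterances share one label and the two travel utterances share the other. Nothing learned during clustering flows back into `embed`.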

To address these limitations, we introduce NILC, a novel clustering framework designed specifically for efficient NID. NILC employs an iterative workflow that leverages Large Language Models (LLMs) to refine both cluster centroids and the embeddings of uncertain utterances, thereby dynamically updating clustering assignments. First, NILC uses LLMs to generate additional semantic centroids, which enrich the contextual semantics beyond what the standard Euclidean centroids of the embeddings capture. Second, the system identifies "hard samples" within clusters (such as ambiguous or terse utterances) and uses LLMs to rewrite them, facilitating subsequent corrections to cluster assignments. Furthermore, to enhance performance in semi-supervised scenarios, we incorporate supervision signals through non-trivial techniques, including seeding and soft must-links. Comprehensive experiments demonstrate that NILC significantly outperforms several recent baselines across six diverse benchmark datasets, in both unsupervised and semi-supervised settings.
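
The iterative workflow described above can be illustrated with a rough sketch. This is not the authors' algorithm: the abstract does not specify how semantic centroids are combined with Euclidean ones or how hard samples are detected, so the sketch assumes a simple average of the two centroids and a distance-margin test. The LLM calls `llm_describe` and `llm_rewrite` are hypothetical stubs, and `VOCAB`, `margin_threshold`, and the example data are invented for illustration.

```python
import math
from collections import Counter

# Fixed toy vocabulary so rewritten text embeds into the same space (assumption).
VOCAB = ["a", "account", "balance", "book", "check", "flight",
         "my", "paris", "rome", "show", "to"]


def embed(text):
    """Toy bag-of-words embedding; a real system would use a trained encoder."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in VOCAB]


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]


def llm_describe(cluster_texts):
    """Hypothetical LLM call: summarize a cluster's intent. Stubbed as its 3 most frequent words."""
    counts = Counter(w for t in cluster_texts for w in t.lower().split())
    return " ".join(w for w, _ in counts.most_common(3))


def llm_rewrite(text):
    """Hypothetical LLM call: expand a terse utterance. Stubbed with a canned lookup."""
    return {"rome": "book a flight to rome"}.get(text, text)


def nearest(vec, centroids):
    """Return (index of nearest centroid, distance margin to the second nearest). Needs k >= 2."""
    order = sorted(range(len(centroids)), key=lambda j: dist(vec, centroids[j]))
    margin = dist(vec, centroids[order[1]]) - dist(vec, centroids[order[0]])
    return order[0], margin


def refine(texts, centroids, rounds=3, margin_threshold=0.5):
    texts = list(texts)
    vectors = [embed(t) for t in texts]
    rewritten = set()  # rewrite each utterance at most once
    for _ in range(rounds):
        labels = [nearest(v, centroids)[0] for v in vectors]
        new_centroids = []
        for j in range(len(centroids)):
            members = [i for i, l in enumerate(labels) if l == j]
            if not members:
                new_centroids.append(centroids[j])
                continue
            euclid = mean([vectors[i] for i in members])
            semantic = embed(llm_describe([texts[i] for i in members]))
            # Assumption: combine the two centroids by simple averaging.
            new_centroids.append(mean([euclid, semantic]))
        centroids = new_centroids
        # Assumption: a sample is "hard" when its two nearest centroids are almost equidistant.
        for i, v in enumerate(vectors):
            if i not in rewritten and nearest(v, centroids)[1] < margin_threshold:
                texts[i] = llm_rewrite(texts[i])
                vectors[i] = embed(texts[i])
                rewritten.add(i)
    labels = [nearest(v, centroids)[0] for v in vectors]
    return labels, texts


utterances = [
    "check my account balance",
    "show my account balance",
    "book a flight to paris",
    "book a flight to rome",
    "rome",  # terse utterance
]
initial_centroids = [embed(utterances[0]), embed(utterances[2])]
labels, texts = refine(utterances, initial_centroids)
```

In this toy setup the terse utterance "rome" starts out marginally closer to the banking centroid, is flagged by the margin test, and joins the travel cluster once the stub rewrites it. Unlike the one-way cascade, both the centroids and the embeddings of uncertain utterances change between rounds.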
