arXiv

When Does Complexity Conditioning Help a Frozen Sentence Embedding? A Controlled Study of Per-Sentence and Pair-Level Difficulty Adaptation

June 3, 2026 · Suhwan Hwang

Title: Re-evaluating the Utility of Complexity Conditioning for Frozen Sentence Embeddings: A Rigorous Analysis of Per-Sentence versus Pair-Level Difficulty Adaptation

Abstract

It is widely assumed that sentence embedding models should dynamically adjust their representations based on the complexity of the input data. To rigorously examine this hypothesis, we conducted a controlled study utilizing multiple random seeds. Our experimental setup involved attaching a lightweight post-encoder adapter to a frozen Qwen3-Embedding-0.6B encoder, which interacted exclusively with the model’s final pooled embedding. This architecture was tested across four benchmarks focused on paraphrase detection and semantic similarity: PAWS, MRPC, QQP, and STS-B.

Our findings indicate that the straightforward version of this idea is ineffective. Surface-level complexity measures for individual sentences show almost no correlation with the errors of the frozen baseline (Pearson coefficient ≈ 0.05). Consequently, conditioning the adapter on these measures offers no performance benefit over constant or shuffled controls, and it actually degrades a saturated baseline. Furthermore, even when the target variable is aligned with a non-circular, pair-specific difficulty metric, a per-sentence gating mechanism fails to capture difficulty, because difficulty is fundamentally a property of the sentence pair as a whole rather than of either sentence in isolation.
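The correlation check described above can be illustrated with a minimal sketch. It is not the paper's diagnostic: the complexity proxy (`surface_complexity`, a whitespace token count), the averaging of the two sentences' scores, and the function names are all assumptions made for illustration. The idea is simply to compare the Pearson correlation between a candidate signal and the frozen baseline's per-pair errors against the same correlation for a shuffled copy of that signal.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant signal carries no usable information
    return cov / math.sqrt(vx * vy)


def surface_complexity(sentence):
    """Hypothetical surface proxy: number of whitespace-separated tokens."""
    return float(len(sentence.split()))


def complexity_diagnostic(pairs, baseline_errors, seed=0):
    """Compare a complexity signal with a shuffled control.

    pairs: list of (sentence_a, sentence_b) tuples.
    baseline_errors: per-pair error of the frozen baseline (assumed given).
    """
    # Assumption: a pair's score is the mean of its two sentence scores.
    scores = [
        (surface_complexity(a) + surface_complexity(b)) / 2 for a, b in pairs
    ]
    shuffled = scores[:]
    random.Random(seed).shuffle(shuffled)
    return {
        "signal": pearson(scores, baseline_errors),
        "shuffled_control": pearson(shuffled, baseline_errors),
    }
```

If the `signal` value is close to zero and no larger in magnitude than the `shuffled_control` value, the signal is unlikely to give an adapter anything to exploit, which matches the situation the abstract reports for per-sentence surface measures.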

In contrast, we demonstrate that a small residual module, gated by a difficulty signal derived from a held-out cross-encoder, delivers consistent improvements on the larger and more nuanced tasks. It raised Spearman correlation by 0.022 on STS-B and by 0.037 on QQP, while maintaining stability relative to the frozen baseline across all experimental seeds. Because this effective method operates on sentence pairs rather than isolated inputs, the resulting system is more accurately described as a lightweight re-ranking mechanism applied to pre-cached frozen embeddings than as a substitute for single-vector embeddings. We do not claim state-of-the-art status for this method. Instead, our primary contribution is a detailed, controlled analysis of the specific conditions under which difficulty-aware adaptation helps and those under which it does not, alongside a pre-training diagnostic tool designed to predict the potential for improvement.
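To make the pair-level idea concrete, here is a hedged sketch of a difficulty-gated residual score computed over cached embeddings. The abstract does not specify the module's form, so everything here is an assumption: the linear residual, the pair features (elementwise product and absolute difference), the clamping of the gate to [0, 1], and the names `pair_features` and `gated_residual_score`. The `difficulty` argument stands in for the scalar that a held-out cross-encoder would supply, and `weights` stands in for parameters that would be learned.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pair_features(u, v):
    """Symmetric pair features: elementwise product, then absolute difference."""
    return [a * b for a, b in zip(u, v)] + [abs(a - b) for a, b in zip(u, v)]


def gated_residual_score(u, v, difficulty, weights, bias=0.0):
    """Frozen cosine score plus a difficulty-gated linear correction.

    u, v: cached frozen embeddings of the two sentences.
    difficulty: hypothetical pair-level scalar, clamped to [0, 1].
    weights: one weight per pair feature (length 2 * len(u)).
    """
    gate = min(1.0, max(0.0, difficulty))
    residual = sum(w * f for w, f in zip(weights, pair_features(u, v))) + bias
    return cosine(u, v) + gate * residual
```

In this sketch a gate of zero returns the frozen cosine score unchanged, so the correction only acts on pairs flagged as difficult. Because the function needs both embeddings and a pair-level signal, it scores pairs rather than producing a new single-vector embedding, which is the sense in which the abstract calls the system a re-ranker.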

Source: arXiv Generated at: 2026-06-03 00:00:00 UTC