arXiv

$\mathbb{R}^{2k}$ is Theoretically Large Enough for Embedding-based Top-$k$ Retrieval

June 3, 2026 · Zihao Wang, Hang Yin, Lihui Liu, Hanghang Tong, Yangqiu Song, Ginny Wong, Simon See

Title: $\mathbb{R}^{2k}$ Offers Sufficient Theoretical Capacity for Embedding-Based Top-$k$ Retrieval

Abstract: This study investigates the Minimal Embeddable Dimension (MED), defined as the smallest vector space dimension in which $m$ object vectors can be arranged so that every subset of at most $k$ objects can be retrieved exactly by comparing scores. We show that the MED is $\Theta(k)$, independent of the total number of objects $m$, under inner product, Euclidean distance, and cosine similarity scoring. We also examine the Robust MED (RMED), a setting that requires unit-norm vectors and a margin of $\epsilon$ between scores. We establish an $m$-dependent feasibility ceiling for the margin, $\epsilon_\star(m,k)=m/\sqrt{k(m-1)(m-k)}$, which converges to $1/\sqrt{k}$ in the regime $m \gg k$. In addition, a construction based on Gaussian centroids gives an upper bound on the dimension of robust witnesses within the feasible margin range. The theoretical claims are supported by numerical simulations on synthetic top-$2$ retrieval tasks that use cyclic polytopes and centroid query optimization. Experiments on the LIMIT and LIMIT-small datasets further show that simple embedding-based retrieval baselines can surpass the previously reported single-vector LLM embedding baseline, possibly as a result of overfitting. Taken together, the theoretical and empirical results rule out insufficient exact geometric capacity as a limiting factor.
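The $\mathbb{R}^{2k}$ claim for inner-product scoring can be illustrated with the classical moment-curve (cyclic polytope) construction. The following is a minimal sketch under stated assumptions, not necessarily the authors' procedure: objects are placed at $(t, t^2, \dots, t^{2k})$, and the query for a subset $S$ is read off from the coefficients of $p(t) = -\prod_{i \in S}(t - t_i)^2$, which is zero on $S$ and strictly negative elsewhere. The helper names `moment_curve`, `query_for_subset`, and `top_k` are hypothetical, as is the choice of $m = 8$ evenly spaced parameters.

```python
import itertools
import numpy as np
from numpy.polynomial import Polynomial


def moment_curve(ts, dim):
    """Row i is the object vector (t_i, t_i^2, ..., t_i^dim)."""
    ts = np.asarray(ts, dtype=float)
    return np.stack([ts ** j for j in range(1, dim + 1)], axis=1)


def query_for_subset(ts, subset, dim):
    """Query whose inner-product scores peak exactly on `subset`.

    p(t) = -prod_{i in subset} (t - t_i)^2 has degree 2*len(subset) <= dim.
    Dropping its constant term shifts every score by the same amount,
    so the ranking induced by <query, x(t)> matches the ranking by p(t).
    """
    p = Polynomial([-1.0])
    for i in subset:
        p = p * Polynomial([-ts[i], 1.0]) ** 2
    coef = np.zeros(dim + 1)
    coef[: len(p.coef)] = p.coef  # pad with zeros when len(subset) < k
    return coef[1:]


def top_k(objects, query, k):
    """Indices of the k highest inner-product scores."""
    scores = objects @ query
    return set(np.argsort(-scores)[:k].tolist())


# Assumed toy setting: m = 8 objects, top-2 retrieval, dimension 2k = 4.
m, k = 8, 2
ts = np.linspace(-1.0, 1.0, m)
objects = moment_curve(ts, 2 * k)

# Check every size-k subset is recovered exactly as the top-k set.
all_recovered = all(
    top_k(objects, query_for_subset(ts, s, 2 * k), k) == set(s)
    for s in itertools.combinations(range(m), k)
)
print(all_recovered)
```

The dimension here depends only on $k$: increasing `m` adds more points on the same curve without changing `2 * k`, although the score gaps shrink as the parameters crowd together, which is the kind of effect the robust (margin-based) analysis addresses.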

Source: arXiv · Generated at: 2026-06-03 00:00:00 UTC