On Out-of-sample Embedding in UMAP
Title: Addressing Out-of-Sample Embedding Challenges in UMAP
Abstract: Neighbor embedding techniques reveal relationships within high-dimensional datasets by constructing corresponding graph structures in lower-dimensional spaces. Among these methods, Uniform Manifold Approximation and Projection (UMAP) is widely used; it draws on algebraic topology to align distances between the original and projected spaces. Although UMAP works well across many data types, it struggles to integrate out-of-sample points into an existing mapping: new points tend to land at the edges of existing clusters rather than in the cluster interiors alongside their correlated neighbors. This study addresses this "repulsion effect" by optimizing pairwise interactions within the initial k-nearest-neighbor graph. We also show that parametric UMAP yields better embeddings than the non-parametric variant, and that the gap widens as data complexity increases, for example with medical images. In addition, parametric UMAP naturally alleviates the repulsion issue. We compare the UMAP variants by measuring trustworthiness, by using nearest-neighbor classifiers, and by examining the attractive and repulsive forces in the resulting embeddings.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
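The abstract names trustworthiness as one of its evaluation measures. As a rough illustration of what that score captures, the sketch below computes the standard trustworthiness definition in plain Python: it penalizes points that appear among the k nearest neighbors of a point in the embedding but not in the original space, weighted by how far down the original ranking they sit. This is not the paper's evaluation code; the function names `trustworthiness` and `_ranked_neighbors`, the Euclidean metric, and the brute-force neighbor search are assumptions made for this example.

```python
import math


def _ranked_neighbors(points, i):
    """Indices of all other points, sorted by Euclidean distance from points[i]."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))


def trustworthiness(high, low, k):
    """Trustworthiness T(k) in [0, 1]; 1.0 means the embedding adds no false neighbors.

    high: points in the original space, low: the same points in the embedding.
    (Hypothetical helper for illustration, brute force, O(n^2 log n).)
    """
    n = len(high)
    if len(low) != n:
        raise ValueError("high and low must contain the same number of points")
    if not 0 < k < n / 2:
        raise ValueError("k must satisfy 0 < k < n / 2")

    penalty = 0
    for i in range(n):
        order_high = _ranked_neighbors(high, i)
        # Rank 1 is the closest point to i in the original space.
        rank_high = {j: r for r, j in enumerate(order_high, start=1)}
        true_neighbors = set(order_high[:k])
        for j in _ranked_neighbors(low, i)[:k]:
            if j not in true_neighbors:
                # j is a neighbor only in the embedding: penalize by its rank excess.
                penalty += rank_high[j] - k

    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))
```

For out-of-sample evaluation, one would typically pass the held-out points together with their embedded positions. For real datasets, scikit-learn's `sklearn.manifold.trustworthiness` computes the same kind of score far more efficiently than this brute-force sketch.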


