Self-Conditioned Positional HNSW for Overlap-Aware Retrieval in Chunked-Document RAG Systems: Method and Industrial Evidence-Quality Audit
Title: Enhancing Overlap-Aware Retrieval in Chunked-Document RAG Systems via Self-Conditioned Positional HNSW: Methodology and Industrial-Grade Audit
Abstract:
Retrieval-augmented generation (RAG) architectures frequently rely on chunked-document retrieval, in which texts are segmented into overlapping units, embedded, and indexed with approximate nearest-neighbor structures such as hierarchical navigable small world (HNSW) graphs. Overlapping chunks improve coverage at chunk boundaries, but they introduce a specific operational flaw: top-k search results often contain adjacent or near-adjacent segments that duplicate evidence and waste the limited prompt budget.
To address this, we introduce Self-Conditioned Positional HNSW (SCP-HNSW), a lightweight enhancement that incorporates a low-dimensional positional code into chunk embeddings and uses a two-pass query mechanism to compute and apply a query-specific document-position prior. SCP-HNSW leaves standard HNSW graph construction and traversal unchanged; overlap is controlled by an auditable selector, applied at final context assembly, that enforces a minimum index gap between retrieved items.
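As an illustration of the minimum-index-gap selector described above, the following is a minimal sketch and not the authors' implementation. It assumes that each retrieved hit carries a document identifier, a chunk index, and a similarity score where higher is better, and that the gap is measured as the absolute difference between chunk indices within the same document. The names `Hit`, `select_with_min_gap`, `k`, and `min_gap` are hypothetical. The positional code and the two-pass position prior are not modeled here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Hit:
    doc_id: str        # hypothetical: identifier of the source document
    chunk_index: int   # hypothetical: position of the chunk within its document
    score: float       # assumed similarity score, higher is better


def select_with_min_gap(
    hits: List[Hit], k: int, min_gap: int
) -> Tuple[List[Hit], List[Tuple[Hit, str, Optional[Hit]]]]:
    """Greedily keep the best-scoring hits whose chunk index is at least
    `min_gap` away from every already-kept hit of the same document.

    Returns the kept hits plus an audit log of (hit, decision, conflicting_hit).
    """
    selected: List[Hit] = []
    audit_log: List[Tuple[Hit, str, Optional[Hit]]] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if len(selected) >= k:
            break
        # Find an already-kept chunk from the same document that is too close.
        conflict = next(
            (
                s
                for s in selected
                if s.doc_id == hit.doc_id
                and abs(s.chunk_index - hit.chunk_index) < min_gap
            ),
            None,
        )
        if conflict is None:
            selected.append(hit)
            audit_log.append((hit, "kept", None))
        else:
            audit_log.append((hit, "skipped", conflict))
    return selected, audit_log


if __name__ == "__main__":
    candidates = [
        Hit("a", 3, 0.90),
        Hit("a", 4, 0.88),  # adjacent to chunk 3 of document "a"
        Hit("b", 4, 0.85),
        Hit("a", 7, 0.80),
    ]
    kept, log = select_with_min_gap(candidates, k=3, min_gap=2)
    for hit, decision, conflict in log:
        print(hit, decision, conflict)
```

In this sketch the audit log records which earlier selection caused each skip, which is one simple way to make the selector's decisions reviewable after the fact. The greedy, score-ordered policy is an assumption; other tie-breaking or per-document policies would fit the same description.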
Furthermore, we incorporate industrial review artifacts to evaluate the quality of generated evidence. The data comprise a 770-review text-evidence audit, of which 318 reviews are fully labeled, and a 70-case OCR audit comprising 350 ratings. In the text audit, 574 of the 770 projected reviews received a rating of 3/5 and only 39 fell in the 1-2 range; narrative reviewer details were also considerably more prevalent than structured issue flags. In the OCR audit, slice-level pass rates ranged from 95% for clean chat screenshots down to 45% for handwritten or blurry images, with moderate to strong agreement among ratings. These findings motivate overlap-aware, audit-friendly RAG retrieval and identify the controlled retrieval ablations still required to substantiate causal performance claims.
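The audit counts reported above can be turned into proportions with simple arithmetic. The sketch below uses only the figures stated in the abstract; the variable names are hypothetical, and the ratings-per-case value is an average that assumes nothing about how ratings were actually distributed across cases. The abstract does not say how the remaining text-audit reviews were rated, so they are reported only as a count.

```python
# Counts taken from the abstract.
total_reviews = 770      # projected text-evidence reviews
fully_labeled = 318      # reviews with full labels
rated_3 = 574            # reviews rated 3/5
rated_1_to_2 = 39        # reviews rated 1 or 2
ocr_cases = 70
ocr_ratings = 350

share_rated_3 = rated_3 / total_reviews
share_low = rated_1_to_2 / total_reviews
share_labeled = fully_labeled / total_reviews

# Reviews not accounted for by the two reported rating bands;
# their ratings are not stated in the abstract.
unattributed = total_reviews - rated_3 - rated_1_to_2

# Average number of ratings per OCR case (an average only).
avg_ratings_per_case = ocr_ratings / ocr_cases

print(f"rated 3/5:        {share_rated_3:.1%}")
print(f"rated 1-2:        {share_low:.1%}")
print(f"fully labeled:    {share_labeled:.1%}")
print(f"unattributed:     {unattributed} reviews")
print(f"ratings per case: {avg_ratings_per_case:.1f}")
```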
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




