arXiv

Locality Does Not Imply Reachability: Boundary Repair in Block-Sparse Causal Attention

Abstract:

Sparse causal attention is typically justified by sequence locality: nearby tokens should remain readily accessible, while distant ones can be dropped to reduce computational cost. This study shows that sequence locality does not guarantee reachability in the attention graph. In architectures that use fixed block causal attention, adjacent tokens on opposite sides of a block boundary can remain disconnected in the attention graph at every layer depth.

We characterize this "boundary artifact" using structural dependency sets. Our analysis shows that if every attention layer uses the same fixed block causal mask and all other operations are positionwise, a target representation can depend only on tokens within its own block prefix. On a constructed K-way boundary-copy distribution, this yields an architecture-level boundary-copy separation: top-1 accuracy is bounded above by 1/K and expected cross-entropy is bounded below by log K.
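The dependency-set argument can be illustrated with a small sketch. It assumes a toy setting that is not taken from the paper: 8 positions, block size 4, and a mask in which target t may attend to source s only if s <= t and both lie in the same block. The names `block_causal_mask` and `dependency_sets` are hypothetical helpers chosen for illustration. The sketch propagates dependency sets through identical layers and shows that the first token of a block never gains access to the token immediately before it, however many layers are stacked.

```python
def block_causal_mask(n, block):
    """mask[t][s] is True when target t may attend to source s.

    Assumed toy definition: causal (s <= t) and same block.
    """
    return [[s <= t and s // block == t // block for s in range(n)]
            for t in range(n)]


def dependency_sets(mask, layers):
    """Structural dependency set of each position after `layers` identical
    attention layers, with all other operations assumed positionwise."""
    n = len(mask)
    deps = [{t} for t in range(n)]  # layer 0: each position sees only itself
    for _ in range(layers):
        # A position inherits the dependencies of every source it attends to.
        deps = [set().union(*(deps[s] for s in range(n) if mask[t][s]))
                for t in range(n)]
    return deps


n, block = 8, 4
deps = dependency_sets(block_causal_mask(n, block), layers=6)
print(sorted(deps[4]))  # → [4]           (token 3 is adjacent but unreachable)
print(sorted(deps[7]))  # → [4, 5, 6, 7]  (own block prefix only)
```

Because the dependency set of position 4 never contains position 3, any label carried only by position 3 is invisible to the prediction at position 4. In this toy setting, that is the situation in which a model can do no better than chance on a K-way copy task.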

To address this, we derive phase-conditioned coverage functions that reveal how reachability is determined by both the distance between source and target tokens and the target’s specific offset within its block. These coverage laws serve as predictive tools for identifying when sparse patterns are likely to fail, when boundary repairs will be beneficial, and why sliding-window attention cannot simply replace boundary repair techniques.
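A rough way to see what a phase-conditioned coverage function measures is to tabulate reachability by the pair (target offset within its block, source distance). The sketch below is an illustration under assumed toy definitions, not the paper's coverage law: `coverage` is a hypothetical helper, the block mask is same-block causal with block size 4, and the sliding window lets each target see itself and the one preceding token (width 2).

```python
def reachable(mask, layers):
    """reach[t] = set of sources that can influence t after `layers` layers."""
    n = len(mask)
    reach = [{t} for t in range(n)]
    for _ in range(layers):
        reach = [set().union(*(reach[s] for s in range(n) if mask[t][s]))
                 for t in range(n)]
    return reach


def coverage(mask, layers, block, max_dist):
    """cov[(phase, d)]: fraction of targets at block offset `phase` that can
    reach the token d positions earlier."""
    n = len(mask)
    reach = reachable(mask, layers)
    hits, totals = {}, {}
    for t in range(n):
        for d in range(1, max_dist + 1):
            if t - d < 0:
                continue  # no source exists at this distance
            key = (t % block, d)
            totals[key] = totals.get(key, 0) + 1
            hits[key] = hits.get(key, 0) + ((t - d) in reach[t])
    return {k: hits[k] / totals[k] for k in totals}


n, B, W = 16, 4, 2  # assumed toy sizes
block_mask = [[s <= t and s // B == t // B for s in range(n)] for t in range(n)]
window_mask = [[0 <= t - s < W for s in range(n)] for t in range(n)]

block_cov = coverage(block_mask, layers=1, block=B, max_dist=3)
window_cov = coverage(window_mask, layers=1, block=B, max_dist=3)

print(block_cov[(0, 1)], window_cov[(0, 1)])  # → 0.0 1.0
print(block_cov[(3, 3)], window_cov[(3, 3)])  # → 1.0 0.0
```

In this toy single-layer comparison neither pattern dominates the other: the window reaches the previous token at phase 0 where the block mask cannot, while the block mask reaches distance 3 at phase 3 where the narrow window cannot.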

We introduce Boundary Bridge Attention as a constructive solution. The method keeps the fixed block path and adds auxiliary causal edges near block boundaries; because these edges reuse shared projections, they introduce no additional parameters. In controlled experiments on 1024-token sequences, improvements appeared primarily in the diagnostics aligned with the coverage metrics. As secondary evidence of external validity, a fixed-checkpoint probe of Qwen2.5-7B on 8K-token sequences exhibited the same pattern of coverage incomparability.
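The abstract does not specify exactly which auxiliary edges are added, so the following is only one hypothetical way such a repair could look at the level of the mask. It assumes a parameter `width` (not from the paper): targets in the first `width` offsets of a block may additionally attend to the last `width` tokens of the preceding block. The sketch models connectivity only and says nothing about projections or parameters.

```python
def bridged_mask(n, block, width):
    """Block causal mask plus assumed bridge edges across each boundary.

    Hypothetical construction for illustration, not the paper's definition.
    """
    def allowed(t, s):
        same_block = s <= t and s // block == t // block
        bridge = (
            s // block == t // block - 1   # source is in the previous block
            and s % block >= block - width  # ... among its last `width` tokens
            and t % block < width           # target is near the block start
        )
        return same_block or bridge
    return [[allowed(t, s) for s in range(n)] for t in range(n)]


def dependency_sets(mask, layers):
    """Dependency set of each position after `layers` identical layers."""
    n = len(mask)
    deps = [{t} for t in range(n)]
    for _ in range(layers):
        deps = [set().union(*(deps[s] for s in range(n) if mask[t][s]))
                for t in range(n)]
    return deps


mask = bridged_mask(n=8, block=4, width=1)
print(sorted(dependency_sets(mask, layers=1)[4]))  # → [3, 4]
print(sorted(dependency_sets(mask, layers=2)[4]))  # → [0, 1, 2, 3, 4]
```

In this toy construction a single bridge edge per boundary is enough to reconnect the graph: one layer restores the adjacent token, and a second layer exposes the whole previous block through the unchanged block path.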

The primary contribution of this work is a theory-driven diagnostic framework designed to address the mismatch between locality and reachability in block-sparse causal attention. This framework is complemented by phase-conditioned coverage analysis and a minimal, effective constructive repair mechanism.


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
