arXiv

Feature Alignment Determines Fusion Strategy: A Comparative Study of Cross-Attention and Concatenation in Multimodal Learning

June 2, 2026 · Zhiqiang Zhou, Xuezhen Xie

Title: Feature Alignment as the Key Determinant of Fusion Strategy: Comparing Cross-Attention and Concatenation in Multimodal Learning

Original: arXiv:2606.01207v1 Announce Type: cross Abstract: The choice between cross-attention and concatenation for multimodal fusion remains governed by practitioner intuition rather than principled understanding. In this paper, we demonstrate that feature alignment quality, not data scale alone, is the primary determinant of which fusion strategy excels. Through controlled experiments on Flickr8k using two feature extraction backbones (ResNet18 and CLIP ViT-B/32), we show that concatenation outperforms cross-attention by 4.1-5.1 percentage points across all tested scales (2048-16384 samples) when features are pre-aligned by a vision-language pretraining objective. We provide a theoretical explanation grounded in sample complexity analysis: concatenation requires O(d_v + d_t) samples to learn its fusion projection, while cross-attention requires O(d_v * d_t) samples to learn bilinear attention weights, over 256 times as many for 512-dimensional CLIP features. When features are already aligned, the approximation error gap between the two methods vanishes, and concatenation's sample efficiency dominates at all practical dataset sizes. An alignment degradation study confirms a monotonic trend: as feature alignment degrades, concatenation's advantage grows from 1.3% to 2.8%. These findings provide a principled decision framework for fusion method selection in multimodal systems, with direct implications for the design of Multimodal Large Language Models.
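The 256-fold figure can be checked with a few lines of arithmetic. The sketch below assumes d_v = d_t = 512 (the abstract says only "512-dimensional CLIP features", so equal image and text dimensions are an assumption) and ignores the constants hidden by the O(.) notation. The function name `complexity_ratio` is hypothetical and does not come from the paper.

```python
def complexity_ratio(d_v: int, d_t: int) -> float:
    """Ratio of the cross-attention bound O(d_v * d_t) to the
    concatenation bound O(d_v + d_t), with constants ignored."""
    return (d_v * d_t) / (d_v + d_t)


# Assumed dimensions: image and text features both 512-dimensional.
print(complexity_ratio(512, 512))  # 256.0
```

With equal dimensions d, the ratio reduces to d / 2, which gives 256 for d = 512. The abstract's wording "over 256 times" may reflect constants or conditions that the abstract does not state.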

Rewrite: Title: Feature Alignment Dictates Fusion Strategy: A Comparative Analysis of Cross-Attention and Concatenation in Multimodal Learning

arXiv:2606.01207v1 Announce Type: cross Abstract: The choice between cross-attention and concatenation for multimodal fusion still rests on practitioner intuition rather than a rigorous theoretical foundation. This study shows that the quality of feature alignment, rather than dataset size alone, is the primary factor determining which fusion strategy performs better. In controlled experiments on the Flickr8k dataset using ResNet18 and CLIP ViT-B/32 as feature extraction backbones, we observed that concatenation outperformed cross-attention by 4.1 to 5.1 percentage points across all tested sample sizes (2048 to 16384) when the features were pre-aligned by a vision-language pretraining objective. Our theoretical explanation, based on sample complexity analysis, is that concatenation needs only O(d_v + d_t) samples to learn its fusion projection, whereas cross-attention needs O(d_v * d_t) samples to learn bilinear attention weights. For 512-dimensional CLIP features, cross-attention therefore requires over 256 times as many samples as concatenation. Once features are aligned, the approximation error gap between the two approaches vanishes, so concatenation’s sample efficiency dominates at all practical dataset sizes. Furthermore, an alignment degradation study confirms a monotonic trend: as alignment quality degrades, concatenation’s advantage grows from 1.3% to 2.8%. These findings offer a principled framework for selecting fusion methods in multimodal systems, with direct implications for the design of Multimodal Large Language Models.
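To illustrate where the two sample-complexity terms come from, the following minimal sketch contrasts the number of learnable weights in a linear projection over concatenated features with the number in a single bilinear interaction. It is a simplified illustration under stated assumptions, not the paper's architecture: the function names, the toy dimensions (d_v = 4, d_t = 3), and the reduction of cross-attention to one bilinear score are all assumptions made for readability.

```python
def concat_fusion(v, t, w):
    """Linear projection of the concatenated vector [v; t].
    w holds len(v) + len(t) weights."""
    fused = v + t  # list concatenation, length d_v + d_t
    return sum(w_k * x_k for w_k, x_k in zip(w, fused))


def bilinear_fusion(v, t, W):
    """Bilinear interaction v^T W t, a stand-in for one attention score.
    W holds len(v) * len(t) weights."""
    return sum(
        v[i] * W[i][j] * t[j]
        for i in range(len(v))
        for j in range(len(t))
    )


# Toy dimensions (assumed for illustration only).
d_v, d_t = 4, 3
v = [1.0] * d_v                         # image feature
t = [2.0] * d_t                         # text feature
w = [0.5] * (d_v + d_t)                 # concatenation weights
W = [[0.5] * d_t for _ in range(d_v)]   # bilinear weights

n_concat = len(w)                        # d_v + d_t
n_bilinear = sum(len(row) for row in W)  # d_v * d_t
print(n_concat, n_bilinear)  # 7 12
```

The weight count grows additively for concatenation and multiplicatively for the bilinear form, which matches the O(d_v + d_t) versus O(d_v * d_t) scaling described in the abstract.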

Source: arXiv Generated at: 2026-06-02 00:00:00 UTC