Beyond Encoder Accumulation: Measuring Encoder Roles in Multi-Encoder VLMs
Title: Rethinking Encoder Synergy: Quantifying Specific Contributions in Multi-Encoder Vision-Language Models
Abstract:
As foundation models increasingly integrate heterogeneous visual streams, principled architectural design requires understanding how distinct encoders interact during joint training. However, the field currently lacks analytical tools for assessing these dynamics in large vision-language models (LVLMs), making it difficult to identify optimal, parameter-efficient encoder configurations prior to training. To address this, we re-examine the roles of encoders within a joint training framework through a comprehensive study on the Cambrian-1 suite of 16 benchmarks. We retrain and evaluate every non-empty subset of five widely used vision encoders (31 configurations) using a unified pipeline, consuming approximately 20,000 GPU-hours. This analysis yields three key insights.
First, encoder rankings derived from fully retraining each subset diverge significantly from those obtained by masking encoders on a fixed checkpoint; the two procedures even disagree on which single encoder performs best overall. Second, we propose decomposing each encoder’s impact into two distinct metrics: Capacity, the score an encoder achieves in isolation, and Necessity, the performance drop when that encoder is removed from the full five-encoder ensemble. These two axes are not interchangeable: combining the two encoders with the highest individual Capacity is suboptimal. Instead, pairing a high-Capacity "anchor" with an "adaptive complement" matches the performance of the full five-encoder model, with additional encoders providing only marginal improvements. Third, at a fixed parameter count, the effective rank of each encoder’s pre-projector representations explains residual variation in performance. The most effective pairs consist of an anchor whose rank remains stable during joint training and a complement whose rank increases. This suggests that higher-rank, less collapsed inputs to the encoder-projector interface facilitate a more favorable optimization regime. By combining the Capacity-Necessity decomposition with pre-projector rank analysis and rigorous retraining evaluations, this work highlights a methodological gap in multi-encoder LVLM design and provides concrete primitives to address it.
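The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The encoder names are placeholders, `scores` is a hypothetical mapping from each retrained encoder subset to its aggregate benchmark score, and the entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values) is an assumption, since the abstract does not state which definition is used.

```python
from itertools import combinations
import math

# Placeholder encoder names; the abstract does not list the five encoders.
ENCODERS = ("enc_a", "enc_b", "enc_c", "enc_d", "enc_e")


def nonempty_subsets(items):
    """Yield every non-empty subset of `items` as a frozenset."""
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def capacity(scores, encoder):
    """Capacity: score of the model retrained with this encoder alone."""
    return scores[frozenset([encoder])]


def necessity(scores, encoder, all_encoders=ENCODERS):
    """Necessity: score drop when the encoder is removed from the full set."""
    full = frozenset(all_encoders)
    return scores[full] - scores[full - {encoder}]


def effective_rank(singular_values):
    """Entropy-based effective rank (assumed definition, see lead-in).

    Takes the singular values of a feature matrix and returns
    exp(H(p)), where p is the singular values normalized to sum to 1.
    """
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

Under these assumptions, `scores` would hold one entry for each of the 31 configurations produced by `nonempty_subsets(ENCODERS)`, and both metrics are read from retrained subsets only, not from masking a fixed checkpoint. A flat singular-value spectrum gives an effective rank equal to the number of singular values, while a spectrum dominated by one value gives a value near 1, which is the collapse the rank analysis is meant to detect.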
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



