
Revisiting Model Stitching In the Foundation Model Era

June 4, 2026 · Zheda Mai, Ke Zhang, Fu-En Wang, Zixiao Ken Wang, Albert Y. C. Chen, Lu Xia, Min Sun, Wei-Lun Chao, Cheng-Hao Kuo

Title: Re-examining Model Stitching in the Era of Foundation Models

Abstract: Model stitching, defined as the connection of a source model’s earlier layers to a target model’s deeper layers through a lightweight intermediate layer, has historically functioned as a diagnostic tool for assessing representational compatibility. Previous research established that models trained on identical datasets can be stitched together with minimal performance degradation, even when they differ in initialization or training objectives. This study re-evaluates the stitchability of Vision Foundation Models (VFMs), which exhibit diverse objectives, datasets, and modalities (including CLIP, DINOv2, and SigLIP 2), posing the central question: Can heterogeneous VFMs be effectively stitched? To address this, we present a comprehensive protocol that systematically varies stitch points, stitch layer architectures, training losses, and downstream applications. Our investigation yields three key insights. First, the methodology used to train the stitch layer is critical; traditional methods, such as matching intermediate features at the connection point or optimizing task loss end-to-end, often fail to preserve accuracy, particularly when stitching at shallow depths. Second, applying a straightforward feature-matching loss at the target model’s penultimate layer enables reliable stitching of heterogeneous VFMs across various vision tasks. Third, when stitching at deeper layers, the resulting model can outperform both individual base models with only a marginal increase in inference cost due to the stitch layer. Leveraging these results, we introduce the VFM Stitch Tree (VST), an architecture that shares early layers among multiple VFMs while preserving their unique later layers. This approach offers a tunable trade-off between accuracy and latency for multimodal large language models that typically rely on several VFMs. Ultimately, this research transforms model stitching from a mere analytical probe into a viable strategy for combining complementary VFM capabilities and identifying the specific points where their representations converge or differ.
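
To make the second insight concrete, the following is a minimal toy sketch of training a stitch layer with a feature-matching loss at the target model's penultimate layer. It is not the authors' implementation. Everything in it is an assumption made for illustration: the tiny frozen tanh layers standing in for VFM blocks, the affine stitch layer, the names `STITCH_AT`, `stitch`, and `stitched_penultimate`, the finite-difference gradients, and the step-halving rule used in place of a real optimizer.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the toy run is reproducible

def make_layer(n_in, n_out):
    """Frozen tanh layer with fixed random weights (toy stand-in for a VFM block)."""
    W = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    b = [rng.uniform(-1, 1) for _ in range(n_out)]
    return lambda x: [math.tanh(sum(w * v for w, v in zip(row, x)) + bi)
                      for row, bi in zip(W, b)]

def run(layers, x):
    for layer in layers:
        x = layer(x)
    return x

# Hypothetical frozen models: the source has width 3, the target width 2.
source = [make_layer(4, 3), make_layer(3, 3), make_layer(3, 3)]
target = [make_layer(4, 2), make_layer(2, 2), make_layer(2, 2),
          make_layer(2, 1)]  # the last target layer plays the role of the head

STITCH_AT = 1            # source layers [0, STITCH_AT) feed target layers [STITCH_AT, head)
SRC_DIM, TGT_DIM = 3, 2  # feature widths on either side of the stitch

def stitch(params, h):
    """Affine stitch layer: maps source-width features to target-width features."""
    out = []
    for i in range(TGT_DIM):
        row = params[i * (SRC_DIM + 1):(i + 1) * (SRC_DIM + 1)]
        out.append(sum(w * v for w, v in zip(row[:-1], h)) + row[-1])
    return out

def stitched_penultimate(params, x):
    h = run(source[:STITCH_AT], x)       # early layers of the source model
    h = stitch(params, h)                # trainable stitch layer
    return run(target[STITCH_AT:-1], h)  # later target layers, stopping before the head

def target_penultimate(x):
    return run(target[:-1], x)           # the target model's own penultimate features

def loss(params, batch):
    """Mean squared distance between stitched and target penultimate features."""
    total = 0.0
    for x in batch:
        a, b = stitched_penultimate(params, x), target_penultimate(x)
        total += sum((u - v) ** 2 for u, v in zip(a, b))
    return total / len(batch)

def numerical_grad(params, batch, eps=1e-5):
    """Central-difference gradient with respect to the stitch parameters only."""
    grad = []
    for i in range(len(params)):
        up = params[:i] + [params[i] + eps] + params[i + 1:]
        down = params[:i] + [params[i] - eps] + params[i + 1:]
        grad.append((loss(up, batch) - loss(down, batch)) / (2 * eps))
    return grad

batch = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(16)]
params = [0.0] * (TGT_DIM * (SRC_DIM + 1))  # stitch weights and biases, flattened
initial_loss = loss(params, batch)

lr, current = 0.5, initial_loss
for _ in range(200):
    grad = numerical_grad(params, batch)
    candidate = [p - lr * g for p, g in zip(params, grad)]
    cand_loss = loss(candidate, batch)
    if cand_loss < current:
        params, current = candidate, cand_loss  # accept the improving step
    else:
        lr *= 0.5                               # step was too large; shrink it
final_loss = current

print(f"penultimate feature-matching loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

Only the stitch parameters are updated; both toy models stay frozen. The loss compares features after the stitched signal has passed through the remaining target layers, rather than at the connection point itself, which is the distinction the abstract draws between the traditional approach and the penultimate-layer approach.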

Source: arXiv Generated at: 2026-06-04 00:00:00 UTC