arXiv

Revisiting Model Stitching In the Foundation Model Era

Title: Re-examining Model Stitching in the Era of Foundation Models

Abstract: Model stitching, which connects the early layers of a source model to the later layers of a target model through a lightweight stitch layer, has historically served as a diagnostic tool for assessing representational compatibility. Previous research established that models trained on the same dataset can be stitched together with minimal performance degradation, even when they differ in initialization or training objective. This study re-evaluates stitchability for Vision Foundation Models (VFMs) that differ in objectives, datasets, and modalities (including CLIP, DINOv2, and SigLIP 2), and poses the central question: can heterogeneous VFMs be effectively stitched? To address this, we present a comprehensive protocol that systematically varies stitch points, stitch layer architectures, training losses, and downstream tasks. Our investigation yields three key insights. First, the way the stitch layer is trained is critical: conventional approaches, such as matching intermediate features at the stitch point or optimizing the task loss end-to-end, often fail to preserve accuracy, particularly when stitching at shallow depths. Second, applying a simple feature-matching loss at the target model's penultimate layer enables reliable stitching of heterogeneous VFMs across a range of vision tasks. Third, when stitching at deeper layers, the stitched model can outperform both individual base models, with only a marginal increase in inference cost from the stitch layer. Building on these results, we introduce the VFM Stitch Tree (VST), an architecture that shares early layers among multiple VFMs while preserving their distinct later layers. This design offers a tunable trade-off between accuracy and latency for multimodal large language models, which typically rely on several VFMs. Ultimately, this research turns model stitching from a purely analytical probe into a practical strategy for combining complementary VFM capabilities and for identifying where their representations converge or diverge.
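
The second finding, matching features at the target model's penultimate layer instead of at the stitch point, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the abstract does not specify the loss form or the stitch layer architecture, so the sketch uses mean squared error and a single linear map, and the "models" are small random ReLU layers, not real VFMs. The names `make_layer`, `apply_stitch`, and `penultimate_matching_loss` are hypothetical.

```python
import random


def make_layer(dim, rng):
    """Hypothetical toy layer: a random linear map followed by ReLU."""
    w = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

    def layer(x):
        return [max(0.0, sum(wij * xj for wij, xj in zip(row, x))) for row in w]

    return layer


def run(layers, x):
    """Apply a list of layers in order."""
    for layer in layers:
        x = layer(x)
    return x


def apply_stitch(stitch, x):
    """Assumed stitch layer: one linear map, stored as a list of rows."""
    return [sum(sij * xj for sij, xj in zip(row, x)) for row in stitch]


def penultimate_matching_loss(source, target, stitch, k, x):
    """MSE between the target's own penultimate features and those of the
    stitched model (source layers [0, k) -> stitch -> target layers [k, -1))."""
    reference = run(target[:-1], x)   # target alone, up to its penultimate layer
    h = run(source[:k], x)            # early layers taken from the source
    h = apply_stitch(stitch, h)       # map source features into target space
    stitched = run(target[k:-1], h)   # later target layers, stopping before the last
    return sum((a - b) ** 2 for a, b in zip(stitched, reference)) / len(reference)


rng = random.Random(0)
dim, depth, k = 4, 4, 2
target = [make_layer(dim, rng) for _ in range(depth)]
source = [make_layer(dim, rng) for _ in range(depth)]
identity = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
x = [rng.uniform(0, 1) for _ in range(dim)]

# Sanity check: stitching a model to itself with an identity stitch is lossless.
loss_self = penultimate_matching_loss(target, target, identity, k, x)
# Stitching two different models with an untrained stitch generally is not.
loss_cross = penultimate_matching_loss(source, target, identity, k, x)
print(loss_self, loss_cross)
```

In an actual training setup, only the stitch parameters would be optimized to reduce this loss over a dataset while both base models stay frozen; the sketch only evaluates the loss for a fixed stitch and a single input.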


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
