RePercENT: Scaling Disentangled Representation Learning Beyond Two Modalities
Title: RePercENT: Expanding Disentangled Representation Learning to More Than Two Modalities
Abstract:
To fully capitalize on the capabilities of multimodal data, it is essential to develop representations that transcend current state-of-the-art alignment and fusion techniques. These new methods must harness all cross-modal interactions while preserving modality-specific details. Disentangled representation learning offers a rigorous approach to uncovering the latent shared and distinct factors concealed within observational data. Although this paradigm is highly promising for multimodal applications, current methodologies are predominantly restricted to scenarios involving only two modalities, a limitation driven by inherent scalability challenges.
In response to this constraint, we introduce RePercENT, a self-supervised framework engineered to overcome these barriers and enable scalable pairwise disentanglement across more than two modalities. Our method utilizes a multimodal "plug-and-play" architecture that functions directly on pre-extracted embeddings. This design removes the necessity for extensive joint pre-training and imposes no specific assumptions about the underlying modalities or the foundation model backbones. Furthermore, we present a joint optimization objective that concurrently extracts shared and unique components, backed by formal theoretical guarantees that define the optimality of our solution. Experimental results across a variety of modalities and tasks demonstrate that RePercENT effectively recovers disentangled components, maintains competitive performance levels, and substantially lowers computational complexity.
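To make the "plug-and-play" idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes frozen, pre-extracted embeddings per modality, one pair of shared-component heads for every modality pair, and one unique-component head per modality. All names (`make_head`, `build_pairwise_heads`, `encode`), the dimensions, and the use of random linear maps in place of trained heads are hypothetical. The training objective is omitted because the abstract does not specify it.

```python
import itertools
import random


def make_head(in_dim, out_dim, rng):
    """Random linear map (out_dim x in_dim); a stand-in for a small trainable head."""
    scale = 1.0 / in_dim ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_head(head, x):
    """Matrix-vector product using plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in head]


def build_pairwise_heads(dims, latent_dim, seed=0):
    """dims maps modality name -> embedding size (assumed layout, not from the paper)."""
    rng = random.Random(seed)
    shared = {}
    # One entry per unordered modality pair: M * (M - 1) / 2 pairs for M modalities.
    for a, b in itertools.combinations(sorted(dims), 2):
        shared[(a, b)] = {
            a: make_head(dims[a], latent_dim, rng),
            b: make_head(dims[b], latent_dim, rng),
        }
    # One modality-specific (unique) head per modality.
    unique = {m: make_head(d, latent_dim, rng) for m, d in dims.items()}
    return shared, unique


def encode(embeddings, shared, unique):
    """Project frozen embeddings into pairwise shared codes and per-modality unique codes."""
    codes = {"shared": {}, "unique": {}}
    for (a, b), heads in shared.items():
        codes["shared"][(a, b)] = (
            apply_head(heads[a], embeddings[a]),
            apply_head(heads[b], embeddings[b]),
        )
    for m, head in unique.items():
        codes["unique"][m] = apply_head(head, embeddings[m])
    return codes


# Toy usage with three hypothetical modalities and constant dummy embeddings.
dims = {"image": 8, "text": 6, "audio": 4}
embeddings = {m: [0.1] * d for m, d in dims.items()}
shared, unique = build_pairwise_heads(dims, latent_dim=3)
codes = encode(embeddings, shared, unique)
```

Because the heads sit on top of fixed embeddings, adding a modality in this sketch only adds new lightweight heads for the new pairs. The backbones that produced the embeddings are never retrained.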
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC





