arXiv

Cross-Modality Feature Fusion Based on Structured State Space Duality for Multimodal Image Registration Network

June 3, 2026 · Zhikang Li, Yan Wu, Xin Hu, Yi Dai, Ming Li

Title: Leveraging Structured State Space Duality for Cross-Modality Feature Fusion in Multimodal Image Registration

Abstract

Extracting shared structural information is the central challenge in multimodal image registration. Transformers are commonly used for this task, but Structured State Space Duality (SSD) trains and runs inference more efficiently while also offering stronger global structural feature extraction. Building on these benefits, we introduce RegNetMamba-2, a novel algorithm for multimodal image registration that integrates SSD into a coarse-to-fine matching framework to capture both local and global structural features.
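To make the "duality" in SSD concrete, the following minimal sketch (not the authors' implementation) shows the two equivalent ways of evaluating a state space layer with a scalar decay per step: a linear-time recurrence and a quadratic, attention-like masked matrix product. The function names, the single-channel input, and the plain-list data layout are assumptions made for illustration only.

```python
def ssd_recurrent(a, B, C, x):
    """Linear-time form: h_t = a_t * h_{t-1} + B_t * x_t,  y_t = <C_t, h_t>."""
    h = [0.0] * len(B[0])          # hidden state, one entry per state dimension
    ys = []
    for a_t, B_t, C_t, x_t in zip(a, B, C, x):
        h = [a_t * h_k + b_k * x_t for h_k, b_k in zip(h, B_t)]
        ys.append(sum(c_k * h_k for c_k, h_k in zip(C_t, h)))
    return ys


def ssd_attention(a, B, C, x):
    """Quadratic dual form: y_i = sum_{j<=i} L[i][j] * <C_i, B_j> * x_j,
    where L[i][j] = a_{j+1} * ... * a_i (empty product = 1 when j == i)."""
    ys = []
    for i in range(len(x)):
        y_i = 0.0
        decay = 1.0                # running value of L[i][j]
        for j in range(i, -1, -1):
            score = sum(c * b for c, b in zip(C[i], B[j]))   # <C_i, B_j>
            y_i += decay * score * x[j]
            decay *= a[j]          # extend the product for the next, earlier j
        ys.append(y_i)
    return ys


# Toy sequence of length 4 with a 2-dimensional state (values are arbitrary).
a = [0.9, 0.5, 0.8, 0.7]
B = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
C = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
x = [1.0, 2.0, -1.0, 0.5]

y_rec = ssd_recurrent(a, B, C, x)
y_att = ssd_attention(a, B, C, x)
```

Both functions compute the same output up to floating-point rounding. The recurrent form costs time linear in the sequence length, while the masked-product form resembles causal attention with a decay mask, which is the sense in which SSD can stand in for a Transformer block.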

Our network applies SSD at three scales to extract multimodal features. To enhance local representation, we use SSD’s feature scaling function to emphasize foreground edges and structural details. To handle shared feature extraction and multimodal feature fusion across all scales, we develop a cross-modality feature fusion model built on SSD. This model has two components: the Cross-Modality feature Interaction (CMI) module and the Multi-Scale feature Fusion (MSF) module. The CMI module performs cross-modality feature extraction at each scale through a cross-form SSD mechanism. The MSF module performs progressive upward fusion at the feature level, refining features by aggregating multimodal information from all scales.
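The progressive upward fusion performed by the MSF module can be pictured with a deliberately simplified sketch. The abstract does not describe the actual fusion operator, so everything here is an assumption: nearest-neighbour upsampling, element-wise addition as the fusion step, a factor of 2 between adjacent levels, and single-channel feature maps stored as nested lists. The helper names `upsample2x`, `add_maps`, and `fuse_upward` are hypothetical.

```python
def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a 2D feature map (list of rows)."""
    out = []
    for row in feat:
        wide = [v for v in row for _ in range(2)]   # repeat each column twice
        out.append(wide)
        out.append(list(wide))                      # repeat each row twice
    return out


def add_maps(a, b):
    """Element-wise sum of two equally sized 2D maps (stand-in fusion operator)."""
    return [[p + q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def fuse_upward(pyramid):
    """Fuse a coarse-to-fine pyramid by repeatedly upsampling and merging.

    `pyramid` is ordered coarsest first; each level is assumed to have
    exactly twice the resolution of the previous one.
    """
    fused = pyramid[0]
    for finer in pyramid[1:]:
        fused = add_maps(upsample2x(fused), finer)
    return fused


# Toy three-level pyramid: 1x1 (coarsest), 2x2, and 4x4 (finest).
coarse = [[1.0]]
middle = [[1.0, 2.0], [3.0, 4.0]]
fine = [[0.0] * 4 for _ in range(4)]

fused = fuse_upward([coarse, middle, fine])
```

The point of the sketch is the data flow only: information from the coarsest level is carried upward so that the finest output reflects every scale.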

Following the coarse-to-fine strategy, the network takes 1/8-scale features from the CMI module and 1/2-scale features from the MSF module to compute matching probability scores, from which pixel-wise correspondences are then established. Comprehensive experiments show that RegNetMamba-2 outperforms state-of-the-art deep learning-based algorithms in both efficiency and performance for multimodal image registration. These results are validated on several datasets, including VIS-SAR (OSDataset), VIS-IR (LGHD/RoadScene), and VIS-NIR (RGB-NIR Scene).
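The abstract does not state how the matching probability scores are computed. One common choice in coarse-to-fine matchers is a dual softmax over a feature similarity matrix followed by a mutual-nearest-neighbour check, and the sketch below illustrates that choice as an assumption, not as the paper's method. The function names and the threshold value of 0.2 are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dual_softmax(scores):
    """P[i][j] = softmax over row i (at j) * softmax over column j (at i)."""
    rows = [softmax(r) for r in scores]
    cols = [softmax(list(c)) for c in zip(*scores)]   # cols[j][i]
    return [[rows[i][j] * cols[j][i] for j in range(len(scores[0]))]
            for i in range(len(scores))]


def mutual_matches(prob, threshold=0.2):
    """Keep (i, j) only if each is the other's best match and P exceeds threshold."""
    matches = []
    for i, row in enumerate(prob):
        j = max(range(len(row)), key=row.__getitem__)
        best_i = max(range(len(prob)), key=lambda r: prob[r][j])
        if best_i == i and row[j] > threshold:
            matches.append((i, j))
    return matches


# Toy similarity matrix between 3 coarse cells of image A and 3 of image B.
scores = [
    [5.0, 0.0, 0.0],
    [0.0, 5.0, 0.0],
    [0.0, 0.0, 0.0],   # an ambiguous cell with no clear partner
]

prob = dual_softmax(scores)
matches = mutual_matches(prob)
```

In a coarse-to-fine pipeline, matches found this way on the coarse grid would then be refined to pixel-level positions using the finer-scale features.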
