arXiv

Cross-Modality Feature Fusion Based on Structured State Space Duality for Multimodal Image Registration Network

Title: Leveraging Structured State Space Duality for Cross-Modality Feature Fusion in Multimodal Image Registration

Abstract

Extracting shared structural information is the central challenge in multimodal image registration. Transformers are commonly used for this task, but Structured State Space Duality (SSD) is more efficient in both training and inference and also provides stronger global structural feature extraction. Building on these strengths, we introduce RegNetMamba-2, a multimodal image registration algorithm that integrates SSD into a coarse-to-fine matching framework to capture both local and global structural features.
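As background on the "duality" in SSD, the sketch below shows the general idea for a single channel with a scalar decay per step: the same sequence map can be evaluated as a linear-time recurrence or as a masked, attention-like matrix product. This is a minimal illustration, not the RegNetMamba-2 implementation; the function names, shapes, and the scalar-input simplification are assumptions made here for clarity.

```python
import numpy as np

def ssd_recurrent(a, B, C, x):
    """Recurrent form: h_t = a_t * h_{t-1} + B_t * x_t, y_t = C_t . h_t."""
    T, N = B.shape
    h = np.zeros(N)
    y = np.zeros(T)
    for t in range(T):
        h = a[t] * h + B[t] * x[t]
        y[t] = C[t] @ h
    return y

def ssd_quadratic(a, B, C, x):
    """Dual form: y = (L * (C B^T)) x, with L[i, j] = a_{j+1} ... a_i for i >= j."""
    # Assumes a > 0 so the cumulative product can be built in log space.
    log_cum = np.cumsum(np.log(a))
    L = np.tril(np.exp(log_cum[:, None] - log_cum[None, :]))
    return (L * (C @ B.T)) @ x

# Illustrative random inputs (hypothetical sizes: 16 steps, state size 4).
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, size=16)
B = rng.normal(size=(16, 4))
C = rng.normal(size=(16, 4))
x = rng.normal(size=16)
same = np.allclose(ssd_recurrent(a, B, C, x), ssd_quadratic(a, B, C, x))
```

The recurrent form costs time linear in sequence length, while the matrix form exposes the computation as dense matrix multiplications; practical SSD implementations typically combine the two over blocks of the sequence.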

Our network employs SSD at three scales to extract multimodal features. To enhance local representation, we use SSD's feature scaling function to emphasize foreground edges and structural details. To handle shared feature extraction and multimodal feature fusion across all scales, we develop a cross-modality feature fusion model based on SSD. This model has two components: the Cross-Modality feature Interaction (CMI) module and the Multi-Scale feature Fusion (MSF) module. The CMI module extracts cross-modality features at each scale through a cross-form SSD mechanism. The MSF module refines features through progressive upward fusion at the feature level, aggregating multimodal information from all scales.
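To make the idea of progressive fusion across scales concrete, here is a rough sketch of one common pattern: upsample the coarser feature map and merge it into the next finer one, step by step. The abstract does not specify how the MSF module fuses features, so everything below is an assumption for illustration: the intermediate 1/4 scale, the equal channel counts, nearest-neighbour upsampling, additive merging, and the names `upsample2x` and `fuse_coarse_to_fine`.

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def fuse_coarse_to_fine(f8, f4, f2):
    """Merge 1/8 -> 1/4 -> 1/2 features by upsampling and adding (assumed scheme)."""
    p4 = f4 + upsample2x(f8)   # fused 1/4-scale features
    p2 = f2 + upsample2x(p4)   # fused 1/2-scale features
    return p2

# Hypothetical feature maps for a 16x16 input with 2 channels.
f8 = np.ones((2, 2, 2))
f4 = np.ones((2, 4, 4))
f2 = np.ones((2, 8, 8))
fused = fuse_coarse_to_fine(f8, f4, f2)
```

A learned fusion would normally replace the plain addition with convolutional or SSD-based layers; the sketch only shows how information from every scale can reach the finest map.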

Following the coarse-to-fine strategy, features from the 1/8 scale (via the CMI module) and the 1/2 scale (via the MSF module) are used to compute matching probability scores, from which pixel-wise correspondences are established. Comprehensive experiments indicate that RegNetMamba-2 outperforms state-of-the-art deep learning-based algorithms in both efficiency and performance for multimodal image registration. These results are validated across several datasets, including VIS-SAR (OSDataset), VIS-IR (LGHD/RoadScene), and VIS-NIR (RGB-NIR Scene).
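One widely used way to turn feature similarities into matching probability scores in coarse-to-fine matchers is a dual softmax followed by a mutual-nearest-neighbour check. The abstract does not state which scoring rule RegNetMamba-2 uses, so the sketch below is an assumption for illustration; `temperature`, `threshold`, and the function names are hypothetical.

```python
import numpy as np

def softmax(s, axis):
    """Numerically stable softmax along one axis."""
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_softmax_matches(fa, fb, temperature=0.1, threshold=0.2):
    """Score all pairs of descriptors and keep confident mutual best matches."""
    fa = fa / np.linalg.norm(fa, axis=1, keepdims=True)
    fb = fb / np.linalg.norm(fb, axis=1, keepdims=True)
    sim = fa @ fb.T / temperature
    # Product of row-wise and column-wise softmax gives the match probability.
    prob = softmax(sim, axis=1) * softmax(sim, axis=0)
    row_best = prob.argmax(axis=1)
    col_best = prob.argmax(axis=0)
    matches = [
        (i, int(j))
        for i, j in enumerate(row_best)
        if col_best[j] == i and prob[i, j] > threshold
    ]
    return prob, matches

# Toy descriptors: image B holds the same three descriptors in shuffled order.
fa = np.eye(3)
fb = np.eye(3)[[2, 0, 1]]
prob, matches = dual_softmax_matches(fa, fb)
```

In a coarse-to-fine pipeline, matches found this way on the coarse grid would then be refined to pixel-level positions using the finer-scale features.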


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
