Co-Fusion4D: Spatio-temporal Collaborative Fusion for Robust 3D Object Detection
Abstract: Accurate perception and dependable decision-making in autonomous driving rely heavily on 3D object detection. However, bird's-eye-view (BEV) detectors often lose spatiotemporal consistency across frames, because object motion and ego-motion leave BEV features misaligned over time. To address this, we introduce Co-Fusion4D, a framework designed to maintain cross-frame spatiotemporal consistency while curbing temporal feature drift.
Co-Fusion4D takes a current-frame-centric approach: the present frame is the primary information source, and historical features are integrated selectively, only after they have been spatiotemporally filtered and aligned. This dominant-complementary mechanism reduces cumulative alignment errors, blocks the propagation of noisy features, and leverages trustworthy temporal cues to generate a more stable BEV representation. The framework also incorporates a Dual Attention Fusion (DAF) module to strengthen spatiotemporal feature interaction. DAF combines intra-frame spatial attention with inter-frame temporal attention to dynamically align and merge multi-frame features, highlighting regions consistent with motion while filtering out spurious correlations.
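To make the dominant-complementary idea concrete, the following is a minimal sketch in plain Python, not the Co-Fusion4D implementation. It assumes that history BEV grids have already been warped into the current ego frame, that the grid is flattened into a list of cells, and that history enters as a gated residual on top of the unchanged current feature. The function names (`fuse_cell`, `fuse_bev`), the `max_history_weight` parameter, and the dot-product agreement gate are hypothetical choices made for illustration. Only the inter-frame temporal attention branch is sketched; the intra-frame spatial attention branch of DAF is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_cell(current, history, max_history_weight=0.5):
    """Fuse one BEV cell: a current feature vector plus T aligned history vectors.

    Assumption (not from the paper): the current feature is kept whole
    (dominant) and history adds a gated residual (complementary).
    """
    if not history:
        return list(current)
    scale = math.sqrt(len(current))
    # Temporal attention: the current cell is the query, history cells are keys and values.
    weights = softmax([dot(current, h) / scale for h in history])
    context = [
        sum(w * h[i] for w, h in zip(weights, history))
        for i in range(len(current))
    ]
    # Gate between 0 and max_history_weight: a sigmoid of the agreement between
    # the current feature and the attended history (written with tanh for stability).
    agreement = dot(current, context) / scale
    gate = max_history_weight * 0.5 * (1.0 + math.tanh(agreement / 2.0))
    return [c + gate * x for c, x in zip(current, context)]


def fuse_bev(current_bev, history_bevs, max_history_weight=0.5):
    """Apply fuse_cell to every cell of a flattened BEV grid.

    current_bev: list of N cells, each a list of C floats.
    history_bevs: list of T grids of the same shape, assumed pre-aligned.
    """
    fused = []
    for idx, cell in enumerate(current_bev):
        history = [frame[idx] for frame in history_bevs]
        fused.append(fuse_cell(cell, history, max_history_weight))
    return fused
```

In this sketch a history frame that disagrees with the current cell receives a small attention weight and a small gate, so it can shift the fused feature only slightly, while the current frame always passes through unchanged. A learned model would replace the fixed dot-product scores and gate with trained projections.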
Unlike standard uniform fusion methods, this architecture significantly enhances the temporal stability and discriminative power of BEV representations. Comprehensive evaluations on the nuScenes benchmark show that Co-Fusion4D attains state-of-the-art results, achieving a mean Average Precision (mAP) of 74.9% and a nuScenes Detection Score (NDS) of 75.6%. These results are obtained without test-time augmentation or external datasets.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





