PillarDETR: YOLO-Backbone and RT-DETR Head for Real-Time 3D Object Detection

June 2, 2026 · Smit Kadvani, Shriya Gumber, Kriti Faujdar, Harsh Dave

Title: PillarDETR: Leveraging YOLO-Backbone and RT-DETR Head for Real-Time 3D Object Detection

Original: arXiv:2606.01757v1 · Announce Type: new

Abstract: Efficient and accurate 3D object detection is essential for the safety of autonomous vehicles and robotic systems. Although LiDAR point clouds offer precise spatial data, processing them quickly remains a major hurdle. Conventional approaches often depend on intricate 3D convolutions or anchor-based frameworks, which struggle to balance high detection accuracy against fast inference. To address this, we introduce PillarDETR, a new end-to-end architecture for 3D object detection that merges the computational efficiency of pillar-based LiDAR encoding with the strong representational capabilities of contemporary 2D vision models.

In our design, we replace the conventional convolutional backbone with a Cross Stage Partial (CSP) network based on YOLOv8, which enables more robust feature extraction from pseudo-images. We also move away from standard anchor-based or center-based detection heads, opting instead for a Real-Time Detection Transformer (RT-DETR) decoder. This combination enables the model to capture global context and output 3D bounding boxes directly, eliminating the need for non-maximum suppression (NMS). Comprehensive testing on the KITTI and nuScenes datasets shows that PillarDETR offers an excellent balance between mean Average Precision (mAP) and inference latency. Ablation studies further confirm that integrating the YOLOv8 backbone and RT-DETR head yields significant performance gains over the PointPillars baseline, positioning PillarDETR as a strong candidate for real-time 3D perception tasks.
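
To make the two ends of the pipeline described above more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a simplified pillar encoder that writes two hand-made channels (point count and mean height) instead of learned features, and a decoder stage reduced to a score threshold over per-query predictions. The names `pillarize`, `decode_queries`, the grid ranges, and the threshold value are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def pillarize(points, x_range, y_range, cell):
    """Bin (x, y, z) points into vertical pillars and scatter them to a grid.

    Returns a pseudo-image as nested lists shaped [2][H][W]:
    channel 0 holds the point count, channel 1 the mean z of each pillar.
    These hand-made channels are an assumption standing in for a learned
    pillar feature encoder.
    """
    width = int(round((x_range[1] - x_range[0]) / cell))
    height = int(round((y_range[1] - y_range[0]) / cell))

    buckets = defaultdict(list)
    for x, y, z in points:
        inside_x = x_range[0] <= x < x_range[1]
        inside_y = y_range[0] <= y < y_range[1]
        if not (inside_x and inside_y):
            continue  # drop points that fall outside the grid
        # Clamp to guard against floating-point rounding at the upper edge.
        col = min(int((x - x_range[0]) / cell), width - 1)
        row = min(int((y - y_range[0]) / cell), height - 1)
        buckets[(row, col)].append(z)

    image = [[[0.0] * width for _ in range(height)] for _ in range(2)]
    for (row, col), zs in buckets.items():
        image[0][row][col] = float(len(zs))
        image[1][row][col] = sum(zs) / len(zs)
    return image


def decode_queries(query_scores, query_boxes, threshold=0.5):
    """Keep a query's box when its confidence passes the threshold.

    Each query proposes at most one object, so the surviving boxes are
    returned as-is, with no non-maximum suppression pass.
    """
    return [
        (score, box)
        for score, box in zip(query_scores, query_boxes)
        if score >= threshold
    ]


if __name__ == "__main__":
    cloud = [(0.5, 0.5, 1.0), (0.6, 0.4, 3.0), (2.5, 1.5, -1.0), (10.0, 10.0, 0.0)]
    pseudo_image = pillarize(cloud, x_range=(0.0, 4.0), y_range=(0.0, 2.0), cell=1.0)
    print(pseudo_image[0])  # per-pillar point counts
    print(decode_queries([0.9, 0.2, 0.6], ["box_a", "box_b", "box_c"]))
```

The resulting grid is what a 2D backbone would consume as an image, which is why image-domain networks can be reused on LiDAR data. In a real detector the per-pillar features and the query scores would both come from trained networks; only the data flow is illustrated here.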

Source: arXiv · Generated at: 2026-06-02 00:00:00 UTC