arXiv

Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models

June 3, 2026 · Glenn Jocher, Jing Qiu, Mengyu Liu, Shuai Lyu, Fatih Cagatay Akyon, Muhammet Esat Kalfaoglu

Title: Ultralytics YOLO26: A Unified Framework for Real-Time End-to-End Vision Tasks

Abstract:

The YOLO family has achieved widespread adoption by meeting the growing need for real-time vision models that are accurate, efficient, and easy to deploy across diverse hardware platforms. However, conventional YOLO detectors face significant bottlenecks: they depend on non-maximum suppression (NMS) at inference time, use heavy detection heads built around Distribution Focal Loss, require long training schedules, and often fail to assign positive labels to the smallest objects. In this work, we introduce Ultralytics YOLO26, a family of real-time vision models that overcomes these constraints through coordinated architectural and training innovations.
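
For context, the NMS step that conventional detectors run at inference is typically a greedy, IoU-based filter like the minimal NumPy sketch below. It is a generic textbook version (the function names and the 0.5 IoU threshold are illustrative assumptions), not the Ultralytics implementation. Its sequential, data-dependent loop is the kind of post-processing that an end-to-end, NMS-free detector avoids.

```python
import numpy as np


def box_iou(box: np.ndarray, others: np.ndarray) -> np.ndarray:
    """IoU between one box and an (N, 4) array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], others[:, 0])
    y1 = np.maximum(box[1], others[:, 1])
    x2 = np.minimum(box[2], others[:, 2])
    y2 = np.minimum(box[3], others[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return inter / (area + areas - inter + 1e-9)


def greedy_nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5) -> list[int]:
    """Return indices of the boxes that survive suppression, highest score first."""
    order = np.argsort(-scores)  # candidate indices sorted by descending confidence
    keep: list[int] = []
    while order.size > 0:
        best = int(order[0])
        keep.append(best)
        rest = order[1:]
        # Drop every remaining candidate that overlaps the kept box too much.
        ious = box_iou(boxes[best], boxes[rest])
        order = rest[ious <= iou_thresh]
    return keep


boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(greedy_nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 with IoU ~0.68
```

The loop length depends on how many candidates overlap, which makes NMS latency input-dependent and awkward to fuse into an exported graph.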

YOLO26 uses a dual-head architecture that enables native end-to-end inference without NMS. Removing Distribution Focal Loss (DFL) entirely yields a lighter detection head with an unconstrained regression range. Training is improved by three components: MuSGD, a hybrid optimizer that combines SGD with Muon, a technique borrowed from large language model training; Progressive Loss, which gradually shifts supervision toward the head used at inference; and STAL, a small-target-aware label assignment strategy that guarantees small objects receive positive labels.
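
To see why dropping DFL both lightens the head and removes the range limit, the minimal NumPy sketch below contrasts two box decoders for a single anchor in an anchor-free, distance-based (left, top, right, bottom) formulation. The formulation, the function names, and the ReLU in the DFL-free path are illustrative assumptions, not YOLO26's actual head code.

```python
import numpy as np


def dfl_decode(logits: np.ndarray) -> np.ndarray:
    """DFL-style decoding of one anchor: (4, reg_max) logits -> 4 side distances.

    Each distance is the expectation of a softmax over bins 0..reg_max-1,
    so it can never exceed reg_max - 1 (in stride units).
    """
    reg_max = logits.shape[-1]
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))  # numerically stable softmax
    probs = z / z.sum(axis=-1, keepdims=True)
    return probs @ np.arange(reg_max, dtype=float)


def direct_decode(raw: np.ndarray) -> np.ndarray:
    """DFL-free decoding of one anchor: 4 raw outputs -> 4 side distances.

    Assumption: a ReLU keeps distances non-negative; there is no upper bound.
    """
    return np.maximum(raw, 0.0)


def distances_to_box(anchor_xy: tuple[float, float], dist_ltrb: np.ndarray, stride: float) -> np.ndarray:
    """Turn (left, top, right, bottom) distances in stride units into an (x1, y1, x2, y2) box."""
    ax, ay = anchor_xy
    l, t, r, b = np.asarray(dist_ltrb, dtype=float) * stride
    return np.array([ax - l, ay - t, ax + r, ay + b])
```

The DFL decoder spends 4 × reg_max output channels per anchor on box regression, versus 4 for the direct decoder. With a hypothetical reg_max of 16 and stride of 32, the DFL decoder also caps each side at 15 × 32 = 480 pixels from the anchor, while the direct decoder has no such cap.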

Beyond standard detection, YOLO26 provides task-specific head and loss configurations for instance segmentation, pose estimation, and oriented detection, delivering consistent gains across tasks and model scales. The family comprises five model sizes (n, s, m, l, x) and supports a unified pipeline for detection, instance segmentation, pose estimation, classification, and oriented detection. It also includes YOLOE-26, an open-vocabulary extension that supports text-prompted, visual-prompted, and prompt-free inference.
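
An open-vocabulary detector typically decouples localization from the class list: each detected region is matched against whatever class embeddings are supplied at inference time. The NumPy sketch below shows that matching step with cosine similarity. The function names, the embedding inputs, and the argmax rule are hypothetical simplifications, not YOLOE-26's actual scoring.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)


def classify_regions(region_emb: np.ndarray, prompt_emb: np.ndarray, names: list[str]):
    """Assign each region the class whose prompt embedding it is most similar to.

    region_emb: (N, D) embeddings of detected regions (hypothetical detector output).
    prompt_emb: (K, D) one embedding per class; it may come from a text encoder,
                from visual exemplars, or from a fixed built-in vocabulary (prompt-free).
    """
    sims = l2_normalize(region_emb) @ l2_normalize(prompt_emb).T  # (N, K) cosine similarities
    best = sims.argmax(axis=1)
    labels = [names[i] for i in best]
    return labels, sims[np.arange(len(best)), best]


regions = np.array([[1.0, 0.0], [0.0, 2.0]])
prompts = np.array([[0.0, 1.0], [3.0, 0.0]])
labels, scores = classify_regions(regions, prompts, ["cat", "dog"])
print(labels)  # ['dog', 'cat']
```

Changing the vocabulary only changes `prompt_emb`; the region embeddings are unaffected, which is what lets one detector serve arbitrary class lists.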

Benchmark results show that YOLO26 advances the accuracy-latency Pareto frontier relative to previous real-time detectors. Across the five scales, the models reach 40.9 to 57.5 mAP on COCO, with T4 TensorRT latencies of 1.7 to 11.8 ms. The largest open-vocabulary variant, YOLOE-26x, reaches 40.6 AP on LVIS minival with text prompts. The source code and pretrained models are publicly available at https://github.com/ultralytics/ultralytics.
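
The accuracy-latency Pareto frontier is the set of models that no other model beats on both axes at once; advancing it means new models sit where no existing model dominates them. The sketch below computes that frontier for a list of (latency, mAP) points. The class and function names are illustrative, and the points are hypothetical, not measured results.

```python
from typing import NamedTuple


class ModelPoint(NamedTuple):
    name: str
    latency_ms: float  # lower is better
    map_coco: float    # higher is better


def dominates(a: ModelPoint, b: ModelPoint) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.latency_ms <= b.latency_ms and a.map_coco >= b.map_coco
    strictly_better = a.latency_ms < b.latency_ms or a.map_coco > b.map_coco
    return no_worse and strictly_better


def pareto_front(points: list[ModelPoint]) -> list[ModelPoint]:
    """Return the non-dominated points, sorted from fastest to slowest."""
    front = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(front, key=lambda p: p.latency_ms)


# Hypothetical points for illustration only (not measured results).
points = [
    ModelPoint("A", 2.0, 40.0),
    ModelPoint("B", 3.0, 39.0),  # slower and less accurate than A -> dominated
    ModelPoint("C", 5.0, 48.0),
    ModelPoint("D", 5.0, 47.0),  # same latency as C, lower mAP -> dominated
]
print([p.name for p in pareto_front(points)])  # ['A', 'C']
```

The quadratic all-pairs check is fine for a handful of model sizes; a sort-and-sweep over latency would scale better for large sweeps.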

Source: arXiv Generated at: 2026-06-03 00:00:00 UTC