arXiv

A Closer Look at In-Distribution vs. Out-of-Distribution Accuracy for Open-Set Test-time Adaptation

June 2, 2026 · Zefeng Li, Evan Shelhamer

Title: Examining the Balance Between In-Distribution and Out-of-Distribution Performance in Open-Set Test-Time Adaptation

Abstract:

Open-set test-time adaptation (TTA) involves refining models on incoming data to handle input shifts and the presence of unknown output classes. Although recent advancements have significantly boosted in-distribution (InD) accuracy for known categories, the capacity of these methods to reliably identify out-of-distribution (OOD) unknown classes has not been thoroughly investigated.

To address this gap, we benchmark several robust and open-set TTA approaches (SAR, OSTTA, UniEnt, and SoTTA) on standard corruption benchmarks: CIFAR-10-C at a smaller scale and ImageNet-C at a larger scale. For the CIFAR-10-C evaluations, OOD data is sourced from SVHN and CIFAR-100 in their corrupted versions, SVHN-C and CIFAR-100-C, respectively. For ImageNet-C, OOD data is sourced from ImageNet-O and Textures, likewise in corrupted form as ImageNet-O-C and Textures-C.

The choice of these datasets reflects varying degrees of semantic distance from the target domain. ImageNet-O is considered closer to ImageNet, containing unknown but semantically related object classes (such as "garlic bread" versus "hot dog" for food items, or "highway" versus "dam" for infrastructure). Conversely, Textures represents a greater distance from ImageNet, consisting of non-object patterns like "cracked" mud, "porous" sponges, and "veined" leaves.

Our study assesses the accuracy and confidence of TTA methods on InD versus OOD recognition for both CIFAR-10-C and ImageNet-C. We validate the effectiveness of each method’s native OOD detection mechanism on CIFAR-10-C. We then extend the evaluation to ImageNet-C, reporting both accuracy and standard OOD detection metrics.
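The abstract does not name the "standard OOD detection metrics" it reports. As an illustration only, the sketch below assumes two metrics that are widely used for this purpose, AUROC and the false positive rate at 95% true positive rate, computed from per-sample confidence scores where a higher score means "more likely InD". The function names `auroc` and `fpr_at_tpr` are hypothetical and are not taken from the paper.

```python
import math


def auroc(ind_scores, ood_scores):
    """Probability that a random InD sample scores above a random OOD sample.

    Ties count as half a win. This pairwise form is O(n*m) and is meant
    for illustration, not for large evaluation sets.
    """
    wins = 0.0
    for s_in in ind_scores:
        for s_out in ood_scores:
            if s_in > s_out:
                wins += 1.0
            elif s_in == s_out:
                wins += 0.5
    return wins / (len(ind_scores) * len(ood_scores))


def fpr_at_tpr(ind_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted at the threshold that keeps
    at least `tpr` of the InD samples (InD is the positive class)."""
    ranked = sorted(ind_scores, reverse=True)
    # Smallest number of InD samples that must be accepted to reach the target TPR.
    k = math.ceil(tpr * len(ranked))
    threshold = ranked[k - 1]
    accepted_ood = sum(1 for s in ood_scores if s >= threshold)
    return accepted_ood / len(ood_scores)


# Toy confidence scores (made up for this example, not results from the paper).
ind = [0.9, 0.8, 0.7, 0.6]
ood = [0.5, 0.65, 0.2, 0.1]
print(auroc(ind, ood))
print(fpr_at_tpr(ind, ood))
```

A perfect detector gives an AUROC of 1.0 and an FPR of 0.0. In the toy data, one OOD score (0.65) overlaps the InD range, so both metrics fall short of perfect.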

To simulate more realistic scenarios, we also analyze settings where the proportions and rates of OOD data fluctuate. Additionally, to investigate the trade-off between InD recognition and OOD rejection, we introduce a new baseline that substitutes the traditional softmax/multi-class output with a sigmoid/multi-label output.
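To see why the output layer matters for OOD rejection, consider the minimal sketch below. It is not the paper's baseline: the fixed 0.5 threshold, the reject-on-low-top-score rule, and the helper name `predict_or_reject` are assumptions made for illustration. Softmax scores must sum to one, so an input that fits no known class can still receive a confident top score. Sigmoid scores are independent per class, so all of them can be low at once.

```python
import math


def softmax(logits):
    """Multi-class output: scores compete and always sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Multi-label output: each class is scored independently."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def predict_or_reject(scores, threshold=0.5):
    """Return the top class index, or None if its score is below the
    threshold (hypothetical rejection rule, assumed for this sketch)."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= threshold else None


# All logits are negative: no known class fits this input well.
logits = [-2.0, -6.0, -6.0]

print(predict_or_reject(softmax(logits)))  # class 0 is accepted
print(predict_or_reject(sigmoid(logits)))  # None: input is rejected
```

With these logits the softmax top score is about 0.96, because class 0 is merely less unlikely than the others, while the sigmoid top score is about 0.12. The same rejection rule therefore accepts the input under softmax and rejects it under sigmoid.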

Our findings reveal, for the first time, that existing open-set TTA methods struggle to balance InD and OOD accuracy. Moreover, these methods only imperfectly filter out OOD data during their own adaptation updates.
