arXiv

A Closer Look at In-Distribution vs. Out-of-Distribution Accuracy for Open-Set Test-time Adaptation

Title: Examining the Balance Between In-Distribution and Out-of-Distribution Performance in Open-Set Test-Time Adaptation

Abstract:

Open-set test-time adaptation (TTA) adapts a model on incoming test data to handle both input shift and unknown output classes. Although recent advances have significantly boosted in-distribution (InD) accuracy on known categories, the ability of these methods to reliably identify out-of-distribution (OOD) unknown classes has not been thoroughly investigated.

To address this gap, we benchmark several robust and open-set TTA approaches, including SAR, OSTTA, UniEnt, and SoTTA, on standard corruption benchmarks: CIFAR-10-C at small scale and ImageNet-C at large scale. For CIFAR-10-C, OOD data is drawn from SVHN and CIFAR-100 in their corrupted versions, SVHN-C and CIFAR-100-C. For ImageNet-C, OOD data is drawn from ImageNet-O and Textures in their corrupted versions, ImageNet-O-C and Textures-C.

The choice of these datasets reflects varying degrees of semantic distance from the target domain. ImageNet-O is considered closer to ImageNet, containing unknown but semantically related object classes (such as "garlic bread" versus "hot dog" for food items, or "highway" versus "dam" for infrastructure). Conversely, Textures represents a greater distance from ImageNet, consisting of non-object patterns like "cracked" mud, "porous" sponges, and "veined" leaves.

Our study assesses the accuracy and confidence of TTA methods on InD versus OOD recognition for both CIFAR-10-C and ImageNet-C. On CIFAR-10-C, we check how well each method's native OOD detection mechanism works. We then extend the evaluation to ImageNet-C, reporting both accuracy and standard OOD detection metrics.
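The abstract does not say which OOD detection metrics are reported. As an illustration only, the sketch below assumes two common choices, AUROC and the false-positive rate at 95% true-positive rate, computed from per-sample confidence scores with InD treated as the positive class. The function names and the toy scores are hypothetical and are not taken from the paper.

```python
import math


def auroc(ind_scores, ood_scores):
    """Probability that a random InD sample outscores a random OOD sample.

    Ties count as half a win. This is the pairwise (Mann-Whitney) form of AUROC.
    """
    wins = 0.0
    for s_in in ind_scores:
        for s_out in ood_scores:
            if s_in > s_out:
                wins += 1.0
            elif s_in == s_out:
                wins += 0.5
    return wins / (len(ind_scores) * len(ood_scores))


def fpr_at_tpr(ind_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted at the threshold that keeps
    at least `tpr` of the InD samples."""
    ranked = sorted(ind_scores, reverse=True)
    keep = math.ceil(tpr * len(ranked))   # number of InD samples to retain
    threshold = ranked[keep - 1]          # lowest retained InD score
    accepted = sum(1 for s in ood_scores if s >= threshold)
    return accepted / len(ood_scores)


# Hypothetical confidence scores (e.g. the maximum class probability).
ind = [0.9, 0.8, 0.7, 0.6]
ood = [0.75, 0.3, 0.2, 0.1]
```

On these toy scores, `auroc(ind, ood)` evaluates to 0.875 and `fpr_at_tpr(ind, ood)` to 0.25: one OOD sample scores above two of the InD samples, and it is the only one accepted at the 95% threshold.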

To simulate more realistic scenarios, we also analyze settings where the proportions and rates of OOD data fluctuate. Additionally, to investigate the trade-off between InD recognition and OOD rejection, we introduce a new baseline that substitutes the traditional softmax/multi-class output with a sigmoid/multi-label output.
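One way to see why the output head matters for OOD rejection: softmax probabilities must sum to one, so an input that matches no class well can still receive a confident top score, whereas per-class sigmoid scores can all be low at once. The sketch below is a minimal illustration of that difference, not the paper's baseline. The logits, the 0.5 threshold, and the `predict_or_reject` helper are all assumptions made for the example.

```python
import math


def softmax(logits):
    """Multi-class output: scores are normalized to sum to 1."""
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Multi-label output: each class is scored independently in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def predict_or_reject(scores, threshold=0.5):
    """Return the top class index, or None when the top score is below
    the (hypothetical) rejection threshold."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= threshold else None


# Hypothetical logits for an OOD input: every class has weak evidence,
# but class 0 is relatively the least unlikely.
logits = [-2.0, -5.0, -5.0]
```

With these logits, the softmax top score is about 0.91, so `predict_or_reject(softmax(logits))` returns class 0, while the largest sigmoid score is about 0.12, so `predict_or_reject(sigmoid(logits))` returns `None`. The same independence can also lower scores on genuine InD inputs, which is one way the trade-off between InD recognition and OOD rejection can arise.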

Our findings reveal, for the first time, that existing open-set TTA methods struggle to balance InD and OOD accuracy. Moreover, these methods only imperfectly filter OOD data out of their own adaptation updates.


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
