arXiv

Family Matters: A Systematic Study of Spatial vs. Frequency Masking for Continual Test-Time Adaptation

June 2, 2026 · Chandler Timm C. Doloriel, Yunbei Zhang, Yeonguk Yu, Taki Hasan Rafi, Muhammad Salman Siddiqui, Tor Kristian Stevik, Fadi Al Machot, Kristian Hovde Liland, Habib Ullah

Abstract:

Recent approaches to continual test-time adaptation (CTTA) use masked image modeling to mitigate the learning instability caused by distribution shifts, but they typically treat the masking family ($F$) as a fixed design choice. Innovation has therefore concentrated on the selection strategy ($S$), leaving the family dimension largely underexplored. This paper presents a systematic empirical study designed to isolate the effect of that axis. We use a controlled CTTA instantiation, Mask to Adapt (M2A), which fixes the selection strategy to random sampling and uses standard loss functions. Within this framework, we vary only the masking family, comparing spatial masking (pixel, patch) against frequency masking (all-band, low-band, high-band), while holding all other components constant.
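To make the two families concrete, the following is a minimal NumPy sketch of random masks drawn from each one. It is an illustration under stated assumptions, not the M2A implementation: the function names, the `ratio`, `patch`, and `cutoff` parameters, and the radial definition of the low and high bands are all hypothetical choices made for this example.

```python
import numpy as np

def pixel_mask(shape, ratio, rng):
    """Spatial family (pixel): drop each pixel independently with probability `ratio`."""
    h, w = shape
    return (rng.random((h, w)) >= ratio).astype(np.float32)

def patch_mask(shape, ratio, rng, patch=16):
    """Spatial family (patch): drop whole patch x patch blocks at random.
    Assumes h and w are divisible by `patch` (an assumption of this sketch)."""
    h, w = shape
    keep = (rng.random((h // patch, w // patch)) >= ratio).astype(np.float32)
    # Expand each grid cell into a constant patch x patch block.
    return np.kron(keep, np.ones((patch, patch), dtype=np.float32))

def frequency_mask(shape, ratio, rng, band="all", cutoff=0.25):
    """Frequency family: drop random Fourier coefficients inside the chosen band.
    `cutoff` (cycles per pixel) separates low from high; the value is hypothetical."""
    h, w = shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    if band == "all":
        eligible = np.ones((h, w), dtype=bool)
    elif band == "low":
        eligible = radius <= cutoff
    elif band == "high":
        eligible = radius > cutoff
    else:
        raise ValueError(f"unknown band: {band}")
    drop = eligible & (rng.random((h, w)) < ratio)
    return (~drop).astype(np.float32)

def apply_mask(image, mask, family):
    """Apply a mask to an (H, W, C) float image in the pixel or Fourier domain."""
    if family == "spatial":
        return image * mask[..., None]
    if family == "frequency":
        spectrum = np.fft.fft2(image, axes=(0, 1))
        masked = spectrum * mask[..., None]
        # A random mask is not conjugate-symmetric, so the inverse transform
        # is complex in general; this sketch keeps only the real part.
        return np.fft.ifft2(masked, axes=(0, 1)).real
    raise ValueError(f"unknown family: {family}")
```

In this sketch the selection step is uniform random sampling in every case, so the family alone decides where information is removed: localized regions of the image for the spatial masks, or global Fourier components for the frequency masks.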

Our findings yield specific design guidance for CTTA contexts:

The masking family determines whether adaptation reinforces useful structure or amplifies errors. On architectures that use patch tokenization, spatial masking lets stable representations accumulate over long data streams, whereas frequency masking leads to catastrophic collapse. We attribute this gap to a structure-preservation mechanism: spatially coherent masks preserve the broad-spectrum redundancy required to prevent terminal overlap with a corruption’s spectral signature.
The optimal family depends on the alignment between architecture and task. The gap between families disappears in convolutional neural networks (CNNs), where overlapping receptive fields mitigate the effects of patch occlusion. Conversely, on tasks that require fine-grained global cues and with large-capacity Vision Transformers (ViTs), frequency masking is a competitive alternative.
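The intuition behind the structure-preservation argument can be illustrated with a toy measurement. The sketch below is not the paper's analysis; it only compares how much spectral energy survives in each radial band after a patch mask and after a low-band frequency mask. The band edges, the checkerboard patch pattern, and the random test image are assumptions chosen for the example.

```python
import numpy as np

def radial_frequency(h, w):
    """Radial spatial frequency (cycles per pixel) of every FFT coefficient."""
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    return np.sqrt(fx ** 2 + fy ** 2)

def band_energy(image, edges):
    """Spectral energy of a 2-D image in each band [edges[i], edges[i + 1])."""
    power = np.abs(np.fft.fft2(image)) ** 2
    radius = radial_frequency(*image.shape)
    return np.array([power[(radius >= lo) & (radius < hi)].sum()
                     for lo, hi in zip(edges[:-1], edges[1:])])

rng = np.random.default_rng(0)
image = rng.random((64, 64))              # stand-in for a grayscale test image
edges = np.array([0.0, 0.1, 0.25, 0.8])   # hypothetical low / mid / high bands

# Spatial family: hide half of the 16 x 16 patches (checkerboard pattern).
keep = (np.indices((4, 4)).sum(axis=0) % 2).astype(float)
patch_masked = image * np.kron(keep, np.ones((16, 16)))

# Frequency family: remove every coefficient in the lowest band.
spectrum = np.fft.fft2(image)
spectrum[radial_frequency(64, 64) < 0.1] = 0
low_masked = np.fft.ifft2(spectrum).real

patch_bands = band_energy(patch_masked, edges)
low_bands = band_energy(low_masked, edges)
```

Here `patch_bands` stays non-zero in every band, because hiding patches removes local content but leaves all frequencies represented, while `low_bands[0]` is numerically zero, because the band was removed outright. If a corruption has already degraded the remaining bands, the frequency-masked input may carry little usable signal, which is one way to read the mechanism described above.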

In system-level comparisons that are confounded by differences in losses and auxiliary components, M2A’s random selection strategy performs on par with heuristic methods. However, we interpret this finding as suggestive context rather than a controlled quantification of the relative importance of the selection strategy $S$.
