Wavelet as Tokenizer: Preliminary Results on a Shared Wavelet Token Schema for Natural Signals
Abstract
This study investigates whether a single wavelet token schema can serve audio, images, and video, replacing the usual practice of a distinct latent grid for each modality. We present an early-stage continuous-token model that uses a one-level Haar Discrete Wavelet Transform (DWT) and its inverse (IDWT) as the frontend. The architecture consists of a shared layout for coefficient tokens, optional structural metadata, lightweight adapters for modality-specific values, and a common token-wise encoder-decoder backbone.
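As a point of reference for the frontend named above, a one-level orthonormal Haar DWT and its inverse can be sketched in a few lines of NumPy. This is only a minimal 1D illustration (the audio case); the function names `haar_dwt_1d` and `haar_idwt_1d` are hypothetical, and the abstract does not state how the coefficients are normalized or arranged into tokens, so the orthonormal scaling used here is an assumption.

```python
import numpy as np

def haar_dwt_1d(x):
    """One-level orthonormal Haar DWT along the last axis (even length assumed)."""
    x = np.asarray(x, dtype=np.float64)
    if x.shape[-1] % 2 != 0:
        raise ValueError("signal length must be even for a one-level Haar DWT")
    even, odd = x[..., 0::2], x[..., 1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-pass (approximation) coefficients
    detail = (even - odd) / np.sqrt(2.0)   # high-pass (detail) coefficients
    return approx, detail

def haar_idwt_1d(approx, detail):
    """Inverse of haar_dwt_1d: interleave the reconstructed even and odd samples."""
    approx = np.asarray(approx, dtype=np.float64)
    detail = np.asarray(detail, dtype=np.float64)
    out = np.empty(approx.shape[:-1] + (2 * approx.shape[-1],), dtype=np.float64)
    out[..., 0::2] = (approx + detail) / np.sqrt(2.0)
    out[..., 1::2] = (approx - detail) / np.sqrt(2.0)
    return out

signal = np.array([4.0, 2.0, 5.0, 7.0])
approx, detail = haar_dwt_1d(signal)
restored = haar_idwt_1d(approx, detail)
```

With this scaling the transform is orthonormal, so it reconstructs the input exactly (up to floating-point error) and preserves signal energy, which is what makes coefficient energy a meaningful quantity to rank tokens by.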
Evaluations on the Speech Commands, EuroSAT RGB, and DAVIS 2017 datasets show that the dense shared model reaches peak signal-to-noise ratio (PSNR) scores of 39.92 dB for audio, 29.37 dB for images, and 23.93 dB for video. A matched-rate sweep over continuous latent scalar budgets indicates that the visual performance improvements cannot be attributed solely to increased latent capacity. The experiments also indicate that adding metadata embeddings does not yield consistent gains across all settings.
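For readers less familiar with the metric, PSNR is computed from the mean squared error between a reference signal and its reconstruction. The sketch below uses the standard definition; the helper name `psnr` and the default peak value of 1.0 are assumptions, since the abstract does not state the signal range or normalization used in the evaluation.

```python
import numpy as np

def psnr(reference, reconstruction, peak=1.0):
    """Standard PSNR in dB: 10 * log10(peak^2 / MSE)."""
    reference = np.asarray(reference, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0.0:
        return float("inf")  # identical signals: PSNR is unbounded
    return float(10.0 * np.log10(peak ** 2 / mse))

# A constant error of 0.1 on a unit-range signal gives an MSE of 0.01.
value = psnr(np.zeros(4), np.full(4, 0.1))
```

Because the scale is logarithmic, each 10 dB of PSNR corresponds to a tenfold reduction in mean squared error.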
Under compressed keep ratios, fixed-rate energy selection proves to be a robust non-parametric baseline: compared with uniform selection, it raises average PSNR by 16.73 dB for audio, 16.90 dB for images, and 15.86 dB for video. Moreover, masked sparse training achieves a video PSNR of 34.45 dB using only 50% of the tokens required by the dense model. These findings support a unified wavelet token schema and a sparse token interface, but they do not establish the viability of a universal discrete vocabulary.
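The contrast between the two selection rules can be illustrated with a small sketch. It assumes that a token's energy is the sum of its squared coefficients and that uniform selection means evenly strided positions; neither definition is given in the abstract, and the names `select_by_energy`, `select_uniform`, and `mask_tokens` are hypothetical.

```python
import numpy as np

def select_by_energy(tokens, keep_ratio):
    """Keep the fixed fraction of tokens with the largest squared-coefficient energy."""
    tokens = np.asarray(tokens, dtype=np.float64)
    n_keep = max(1, int(round(keep_ratio * tokens.shape[0])))
    energy = np.sum(tokens ** 2, axis=1)
    top = np.argsort(-energy, kind="stable")[:n_keep]  # highest energy first
    return np.sort(top)                                # restore positional order

def select_uniform(n_tokens, keep_ratio):
    """Keep the same number of tokens at evenly strided positions (assumed meaning)."""
    n_keep = max(1, int(round(keep_ratio * n_tokens)))
    return (np.arange(n_keep) * n_tokens) // n_keep

def mask_tokens(tokens, keep_idx):
    """Zero every token that was not selected."""
    tokens = np.asarray(tokens, dtype=np.float64)
    masked = np.zeros_like(tokens)
    masked[keep_idx] = tokens[keep_idx]
    return masked

# Four single-coefficient tokens; two of them carry almost all the energy.
tokens = np.array([[0.1], [5.0], [0.2], [3.0]])
energy_idx = select_by_energy(tokens, keep_ratio=0.5)
uniform_idx = select_uniform(len(tokens), keep_ratio=0.5)
energy_kept = np.sum(mask_tokens(tokens, energy_idx) ** 2)
uniform_kept = np.sum(mask_tokens(tokens, uniform_idx) ** 2)
```

At the same keep ratio, the energy rule retains the large coefficients while the strided rule keeps whatever falls on its grid. For an orthonormal transform the discarded energy equals the squared reconstruction error, which gives some intuition for why energy selection would score higher PSNR at a fixed rate.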
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC



