arXiv

An Empirical Audit of Input Encoders for Multi-Channel Signal Transformers

June 4, 2026 · Ossi Lehtinen

Title: An Empirical Evaluation of Input Encoders for Multi-Channel Signal Transformers

Abstract: When Transformers process multi-channel scalar signals, they must embed $C$ concurrent values into a single $d_{\text{model}}$-dimensional vector at each time step. This study is an empirical audit of eight input encoding strategies: a shared-scalar baseline, per-channel linear projections, orthogonality regularization, nonlinear MLP stems, block-partitioned concatenation, channel-independent and channel-as-token architectures, and projected positional encodings. We evaluate them on a synthetic benchmark engineered to make channel identity matter, and on the ETTh1 dataset to validate the findings on real-world data. Performance is measured by next-step negative log-likelihood (NLL).

The primary finding is practical near-equivalence across a broad "top tier" of methods. The standard per-channel linear projection, nn.Linear(C, $d_{\text{model}}$), performs on par with every other encoder in this tier; the differences are statistically significant but practically negligible. Two encoders are clearly inferior: the shared-scalar baseline, which collapses because of explicit information-theoretic constraints, and a channel-independent baseline inspired by PatchTST, which underperformed on both benchmarks and exhibited universal overfitting on the synthetic data.
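The contrast between the per-channel linear projection and the shared-scalar baseline can be illustrated with a small dependency-free sketch. It assumes one plausible reading of "shared-scalar" (every channel is multiplied by the same weight vector and the results are summed); the abstract does not define the baseline, so the function names, sizes, and random weights below are all hypothetical. Under that assumption the embedding depends only on the sum of the channel values, so permuting the channels leaves it unchanged.

```python
import random

def per_channel_linear(x, W, b):
    """Plain-Python analogue of nn.Linear(C, d_model) for one time step.
    W has shape d_model x C, so each channel gets its own column."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def shared_scalar(x, w, b):
    """ASSUMED form of the shared-scalar baseline: one weight vector w is
    reused for every channel, so only the channel sum reaches the model."""
    s = sum(x)
    return [wi * s + bi for wi, bi in zip(w, b)]

def max_abs_diff(u, v):
    return max(abs(a - c) for a, c in zip(u, v))

rng = random.Random(0)
C, d_model = 3, 8                      # illustrative sizes, not the paper's
W = [[rng.gauss(0, 1) for _ in range(C)] for _ in range(d_model)]
w = [rng.gauss(0, 1) for _ in range(d_model)]
b = [0.0] * d_model

x_a = [1.0, 2.0, 3.0]
x_b = [3.0, 2.0, 1.0]                  # same values, channels permuted

shared_a, shared_b = shared_scalar(x_a, w, b), shared_scalar(x_b, w, b)
linear_a, linear_b = per_channel_linear(x_a, W, b), per_channel_linear(x_b, W, b)

# Both inputs have channel sum 6.0, so the shared-scalar embeddings coincide.
print("shared-scalar difference:", max_abs_diff(shared_a, shared_b))
# The per-channel projection keeps the two inputs apart.
print("per-channel difference:  ", max_abs_diff(linear_a, linear_b))
```

In this sketch the two permuted inputs are indistinguishable after the shared-scalar embedding, which is the kind of information loss no downstream layer can undo.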

Paired tests clarify two minor performance gaps. First, passing the sinusoidal positional encoding through a learned linear layer gives a slight advantage at small $C$; a direct geometric analysis attributes this improvement to positional-channel orthogonalization. Second, a nonlinear MLP stem has a marginal edge at the largest $C$ tested, though this advantage shrinks as more training data becomes available. We therefore recommend nn.Linear(C, $d_{\text{model}}$) as the default, resorting to more complex architectures only when specific task requirements call for them. All code and data needed to reproduce the experiments in this paper are available at https://github.com/OssiLehtinen/channel-encoder-audit.
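To make "positional-channel orthogonalization" concrete, the following sketch computes one possible overlap diagnostic: the mean absolute cosine between positional vectors and channel embedding directions. It is not the paper's analysis, which the abstract does not specify. The channel directions are taken to be the first two coordinate axes, and the "projection" matrix A is hand-built to zero out those axes rather than learned; both choices are assumptions made only to show what the diagnostic reports when orthogonalization occurs.

```python
import math

def sinusoidal_pe(t, d_model):
    """Standard sinusoidal positional encoding for position t (d_model even)."""
    pe = []
    for i in range(0, d_model, 2):
        angle = t / (10000 ** (i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_abs_cosine(pos_vectors, channel_dirs):
    vals = [abs(cosine(p, c)) for p in pos_vectors for c in channel_dirs]
    return sum(vals) / len(vals)

d_model, C, T = 8, 2, 16               # illustrative sizes, not the paper's

# ASSUMPTION: channel directions are the first C coordinate axes.
channel_dirs = [[1.0 if k == j else 0.0 for k in range(d_model)] for j in range(C)]

# Hypothetical stand-in for a learned layer: identity with the channel
# coordinates zeroed, i.e. a projector onto the complement of the channel axes.
A = [[1.0 if (r == c and r >= C) else 0.0 for c in range(d_model)]
     for r in range(d_model)]

plain = [sinusoidal_pe(t, d_model) for t in range(T)]
projected = [matvec(A, p) for p in plain]

overlap_plain = mean_abs_cosine(plain, channel_dirs)
overlap_projected = mean_abs_cosine(projected, channel_dirs)
print("overlap without projection:", overlap_plain)
print("overlap with projection:   ", overlap_projected)
```

A trained layer would not be constrained to this form; the point is only that a linear map applied to the positional encoding has enough freedom to move positional information out of the directions used by the channel embeddings.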
