arXiv

Information-Theoretic Lower Bounds for Bit-Constrained Stochastic Optimization via a Reduction to Compressed Gaussian Mean Estimation

June 2, 2026 · Munsik Kim

Title: Deriving Information-Theoretic Lower Bounds for Bit-Constrained Stochastic Optimization through Reduction to Compressed Gaussian Mean Estimation

Abstract:

Low-precision pretraining formats such as FP8, MXFP4, and NVFP4 have become standard for state-of-the-art language models, yet the existing literature focuses almost exclusively on achievability (algorithms and empirical scaling laws) and lacks a corresponding characterization of information-theoretic limits. This study examines a B-bit quantized stochastic first-order oracle: an optimizer runs for T rounds and, in each round, receives a B-bit adaptive public-coin description of its stochastic gradient.

Our primary contribution is an exact reduction that maps the optimization of a strongly convex quadratic family to the problem of interactively compressed Gaussian mean estimation. Under the constraints of the B-bit oracle, the query itself conveys no information, causing the optimization process to collapse precisely into a sequential distributed-estimation problem. This approach yields three unconditional lower bounds: a communication bound of $TB = \Omega(d)$, a statistical bound of $T = \Omega(\sigma^2 d / \epsilon^2)$, and a sharp product-form bound of $T = \Omega((\sigma^2 d / \epsilon^2) \max\{1, d/B\})$.
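To make the scaling of the three bounds concrete, the following minimal sketch evaluates each expression with every hidden constant set to 1. The function name `lower_bounds` and its parameter names are hypothetical, and the real bounds hold only up to unspecified constant factors, so the outputs indicate orders of magnitude rather than exact thresholds.

```python
def lower_bounds(d, B, sigma, eps):
    """Evaluate the three round-count lower bounds with all constants set to 1.

    d     : problem dimension
    B     : bits per round
    sigma : gradient noise standard deviation
    eps   : target accuracy, as it appears in the stated bounds

    Assumption: the Omega(.) constants are dropped, so only the scaling
    in (d, B, sigma, eps) is meaningful.
    """
    communication = d / B                          # from T * B = Omega(d)
    statistical = sigma**2 * d / eps**2            # from T = Omega(sigma^2 d / eps^2)
    product_form = statistical * max(1.0, d / B)   # extra factor max{1, d/B}
    return {
        "communication": communication,
        "statistical": statistical,
        "product_form": product_form,
    }


if __name__ == "__main__":
    # Fewer bits than dimensions: the product-form bound exceeds the statistical one.
    print(lower_bounds(d=1024, B=256, sigma=2.0, eps=0.5))
    # At least d bits per round: the two bounds coincide.
    print(lower_bounds(d=64, B=256, sigma=2.0, eps=0.5))
```

When $B \ge d$ the factor $\max\{1, d/B\}$ equals 1 and the product-form bound reduces to the statistical bound. When $B < d$ the required number of rounds grows by the ratio $d/B$.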

The product-form bound is derived unconditionally by noting that a B-bit transcript contains at most $O(TB / \sigma^2)$ Fisher trace information regarding the mean; thus, when $B < d$, the number of bits, rather than the dimension, restricts the recoverable information. By combining this insight with the multivariate van Trees inequality, we establish the bound directly, avoiding the need for bounded-likelihood-ratio truncation. We also present a near-matching achievability result, featuring exact per-round bit accounting under a bounded-dynamic-range oracle, which is tight up to a logarithmic factor. Note that the lower bound applies to truly Gaussian (unbounded) gradients, whereas the achievability result assumes a bounded dynamic range; closing the gap between these two oracle types remains an open question.
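The van Trees step can be sketched as follows. This is a reconstruction under stated assumptions, not the paper's proof: $\pi$ is a hypothetical smooth prior on the mean $\theta \in \mathbb{R}^d$ whose Fisher information has trace $J(\pi)$, $I_{\mathrm{tr}}(\theta)$ denotes the Fisher information matrix of the $T$-round transcript, $c \ge 1$ is an unspecified constant, and $\epsilon$ is read as the target root-mean-squared estimation error. The multivariate van Trees inequality gives

$$\mathbb{E}\,\|\hat{\theta} - \theta\|^2 \;\ge\; \frac{d^2}{\mathbb{E}_{\pi}\!\left[\operatorname{tr} I_{\mathrm{tr}}(\theta)\right] + J(\pi)}.$$

The transcript is a function of $T$ Gaussian samples, each carrying trace information $d/\sigma^2$, so combining data processing with the bit-budget bound stated above yields

$$\operatorname{tr} I_{\mathrm{tr}}(\theta) \;\le\; \frac{c\,T \min\{d, B\}}{\sigma^2}.$$

Requiring $\mathbb{E}\,\|\hat{\theta} - \theta\|^2 \le \epsilon^2$ and assuming the prior term is small, say $J(\pi) \le d^2 / (2\epsilon^2)$, the two displays combine to

$$T \;\ge\; \frac{\sigma^2 d^2}{2c\,\epsilon^2 \min\{d, B\}} \;=\; \frac{\sigma^2 d}{2c\,\epsilon^2}\,\max\{1, d/B\},$$

which matches the product-form scaling $T = \Omega((\sigma^2 d / \epsilon^2) \max\{1, d/B\})$.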

Furthermore, a sequential rate-distortion perspective extends this reduction to cover correlated and drifting oracles, correcting a previous conjecture: positive noise correlation increases the bound by a factor of $(1+\rho)/(1-\rho)$ rather than reducing it. Ultimately, these bounds establish an information-theoretic baseline for any low-bit gradient pathway, rather than serving as an optimality claim for currently deployed FP4 systems.
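One standard setting in which the factor $(1+\rho)/(1-\rho)$ arises is the variance of a sample mean under stationary AR(1) noise, where the lag-$k$ correlation is $\rho^k$. The abstract does not state the noise model, so the AR(1) choice and the function names below are assumptions made purely for illustration. The sketch covers only the unquantized sample mean, not the sequential rate-distortion argument.

```python
def mean_variance_inflation(rho, T):
    """Return Var(sample mean of T samples) divided by sigma^2 / T.

    Assumption (hypothetical model): stationary noise with variance sigma^2
    and lag-k correlation rho**k, as in an AR(1) process with |rho| < 1.
    Summing the T x T covariance matrix gives
        Var(mean) = (sigma^2 / T) * (1 + 2 * sum_{k=1}^{T-1} (1 - k/T) * rho^k).
    """
    return 1.0 + 2.0 * sum((1.0 - k / T) * rho**k for k in range(1, T))


def asymptotic_factor(rho):
    """Large-T limit of the inflation ratio: (1 + rho) / (1 - rho)."""
    return (1.0 + rho) / (1.0 - rho)


if __name__ == "__main__":
    for rho in (-0.5, 0.0, 0.5, 0.9):
        finite = mean_variance_inflation(rho, T=2000)
        limit = asymptotic_factor(rho)
        print(f"rho={rho:+.1f}  finite-T ratio={finite:.4f}  limit={limit:.4f}")
```

For positive $\rho$ the ratio exceeds 1, so correlated samples carry less information about the mean than independent ones, consistent with the bound increasing rather than decreasing.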

Source: arXiv Generated at: 2026-06-02 00:00:00 UTC