GRZO: Group-Relative Zeroth-Order Optimization for Large Language Model Fine-Tuning
Abstract: Zeroth-order (ZO) optimization offers a memory-efficient alternative to backpropagation for fine-tuning large language models, but its practical use is often hindered by the high variance of its gradient estimates. To address this, we introduce GRZO, a Group-Relative Zeroth-Order optimizer. GRZO generates a single pseudo-independent perturbation for each example in a mini-batch and aggregates the per-example losses with group-relative normalization. This increases the number of usable gradient directions from one to the batch size without extra forward computation and without raising memory usage above inference levels. Theoretical analysis shows that GRZO is directionally unbiased, with variance decreasing in proportion to the batch size, which yields a tighter nonconvex convergence bound than MeZO. Empirical evaluations with RoBERTa-large, Llama3-8B, and OPT-13B on various tasks show that GRZO improves average accuracy on Llama3-8B by $+3.0$ points over MeZO while using $23\%$ less peak GPU memory. Furthermore, as a drop-in replacement for the MeZO core, it improves sparse, low-rank, and quantized ZO variants by an average of $+6.0$.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC
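
The abstract describes the update only in words, so the following is a minimal sketch of one possible reading, not the authors' implementation. It assumes that "group-relative normalization" means standardizing the per-example losses within the mini-batch (subtract the batch mean, divide by the batch standard deviation) and that each example's perturbation is regenerated from a seed rather than stored. The function names, the toy squared-error loss, and the values of `eps`, `lr`, and `delta` are all hypothetical.

```python
import numpy as np


def per_example_loss(theta, x, y):
    # Toy squared-error loss standing in for one LLM forward pass (assumption).
    return 0.5 * (x @ theta - y) ** 2


def group_relative_weights(losses, delta=1e-8):
    # Assumed form of group-relative normalization: standardize the losses
    # within the batch so each example is scored relative to its group.
    losses = np.asarray(losses, dtype=np.float64)
    return (losses - losses.mean()) / (losses.std() + delta)


def perturbation(seed, index, dim):
    # Regenerate the perturbation for example `index` from (seed, index)
    # so it never has to be kept in memory between uses.
    rng = np.random.default_rng([seed, index])
    return rng.standard_normal(dim)


def grzo_step(theta, X, y, seed, eps=1e-3, lr=1e-2):
    batch_size, dim = X.shape
    # One loss evaluation per example, each at its own perturbed parameters.
    # (A real implementation would perturb in place; this toy version copies.)
    losses = np.array([
        per_example_loss(theta + eps * perturbation(seed, i, dim), X[i], y[i])
        for i in range(batch_size)
    ])
    weights = group_relative_weights(losses)
    # Combine the batch_size directions, each weighted by its relative loss.
    direction = np.zeros(dim)
    for i in range(batch_size):
        direction += weights[i] * perturbation(seed, i, dim)
    # Step away from directions whose loss was above the batch average.
    return theta - lr * direction / batch_size
```

In this sketch a batch of size B contributes B perturbation directions to a single update, which is the property the abstract highlights; whether the paper scales by the standard deviation, by `eps`, or by something else is not stated in the abstract.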



