Fine-Tuning and Serving Gemma 4 31B on Google Cloud TPU: A Technical Comparison with GPU Baselines
Title: Benchmarking TPU vs. GPU for Fine-Tuning and Serving Gemma 4 31B on Google Cloud
Abstract
This study presents the first comprehensive, end-to-end implementation of fine-tuning and deploying Google’s Gemma 4 31B model on TPU hardware, together with an empirical comparison of TPU and GPU platforms for large language model adaptation. We fine-tuned the model with LoRA on a Google TPU v5p-8 and ran inference on a TPU v6e-8 (Trillium). We document the complete set of code-level changes needed to migrate a GPU-native training workflow (originally built on PyTorch, HuggingFace TRL, and FSDP) to the JAX ecosystem using Tunix and Qwix. These changes cover mesh configuration, LoRA module naming conventions, sharding annotation fixes, gradient checkpointing, data pipeline restructuring, and a custom Orbax-to-safetensors checkpoint merging process.
For inference, we describe the vLLM-TPU Docker configuration required to serve Gemma 4 on v6e-8 hardware and analyze its latency and throughput characteristics in detail. Against a baseline of two H100 GPUs with identical hyperparameters, TPU training was 1.61 times faster and cost 2.12 times less. Inference throughput was comparable, with differences within 3%, but the TPU achieved a lower time-to-first-token: 235 ms versus 475 ms on the GPU. As a result, the total cost of a representative workload combining training and serving is 1.82 times lower on TPU infrastructure. This work addresses a significant gap in open-source tooling and provides a reproducible, production-grade guide for deploying Gemma 4 on TPUs.
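The reported training ratios can be related to each other with simple arithmetic. The sketch below is illustrative only: it assumes that training cost equals an hourly price for the whole accelerator slice multiplied by wall-clock training time, which the abstract does not state, and the function names are hypothetical. Under that assumption, the hourly price ratio implied by the abstract follows from dividing the cost ratio by the speedup.

```python
# Ratios reported in the abstract (GPU value divided by TPU value).
TRAIN_SPEEDUP = 1.61       # TPU training is 1.61x faster than 2x H100
TRAIN_COST_RATIO = 2.12    # TPU training costs 2.12x less
TTFT_TPU_MS = 235.0        # time-to-first-token on TPU v6e-8
TTFT_GPU_MS = 475.0        # time-to-first-token on the GPU baseline


def implied_hourly_price_ratio(cost_ratio: float, speedup: float) -> float:
    """Return price_gpu / price_tpu per hour.

    ASSUMPTION (not stated in the source): cost = hourly_price * hours, so
    cost_gpu / cost_tpu = (price_gpu / price_tpu) * (hours_gpu / hours_tpu).
    """
    return cost_ratio / speedup


def latency_ratio(gpu_ms: float, tpu_ms: float) -> float:
    """Return how many times longer the GPU latency is than the TPU latency."""
    return gpu_ms / tpu_ms


price_ratio = implied_hourly_price_ratio(TRAIN_COST_RATIO, TRAIN_SPEEDUP)
ttft_ratio = latency_ratio(TTFT_GPU_MS, TTFT_TPU_MS)

print(f"Implied hourly price ratio (GPU/TPU): {price_ratio:.2f}")
print(f"Time-to-first-token ratio (GPU/TPU): {ttft_ratio:.2f}")
```

Read this way, roughly 1.61 of the 2.12 times training cost advantage would come from shorter wall-clock time and the remainder from a lower hourly price, provided the pricing assumption holds.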
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



