MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU
Title: MusaCoder: Achieving Native GPU Kernel Generation via Full-Stack Training on Moore Threads Architecture
Abstract:
Native GPU kernel generation centers on transforming high-level tensor programs into efficient, executable low-level code. Existing large language models (LLMs) struggle in this domain, and execution-based reinforcement learning (RL) approaches are often hindered by sparse rewards, reward hacking, and training instability. To address these challenges, we introduce MusaCoder, a full-stack training framework for native GPU kernel generation across both CUDA and MUSA backends.
MusaCoder integrates three key components: progressive kernel-oriented data synthesis, diversity-preserving rejection fine-tuning, and execution-feedback reinforcement learning supported by MooreEval, a distributed verifier and reward environment. To stabilize RL training, the framework employs three specialized mechanisms: PrimeEcho, which anchors multi-turn rewards to the first turn; Buffered Dynamic Retry, which recovers training signal from hard samples that have failed completely; and MirrorPop, which filters off-policy sequences.
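To make the sparse-reward problem concrete, the following is a minimal sketch of an execution-feedback reward and of why prompts whose rollouts all fail carry no learning signal. It is not MooreEval's interface or the paper's reward: every name (`execution_reward`, `group_advantages`, `route_group`, `retry_buffer`), the reward shape, the `cap` parameter, and the use of group-normalized advantages are assumptions made for illustration only.

```python
from collections import deque
from statistics import mean, pstdev

def execution_reward(compiled, correct, baseline_ms, kernel_ms, cap=4.0):
    """Hypothetical reward: 0 for any failure, else 1 plus a capped speedup bonus."""
    if not (compiled and correct):
        return 0.0
    if baseline_ms <= 0.0 or kernel_ms <= 0.0:
        # Implausible timings are treated as failures (a simple guard
        # against one form of reward hacking).
        return 0.0
    speedup = baseline_ms / kernel_ms
    return 1.0 + min(speedup, cap) / cap

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of rollouts (assumed scheme)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical buffer of prompts to revisit later.
retry_buffer = deque()

def route_group(prompt_id, rewards):
    """Return advantages, or buffer the prompt and return None if every rollout failed."""
    if all(r == 0.0 for r in rewards):
        # All-zero rewards give all-zero advantages, so the group would
        # contribute no gradient; set it aside for a later retry instead.
        retry_buffer.append(prompt_id)
        return None
    return group_advantages(rewards)
```

In this sketch, a group such as `[0.0, 0.0, 0.0]` is buffered rather than trained on, while a mixed group such as `[0.0, 1.5]` yields one negative and one positive advantage.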
Experiments on KernelBench and a MUSA-ported variant show that MusaCoder surpasses strong open-source and proprietary baselines in both correctness and empirical speedup. Specifically, the 9B model performs on par with or better than leading closed-source models, while the 27B model sets a new state of the art. These findings highlight the efficacy of full-stack execution-feedback training for native kernel generation and show that Moore Threads GPUs can support the entire LLM post-training stack, offering a practical foundation for optimizing large models on emerging accelerators.
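For readers unfamiliar with how correctness and speedup can be combined into one score, KernelBench defines a `fast_p` metric: the fraction of tasks whose generated kernel is both correct and faster than the baseline by more than a threshold `p`. The sketch below computes it from a list of results; the abstract does not state which exact metric underlies the reported numbers, and the function signature and input layout are assumptions.

```python
def fast_p(results, p=1.0):
    """Fraction of tasks that are correct AND have speedup strictly greater than p.

    results: list of (correct: bool, speedup: float) pairs, one per task.
    """
    if not results:
        return 0.0
    hits = sum(1 for correct, speedup in results if correct and speedup > p)
    return hits / len(results)
```

With `p=0.0` the metric reduces to the plain correctness rate (for positive speedups), and raising `p` counts only kernels that beat the baseline by a larger margin.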
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC






