arXiv

MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU

June 4, 2026 · Kun Cheng, Songshuo Lu, Sicong Liao, Tankun Li, Yafei Zhang, Dong Yang, Qiheng Lv, Hua Wang, Zhi Chen, Yaohua Tang

Title: MusaCoder: Achieving Native GPU Kernel Generation via Full-Stack Training on Moore Threads Architecture

Abstract:

Native GPU kernel generation must turn high-level tensor programs into efficient, executable low-level code. Existing large language models (LLMs) struggle with this task, and execution-based reinforcement learning (RL) approaches are often hindered by sparse rewards, reward hacking, and training instability. To address these challenges, we introduce MusaCoder, a full-stack training framework for native GPU kernel generation across both CUDA and MUSA backends.

MusaCoder integrates three components: progressive kernel-oriented data synthesis, diversity-preserving rejection fine-tuning, and execution-feedback reinforcement learning built on MooreEval, a distributed verifier and reward environment. To stabilize RL, the framework employs three mechanisms: PrimeEcho, which anchors multi-turn rewards to the first turn; Buffered Dynamic Retry, which recovers training signal from hard samples that fail completely; and MirrorPop, which filters off-policy sequences.
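To make the idea of an execution-feedback reward concrete, the following is a minimal sketch of how a verifier's result might be mapped to a scalar reward. It is not MooreEval's actual reward: the abstract does not specify one, so `ExecutionResult`, `execution_reward`, the reward values, and `speedup_cap` are all hypothetical. The sketch gates any speedup bonus on correctness and caps the bonus, two common guards against reward hacking through bogus timings.

```python
from dataclasses import dataclass


@dataclass
class ExecutionResult:
    """Hypothetical verifier output for one generated kernel."""
    compiled: bool      # did the kernel compile?
    correct: bool       # did outputs match the reference?
    baseline_ms: float  # reference implementation runtime
    kernel_ms: float    # generated kernel runtime


def execution_reward(result: ExecutionResult, speedup_cap: float = 4.0) -> float:
    """Map an execution result to a scalar reward (illustrative values only)."""
    if not result.compiled:
        return -1.0  # assumed penalty for code that does not build
    if not result.correct:
        return 0.0   # no speedup credit without correctness
    if result.baseline_ms <= 0 or result.kernel_ms <= 0:
        return 0.0   # reject implausible timings
    speedup = result.baseline_ms / result.kernel_ms
    # Base reward for correctness plus a capped speedup bonus.
    return 1.0 + min(speedup, speedup_cap)


if __name__ == "__main__":
    ok = ExecutionResult(compiled=True, correct=True, baseline_ms=2.0, kernel_ms=1.0)
    print(execution_reward(ok))
```

Under this scheme a kernel that fails to compile or produces wrong outputs earns no speedup credit at all, which illustrates why such rewards are sparse early in training.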

Evaluations on KernelBench and a MUSA-ported variant show that MusaCoder surpasses strong open-source and proprietary baselines in both correctness and measured speedup. Specifically, the 9B model performs on par with or better than leading closed-source models, while the 27B model sets a new state of the art. These findings demonstrate the efficacy of full-stack execution-feedback training for native kernel generation and validate that Moore Threads GPUs can support the entire LLM post-training stack, offering a practical foundation for optimizing large models on emerging accelerators.
