arXiv

SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training

June 2, 2026 · Zhongyu He, Yuanfan Li, Fei Huang, Tianyu Chen, Siyuan Chen, Xingyang Li, Meng Hsuan Yu, Xiangrong Liu, Leyi Wei, Lu Pan, Ke Zeng, Xunliang Cai

Abstract: Long-horizon Large Language Model (LLM) agents can benefit substantially from reusable skills, but current skill-based approaches often need an external skill generator during training or continuous skill retrieval at inference. These dependencies add engineering complexity, lengthen the context, and increase deployment latency. To address this, we introduce Self-Internalizing Reinforcement learning with Intrinsic skills (SIRI), a three-phase framework in which the agent discovers, validates, and internalizes skills on its own, with no external skill generator and no inference-time skill repository.

SIRI first warms up the policy with GiGPO to establish basic interaction ability and to collect successful trajectories without explicit skills. It then performs self-skill mining: the current policy extracts compact skills from its own successful plain (skill-free) rollouts and validates each skill by comparing paired skill-augmented and skill-free trajectories. In the final phase, SIRI distills only the action tokens guided by beneficial skills into the plain policy, using both trajectory-level utility and action-level advantage. At inference time, the agent therefore uses only the original prompt.
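The validation and selective-distillation steps can be illustrated with a minimal sketch. The abstract does not give the utility formula, the thresholds, or how the two signals are combined, so everything below is an assumption made for illustration: `PairedRollout`, `skill_utility`, `distillation_mask`, and the `margin` parameter are hypothetical names, utility is taken to be the mean return difference over paired rollouts, and an action step is kept only when the skill's utility and the step's advantage are both positive.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PairedRollout:
    """One task attempted twice: with the skill in the prompt and without it."""
    return_with_skill: float
    return_plain: float


def skill_utility(pairs: List[PairedRollout]) -> float:
    """Assumed utility: mean (skill-augmented minus skill-free) return."""
    if not pairs:
        return 0.0
    total = sum(p.return_with_skill - p.return_plain for p in pairs)
    return total / len(pairs)


def distillation_mask(
    utility: float,
    action_advantages: List[float],
    margin: float = 0.0,
) -> List[bool]:
    """Mark which action steps of a skill-guided trajectory to distill.

    Assumed rule: drop the whole trajectory if the skill's utility does not
    exceed `margin`; otherwise keep only steps with positive advantage.
    """
    if utility <= margin:
        return [False] * len(action_advantages)
    return [adv > 0.0 for adv in action_advantages]


if __name__ == "__main__":
    pairs = [
        PairedRollout(return_with_skill=1.0, return_plain=0.0),
        PairedRollout(return_with_skill=1.0, return_plain=1.0),
        PairedRollout(return_with_skill=0.0, return_plain=0.0),
    ]
    utility = skill_utility(pairs)
    print(distillation_mask(utility, [0.5, -0.2, 0.1]))
```

In this toy setup the skill helps on one of three paired tasks, so its utility is positive and the mask keeps the first and third action steps. A real implementation would apply such a mask to token-level losses during distillation; how SIRI actually weights those tokens is not stated in the abstract.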

On the ALFWorld and WebShop benchmarks with Qwen2.5-7B-Instruct, SIRI improves on GiGPO, raising scores from 0.908 to 0.930 on ALFWorld and from 0.728 to 0.813 on WebShop. It also outperforms prompt-based, RL-based, and memory-augmented baselines. Further analysis indicates that the self-mining strategy performs comparably to distillation from closed-source large models. The source code is available at https://github.com/kirito618/SIRI.
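For a sense of scale, the reported scores can be converted into absolute and relative gains over GiGPO. The snippet below only does that arithmetic on the four numbers quoted above; the helper name `gains` and the dictionary layout are illustrative.

```python
# Scores quoted in the abstract: (GiGPO baseline, SIRI).
reported = {
    "ALFWorld": (0.908, 0.930),
    "WebShop": (0.728, 0.813),
}


def gains(baseline: float, improved: float) -> tuple:
    """Return (absolute gain, relative gain in percent) over the baseline."""
    absolute = improved - baseline
    relative_pct = 100.0 * absolute / baseline
    return absolute, relative_pct


if __name__ == "__main__":
    for name, (baseline, improved) in reported.items():
        absolute, relative_pct = gains(baseline, improved)
        print(f"{name}: +{absolute:.3f} absolute, +{relative_pct:.1f}% relative")
```

The improvement is about 0.022 on ALFWorld (roughly 2.4% relative) and about 0.085 on WebShop (roughly 11.7% relative), so the larger gain is on WebShop, where the baseline leaves more headroom.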
