B-GRTO: Bootstrapped Group Relative Tool Optimization for Referring Segmentation
Abstract: Pixel-level scene understanding, a core component of computer vision, relies heavily on segmentation, which supports critical applications such as medical image analysis and autonomous perception. For complex referring segmentation, contemporary approaches typically combine large vision-language models with segmentation decoders: the language model processes the image and prompt, and the decoder generates the target mask. Although reinforcement learning has proven effective for enhancing reasoning-capable vision-language systems, trainable components such as segmentation decoders are usually optimized with separate, differentiable objectives. The theoretical integration of these objectives into reinforcement learning frameworks remains largely unexamined. To address this, we propose Group Relative Tool Optimization (GRTO), a rigorous mathematical framework for jointly optimizing a policy alongside differentiable tool usage. GRTO leverages rollouts from Group Relative Policy Optimization (GRPO) to refine the auxiliary tool objective, allowing gradients from the decoder to enhance policy rewards. Additionally, we introduce Bootstrapped-GRTO (B-GRTO), a cost-effective pre-training strategy that accelerates tool bootstrapping, resulting in faster convergence and improved performance. Evaluations on three demanding referring segmentation benchmarks show that B-GRTO significantly outperforms standard GRPO and matches or exceeds current state-of-the-art methods tailored to specific domains. These findings highlight the benefits of integrating reinforcement learning with differentiable auxiliary objectives for reasoning-intensive segmentation tasks.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
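
The abstract builds on GRPO rollouts, so a small sketch of the group-relative step may help orient readers. It shows only the standard GRPO idea of normalizing each rollout's reward against the mean and standard deviation of its own group. The mask IoU reward, the toy masks, and the function names are assumptions made for illustration. The abstract does not specify the reward, and this sketch does not implement the GRTO coupling between decoder gradients and the policy.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def mask_iou(pred, target):
    """IoU between two binary masks given as flat lists of 0/1 (assumed reward)."""
    inter = sum(p & t for p, t in zip(pred, target))
    union = sum(p | t for p, t in zip(pred, target))
    return inter / union if union else 1.0


# Hypothetical group of four rollouts for one image/prompt pair.
target = [1, 1, 0, 0]
rollout_masks = [
    [1, 1, 0, 0],  # exact match
    [1, 0, 0, 0],  # under-segmented
    [1, 1, 1, 0],  # over-segmented
    [0, 0, 1, 1],  # wrong region
]
rewards = [mask_iou(m, target) for m in rollout_masks]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to form a baseline. When every rollout in a group earns the same reward, all advantages are zero and the group contributes no policy gradient.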





