Food-R1: A Unified Multi-Task Food Vision-Language Model with Reinforcement Learning
Abstract:
Recent research has applied Vision-Language Models (VLMs) to food analysis, but most current approaches rely heavily on supervised fine-tuning (SFT), which often limits the models' ability to reason and generalize. The field also lacks high-quality, large-scale nutritional annotations. To address both problems, we present CalorieBench-80K, a benchmark with curated calorie labels and dietary advice annotations. To our knowledge, it is the first food image benchmark to include Chain-of-Thought (CoT) annotations for calorie reasoning.
We also introduce Food-R1, a unified food VLM trained within a multi-task learning framework to improve its versatility. Training proceeds in two stages: CoT-based cold-start instruction tuning, followed by reinforcement fine-tuning (RFT) with Group Relative Policy Optimization (GRPO) to improve both reasoning and overall performance. Experiments on CalorieBench-80K and several representative benchmarks show that Food-R1 consistently surpasses strong baseline methods across a range of food-related tasks. The code, model weights, and benchmark annotations are publicly available in the project repository.
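As a rough illustration of the group-relative idea behind GRPO, the sketch below normalizes the rewards of several responses sampled for the same prompt against their own group mean and standard deviation. It is not taken from the Food-R1 code: the function name `group_relative_advantages`, the example reward values, the `eps` constant, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize per-response rewards within one sampled group.

    Hypothetical helper: each advantage is (reward - group mean) divided by
    (group std + eps), so responses are scored relative to their siblings
    rather than against a learned value function.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four responses sampled for one food image prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and those below it a negative one; if every response in the group gets the same reward, all advantages are zero and the group contributes no learning signal.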
Source: arXiv Generated at: 2026-06-04 00:00:00 UTC






