arXiv

POLARIS: Guiding Small Models to Write Long Stories

June 4, 2026 · Rishanth Rajendhran, Jenna Russell, Mohit Iyyer, John Frederick Wieting

Title: POLARIS: Steering Compact Models Toward Long-Form Storytelling

Abstract

Small open-weight language models struggle with long-form creative writing: their outputs often miss the requested length, or their narrative quality degrades as the target length grows, especially relative to state-of-the-art frontier models. To address this, we introduce POLARIS (Policy Optimization with LLM-as-a-judge rewards and Anchored-Reference Injection for Storywriting), a compute-efficient Group Relative Policy Optimization (GRPO) framework. It combines two components: an online reward computed by a frontier LLM that scores each story against a structured Story Quality rubric, and Human-Reference Injection (HRI), which adds a teacher-forced, human-written story to each GRPO batch as a high-reward anchor.
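To make the role of the anchor concrete, the following is a minimal sketch of how a reference story's reward could enter a group-relative advantage computation. It is not the authors' implementation. It assumes the standard GRPO normalization (reward minus group mean, divided by group standard deviation), a hypothetical 0-10 judge scale, and that the human reference is simply assigned a high score; the function names and all numbers are illustrative.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def advantages_with_reference(rollout_rewards, reference_reward):
    """Add the human reference's reward to the group before normalizing.

    Returns (rollout_advantages, reference_advantage). Assumption: the
    reference is treated like one more group member with a high reward.
    """
    group = list(rollout_rewards) + [reference_reward]
    advantages = group_relative_advantages(group)
    return advantages[:-1], advantages[-1]


# Hypothetical judge scores (0-10 scale) for four sampled stories.
rollouts = [4.0, 5.0, 6.0, 5.0]
rollout_adv, ref_adv = advantages_with_reference(rollouts, reference_reward=10.0)

print("rollout advantages:", [round(a, 3) for a in rollout_adv])
print("reference advantage:", round(ref_adv, 3))
```

Under these assumptions the reference raises the group mean, so every sampled story scoring below that mean gets a negative advantage while the reference itself gets a positive one. This is one plausible reading of "high-reward anchor", and it gives a learning signal even when the sampled stories score similarly to each other.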

We apply this method to Qwen3.5-9B, training on roughly 1,400 prompt-story pairs extracted from 100 short-story anthologies using four A100 GPUs; the resulting model is POLARIS-9B. Across five benchmarks covering both in-distribution and out-of-distribution prompts and rubrics, POLARIS-9B is competitive with significantly larger open-weight models and adheres more closely to length constraints. In blinded human evaluations, POLARIS-9B is preferred over the baseline Qwen3.5-9B and performs comparably to the larger Qwen3.5-27B. Notably, although trained only on stories of up to 4,000 words, POLARIS-9B maintains high quality when prompted for stories three times that length, a regime in which most open-weight models decline substantially in both quality and length compliance. More broadly, our findings suggest that length generalization is a demanding stress test for creative-writing models and a useful signal for distinguishing closely matched models.
