arXiv

Hierarchical Online Prompt Mutation with Dual-Loop Feedback for Guardrailed Evidence Document Generation: A Production-Evaluation Case Study

June 2, 2026 · Nataraj Agaram Sundar Tejas Morabia

Title: A Production-Evaluation Case Study on Hierarchical Online Prompt Mutation with Dual-Loop Feedback for Guardrailed Evidence Document Generation

Abstract:

In high-stakes production document generation, language models must be adaptive, grounded in evidence, and fully auditable. This study introduces HOPM, a hierarchical online prompt mutation framework, and evaluates it within a real-world marketplace dispute-evidence workflow. HOPM treats prompts as dynamic policies: a family-and-version router selects the prompt for each case, and deterministic guardrails map failures to mutable prompt-token groups. The system then refines both its routing and its mutation priorities through a dual-loop feedback mechanism that combines human review with an automated judge.
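The loop described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify the routing algorithm, the failure taxonomy, or how the two feedback signals are combined. The sketch uses Thompson sampling for routing, a hypothetical `FAILURE_TO_TOKEN_GROUP` table for the guardrail mapping, and a fixed `human_weight` for blending human and judge scores. None of these names come from the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class PromptVersion:
    """One routable prompt, identified by family and version (hypothetical schema)."""
    family: str
    version: str
    wins: float = 1.0    # Beta prior pseudo-count for successes
    losses: float = 1.0  # Beta prior pseudo-count for failures


# Hypothetical guardrail taxonomy: failure type -> mutable prompt-token group.
FAILURE_TO_TOKEN_GROUP = {
    "unsupported_claim": "evidence_citation",
    "missing_field": "required_sections",
    "tone_violation": "style",
}


def route(versions, rng):
    """Thompson sampling (an assumed choice): pick the version with the largest Beta draw."""
    return max(versions, key=lambda v: rng.betavariate(v.wins, v.losses))


def blended_reward(human_score, judge_score, human_weight=0.7):
    """Combine both feedback loops into one reward in [0, 1].

    Falls back to the automated judge when no human review exists.
    The 0.7 weight is illustrative, not taken from the paper.
    """
    if human_score is None:
        return judge_score
    return human_weight * human_score + (1.0 - human_weight) * judge_score


def update(version, reward):
    """Update the routing posterior with a fractional reward in [0, 1]."""
    version.wins += reward
    version.losses += 1.0 - reward


def record_failures(failures, priorities):
    """Deterministically map guardrail failures to token groups and bump mutation priority."""
    for failure in failures:
        group = FAILURE_TO_TOKEN_GROUP.get(failure)
        if group is not None:
            priorities[group] = priorities.get(group, 0) + 1
    return priorities
```

In this sketch, one iteration would call `route`, generate a document with the chosen prompt, run the guardrails, pass any failures to `record_failures`, and then call `update` with the `blended_reward`. The token group with the highest priority would be the next candidate for mutation.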

The primary evidence is a matched production-evaluation ablation study in which seven variants were assessed on an identical set of 600 cases. This design allows a paired comparison of the complete HOPM system against baselines including static prompting, manual iteration, bandit-only routing, mutation-only adaptation, human-only feedback, and automated judge-only feedback. The complete HOPM system significantly outperforms the static control: it raised the count win rate from 34.7% to 45.7% (+11.0 percentage points; paired McNemar p = 1.31e-11) and the amount-weighted win rate from 22.3% to 41.4% (+19.1 percentage points; 95% paired bootstrap confidence interval [10.3, 28.9] pp).
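For readers unfamiliar with the two paired statistics named above, the following sketch shows how each is commonly computed on matched cases. It is not the authors' code, and it assumes the exact (binomial) form of McNemar's test and a percentile bootstrap, since the abstract states neither. The function names and the per-case `amounts` input are hypothetical, and no data from the paper is reproduced.

```python
import math
import random


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: cases won only by the control; c: cases won only by the treatment.
    Concordant pairs (both win or both lose) do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def weighted_win_rate(wins, amounts):
    """Share of total disputed amount that falls in won cases."""
    return sum(a for w, a in zip(wins, amounts) if w) / sum(amounts)


def paired_bootstrap_ci(control_wins, treat_wins, amounts,
                        n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the difference in amount-weighted win rate.

    Resamples case indices so each replicate keeps the control and
    treatment outcomes of the same case together (the pairing).
    """
    rng = random.Random(seed)
    n = len(amounts)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        amt = [amounts[i] for i in idx]
        treat = weighted_win_rate([treat_wins[i] for i in idx], amt)
        control = weighted_win_rate([control_wins[i] for i in idx], amt)
        diffs.append(treat - control)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

Because amounts are resampled together with outcomes, a few large-value cases can dominate a replicate. This is one plausible reason an amount-weighted interval such as [10.3, 28.9] pp is much wider than the count-based result would suggest.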

In addition to win rates, HOPM improved the mean Likert quality score from 3.18 to 4.40 and lowered the issue-flag rate from 15.3% to 5.2%. To support these findings, the study provides extensive review artifacts, including 770 generated-text reviews, 318 labeled reviewer exports, a calibration slice comprising 10 cases and 61 ratings, and an OCR benchmark of 70 cases and 350 ratings. These resources are intended to calibrate the interpretation of rubrics, guardrails, title risks, and OCR risks, rather than to replace the primary production ablation data. For reproducibility, the paper details the control setup, sample sizes, confidence intervals, paired statistical tests, prompt-token categories, pseudocode, schemas, rubrics, guardrail taxonomies, and constructed examples, ensuring the evaluation structure can be replicated without revealing proprietary evidence.
