Global News Digest

arXiv

Hierarchical Online Prompt Mutation with Dual-Loop Feedback for Guardrailed Evidence Document Generation: A Production-Evaluation Case Study

Abstract:

Language models that generate production documents in high-stakes settings must be adaptable, grounded in evidence, and fully auditable. This study introduces HOPM, a hierarchical online prompt mutation framework, and evaluates it in a real-world marketplace dispute-evidence workflow. HOPM treats prompts as dynamic policies: a family and version router selects the prompt for each case, deterministic guardrails map failures to mutable prompt-token groups, and a dual-loop feedback mechanism that combines human review with an automated judge refines both routing and mutation priorities.
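
The control loop described above can be pictured with a small sketch. This is an illustration only, not the paper's implementation: the epsilon-greedy selection rule, the failure codes, the token-group names, the blended reward, and the `human_weight` parameter are all assumptions introduced here, since the abstract does not specify them.

```python
import random
from collections import defaultdict

# Hypothetical mapping from guardrail failure codes to mutable prompt-token groups.
# Neither the codes nor the group names come from the paper.
FAILURE_TO_TOKEN_GROUP = {
    "unsupported_claim": "evidence_citation",
    "missing_amount": "amount_summary",
    "tone_violation": "tone",
}


class PromptRouter:
    """Toy router over prompt versions with dual-feedback updates (assumed design)."""

    def __init__(self, versions, epsilon=0.1, seed=0):
        self.versions = list(versions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.reward_sum = defaultdict(float)
        self.pulls = defaultdict(int)
        # Higher value = this token group should be mutated sooner.
        self.mutation_priority = defaultdict(float)

    def mean_reward(self, version):
        n = self.pulls[version]
        return self.reward_sum[version] / n if n else 0.0

    def select(self):
        # Epsilon-greedy stands in for whatever routing policy the paper uses.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.versions)
        return max(self.versions, key=self.mean_reward)

    def update(self, version, human_score, judge_score, failures, human_weight=0.7):
        # Dual feedback: blend a human review score and an automated judge
        # score (both assumed to lie in [0, 1]) into a single reward.
        reward = human_weight * human_score + (1.0 - human_weight) * judge_score
        self.reward_sum[version] += reward
        self.pulls[version] += 1
        # Deterministic guardrail failures raise the priority of the
        # prompt-token group they map to; unknown codes are ignored.
        for code in failures:
            group = FAILURE_TO_TOKEN_GROUP.get(code)
            if group is not None:
                self.mutation_priority[group] += 1.0
```

In this sketch, routing and mutation share one update step: each reviewed case adjusts the version's running reward and, through its guardrail failures, the ordering of token groups to mutate next.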

The core evidence is a matched production-evaluation ablation study. We assessed seven variants on an identical set of 600 cases, comparing HOPM against baselines that include static prompting, manual iteration, bandit-only routing, mutation-only adaptation, human-only feedback, and automated judge-only feedback. The complete HOPM system significantly outperforms the static control. It raised the count win rate from 34.7% to 45.7% (an increase of 11.0 percentage points, paired McNemar p-value of 1.31e-11) and the amount-weighted win rate from 22.3% to 41.4% (a gain of 19.1 percentage points, 95% paired bootstrap confidence interval of [10.3, 28.9] pp).
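
For readers unfamiliar with the two paired statistics quoted above, the following is a minimal sketch of how such quantities are commonly computed when every variant is scored on the same cases. It assumes an exact (binomial) two-sided McNemar test and a percentile bootstrap that resamples whole cases; the abstract does not say which McNemar variant or bootstrap scheme the authors used, and the function names are illustrative.

```python
import math
import random


def mcnemar_exact_p(variant_only_wins, control_only_wins):
    """Two-sided exact McNemar p-value from the two discordant-pair counts."""
    n = variant_only_wins + control_only_wins
    if n == 0:
        return 1.0
    k = min(variant_only_wins, control_only_wins)
    # Binomial(n, 0.5) tail probability of a split at least this lopsided.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def amount_weighted_win_rate(wins, amounts):
    """Share of total disputed amount that sits in won cases (amounts > 0)."""
    return sum(a for w, a in zip(wins, amounts) if w) / sum(amounts)


def paired_bootstrap_ci(control_wins, variant_wins, amounts,
                        n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the variant-minus-control amount-weighted win rate.

    Cases are resampled as units, so each draw keeps the control outcome,
    the variant outcome, and the amount of the same case together.
    """
    rng = random.Random(seed)
    n = len(amounts)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        amt = [amounts[i] for i in idx]
        control = amount_weighted_win_rate([control_wins[i] for i in idx], amt)
        variant = amount_weighted_win_rate([variant_wins[i] for i in idx], amt)
        diffs.append(variant - control)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

Because the amount-weighted rate is dominated by a few large cases, its interval can be much wider than the count-based one, which is consistent with the wide [10.3, 28.9] pp range reported for a 19.1 pp gain.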

In addition to win rates, HOPM improved the mean Likert quality score from 3.18 to 4.40 and lowered the issue-flag rate from 15.3% to 5.2%. To support these findings, the study provides extensive review artifacts, including 770 generated-text reviews, 318 labeled reviewer exports, a calibration slice comprising 10 cases and 61 ratings, and an OCR benchmark of 70 cases and 350 ratings. These resources are intended to calibrate the interpretation of rubrics, guardrails, title risks, and OCR risks, rather than to replace the primary production ablation data. For reproducibility, the paper details the control setup, sample sizes, confidence intervals, paired statistical tests, prompt-token categories, pseudocode, schemas, rubrics, guardrail taxonomies, and constructed examples, ensuring the evaluation structure can be replicated without revealing proprietary evidence.


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
