arXiv

Human-in-the-Loop Contextual Bandits for Short-Term Rental Dynamic Pricing: Structural Equivalence of Historical Warm-Up and Approval-Gated Live Learning

June 3, 2026 · Oleg Miroshnichenko

Title: Structural Parity Between Historical Initialization and Approval-Driven Live Learning in Human-in-the-Loop Contextual Bandits for Short-Term Rental Pricing

Original: arXiv:2606.02595v1. Abstract: Dynamic pricing in short-term rental (STR) markets presents a distinctive challenge for online learning algorithms: pricing decisions carry significant financial risk, operators require explainability, and market feedback is sparse (one booking outcome per listed night). We introduce the Human-in-the-Loop Gated Bandit (HITL-GB) framework, in which a contextual bandit algorithm generates price recommendations but a human agent retains authority to accept, modify, or reject each recommendation before it is applied. We show that under this approval constraint, historical pricing data -- collected under a prior deterministic policy -- is structurally equivalent to on-policy warm-up data for initialising the bandit's posterior, bypassing the weeks-to-months cold-start period that renders pure online bandit learning impractical in sparse-feedback markets. We formalise the approval-gated reward signal, derive a regularised ridge-regression warm-up procedure from historical episodes, and validate the approach on real STR production data (anonymised urban market, 2 rooms, April 2022 -- April 2026, 1,461 nightly pricing episodes). Our warm-up procedure compresses effective cold-start from ~150 episodes to ~30 episodes when initialising agents from the Hierarchical Factored Thompson Sampling (HF-TS) family. We further argue that the structural equivalence result is domain-agnostic: any high-stakes domain where human approval is legally or operationally required -- including clinical drug dosing, credit origination, content moderation, and radiological diagnosis -- satisfies the same conditions and benefits from the same warm-up strategy. In regulated industries, mandatory human oversight is thus a statistical asset rather than a deployment constraint.

Rewrite: Dynamic pricing in the short-term rental (STR) sector poses distinctive challenges for online learning systems: pricing decisions carry significant financial risk, operators require explainable recommendations, and feedback is sparse, with a single booking outcome for each listed night. To address these issues, we propose the Human-in-the-Loop Gated Bandit (HITL-GB) framework. A contextual bandit generates price recommendations, but a human operator retains authority to accept, modify, or reject each recommendation before it is applied.
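The approval gate described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Decision` and `apply_gate` are hypothetical, and the rule that a rejected recommendation leaves the current price unchanged is an assumption, since the abstract does not say what happens on rejection.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    """Hypothetical record of the operator's response to one recommendation."""
    action: str                    # "accept", "modify", or "reject"
    price: Optional[float] = None  # operator's own price when action == "modify"


def apply_gate(recommended: float, decision: Decision, current: float) -> float:
    """Return the price that goes live after human review."""
    if decision.action == "accept":
        return recommended
    if decision.action == "modify":
        if decision.price is None:
            raise ValueError("a 'modify' decision must carry a price")
        return decision.price
    if decision.action == "reject":
        # Assumption: a rejection keeps the price that was already live.
        return current
    raise ValueError(f"unknown action: {decision.action!r}")
```

In this sketch the bandit never writes a price directly; every applied price passes through `apply_gate`, which is the property the framework relies on.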

Our analysis shows that, under this approval constraint, historical pricing records collected under a prior deterministic policy are structurally equivalent to on-policy warm-up data for initializing the bandit’s posterior. This equivalence bypasses the weeks-to-months cold-start period that makes pure online bandit learning impractical in sparse-feedback markets. We formalize the approval-gated reward signal and derive a regularized ridge-regression warm-up procedure from historical episodes.
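To make the warm-up idea concrete, the sketch below fits a ridge regression to historical episodes and uses the result as a Gaussian posterior for Thompson sampling. It is not the paper's procedure: it assumes a plain linear reward model, and the function names, the feature encoding, the penalty `lam`, and the noise variance are all hypothetical. The hierarchical factored structure of HF-TS is not reproduced here.

```python
import numpy as np


def ridge_warm_up(contexts, rewards, lam=1.0):
    """Fit ridge regression on historical episodes.

    Returns (mean, precision) of a Gaussian posterior over linear reward
    weights, assuming a zero-mean prior with precision lam * I.
    """
    X = np.asarray(contexts, dtype=float)   # shape (n_episodes, n_features)
    y = np.asarray(rewards, dtype=float)    # shape (n_episodes,)
    d = X.shape[1]
    precision = lam * np.eye(d) + X.T @ X   # A = lam * I + X^T X
    mean = np.linalg.solve(precision, X.T @ y)  # ridge solution A^{-1} X^T y
    return mean, precision


def sample_weights(mean, precision, rng, noise_var=1.0):
    """Draw one Thompson sample from the warmed-up posterior."""
    cov = noise_var * np.linalg.inv(precision)
    return rng.multivariate_normal(mean, cov)
```

With no historical data the posterior reduces to the prior (zero mean, precision `lam * I`), so the same code path covers both the cold-start and the warmed-up agent.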

We validated the approach on anonymized production data from an urban STR market, covering two rooms from April 2022 to April 2026 (1,461 nightly pricing episodes). The warm-up procedure reduces the effective cold-start from approximately 150 episodes to approximately 30 when initializing agents from the Hierarchical Factored Thompson Sampling (HF-TS) family.

Furthermore, we argue that this structural equivalence is domain-agnostic. Any high-stakes domain where human approval is legally or operationally required, such as clinical drug dosing, credit origination, content moderation, and radiological diagnosis, satisfies the same conditions and benefits from the same warm-up strategy. In regulated industries, mandatory human oversight is therefore a statistical asset rather than a deployment constraint.

Source: arXiv Generated at: 2026-06-03 00:00:00 UTC