arXiv

Cross-Environment Neural Reranking for Sample-Efficient Action Selection in Text-Based Agents

June 2, 2026 · Kan Shao

Title: Enhancing Action Selection Efficiency in Text-Based Agents via Cross-Environment Neural Reranking

Abstract:

Large language model agents perform strongly on text-based benchmarks, but their high inference costs have spurred interest in compact neural rerankers for action selection. This study asks whether a single lightweight model can handle action selection across varied environments, which would remove the need to maintain a separate model for each setting. We jointly train DeBERTa-v3 (184M to 434M parameters) on the ALFWorld, WebShop, and ScienceWorld datasets and apply minority-class upsampling. Rebalanced joint training on two environments yields a net performance gain of +0.412 over single-environment ALFWorld baselines, while maintaining competitive results on WebShop (+0.214 compared to +0.249 for the single-environment approach).
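The abstract does not specify how the minority-class upsampling is carried out. As a minimal sketch of one common variant, the Python below resamples every smaller group with replacement until it matches the largest group. The function name `upsample_minority`, the `label` field, and the toy records are all hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict

def upsample_minority(examples, key, seed=0):
    """Duplicate examples from smaller groups (sampling with replacement)
    until every group is as large as the largest one.

    Illustrative assumption only: the paper may rebalance differently."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for ex in examples:
        groups[key(ex)].append(ex)

    target = max(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(members)  # keep every original example once
        # Draw the shortfall at random; k == 0 for the largest group.
        balanced.extend(rng.choices(members, k=target - len(members)))
    rng.shuffle(balanced)
    return balanced

# Toy data: six examples of label 0 and two of label 1 (hypothetical).
toy = [{"id": i, "label": 0} for i in range(6)]
toy += [{"id": 100 + i, "label": 1} for i in range(2)]

balanced = upsample_minority(toy, key=lambda ex: ex["label"])
counts = {label: sum(ex["label"] == label for ex in balanced) for label in (0, 1)}
print(len(balanced), counts)
```

Grouping by `key` means the same helper could rebalance by label or by environment. Which of the two the authors mean by "minority class" is not stated in the abstract.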

Expanding to three-environment training results in a mean combined net gain of +0.551 ± 0.024 across four random seeds. These outcomes closely match those of specialized single-environment models, demonstrating positive cross-domain transfer capabilities. Our findings indicate that this cross-environment adaptation is highly sample-efficient; fine-tuning on just 9.2% of the target-domain data restores 93% of the performance achieved with the full dataset. Furthermore, increasing model capacity offers limited advantages, suggesting that data diversity is the critical factor driving success.

We also evaluated environment-aware LoRA adapter routing combined with PCGrad, which achieved a top-seed score of +0.611 (seed 42). Other seeds performed at +0.554 (seed 456) and +0.559 (seed 789). However, this approach showed significant variance, with seed 123 dropping to +0.263, resulting in a four-seed mean of +0.497 ± 0.158. This indicates that while the method is promising, it currently lacks stability. Ultimately, joint training utilizing clean data splits and rebalancing proves essential. Upon acceptance, we will publish our three-environment benchmark, comprising 51,580 training instances (derived from 41,740 raw unique states with minority-class upsampling), along with all associated model checkpoints.
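The four-seed summary for the LoRA routing variant can be checked directly from the per-seed scores quoted above. The short Python sketch below recomputes the mean and spread. The abstract does not say which standard-deviation convention was used, so both are printed; the reported ± 0.158 is consistent with the sample (n - 1) estimator.

```python
import statistics

# Per-seed scores for LoRA adapter routing with PCGrad, as quoted in the abstract.
seed_scores = {42: 0.611, 123: 0.263, 456: 0.554, 789: 0.559}
values = list(seed_scores.values())

mean = statistics.mean(values)
sample_std = statistics.stdev(values)       # divides by n - 1
population_std = statistics.pstdev(values)  # divides by n

print(f"mean           = {mean:+.3f}")            # +0.497
print(f"sample std     = {sample_std:.3f}")       # 0.158
print(f"population std = {population_std:.3f}")   # 0.137
```

Most of the spread comes from seed 123: the other three seeds all fall between +0.554 and +0.611.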
