arXiv

Cross-Environment Neural Reranking for Sample-Efficient Action Selection in Text-Based Agents

Title: Enhancing Action Selection Efficiency in Text-Based Agents via Cross-Environment Neural Reranking

Abstract:

While large language model agents perform strongly on text-based benchmarks, their high inference cost has spurred interest in compact neural rerankers for action selection. This study asks whether a single lightweight model can handle action selection across varied environments, which would remove the need to maintain a separate model for each setting. We jointly train DeBERTa-v3 (184M to 434M parameters) on ALFWorld, WebShop, and ScienceWorld with minority-class upsampling. Rebalanced two-environment joint training yields a net gain of +0.412 over single-environment ALFWorld baselines while remaining competitive on WebShop (+0.214, compared to +0.249 for the single-environment approach).
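The abstract does not say how the minority-class upsampling is carried out. As a minimal sketch, assuming binary-labelled candidate actions and simple random duplication up to full class balance (both assumptions; the function name `upsample_minority` and the `label` key are hypothetical), the rebalancing step could look like this:

```python
import random
from collections import Counter


def upsample_minority(examples, label_key="label", seed=0):
    """Randomly duplicate examples of smaller classes until all classes
    are as large as the largest one (assumed scheme, not the paper's)."""
    if not examples:
        return []
    rng = random.Random(seed)

    # Group the examples by class label.
    by_label = {}
    for ex in examples:
        by_label.setdefault(ex[label_key], []).append(ex)

    target = max(len(group) for group in by_label.values())

    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap; k == 0 adds nothing.
        balanced.extend(rng.choices(group, k=target - len(group)))

    rng.shuffle(balanced)
    return balanced


if __name__ == "__main__":
    data = [{"label": 1}] * 2 + [{"label": 0}] * 5
    print(Counter(ex["label"] for ex in upsample_minority(data)))
```

The ratio actually used may stop short of full balance, so the target size here should be read as a placeholder.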

Expanding to three-environment training results in a mean combined net gain of +0.551 ± 0.024 across four random seeds. These outcomes closely match those of specialized single-environment models, demonstrating positive cross-domain transfer capabilities. Our findings indicate that this cross-environment adaptation is highly sample-efficient; fine-tuning on just 9.2% of the target-domain data restores 93% of the performance achieved with the full dataset. Furthermore, increasing model capacity offers limited advantages, suggesting that data diversity is the critical factor driving success.

We also evaluated environment-aware LoRA adapter routing combined with PCGrad, which achieved a top-seed score of +0.611 (seed 42). Other seeds performed at +0.554 (seed 456) and +0.559 (seed 789). However, this approach showed significant variance, with seed 123 dropping to +0.263, resulting in a four-seed mean of +0.497 ± 0.158. This indicates that while the method is promising, it currently lacks stability. Ultimately, joint training utilizing clean data splits and rebalancing proves essential. Upon acceptance, we will publish our three-environment benchmark, comprising 51,580 training instances (derived from 41,740 raw unique states with minority-class upsampling), along with all associated model checkpoints.
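The four-seed summary for the LoRA routing variant can be checked directly from the per-seed scores quoted above. The sketch below assumes the reported spread is the sample standard deviation (n - 1 in the denominator), which is the convention that reproduces the stated figure; the helper name `summarize` is hypothetical.

```python
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    values = list(scores.values())
    return statistics.mean(values), statistics.stdev(values)


# Per-seed net gains as quoted in the abstract.
lora_pcgrad = {42: 0.611, 123: 0.263, 456: 0.554, 789: 0.559}

mean, std = summarize(lora_pcgrad)
print(f"{mean:+.3f} ± {std:.3f}")  # prints "+0.497 ± 0.158"
```

Seed 123 alone accounts for most of the spread: the other three seeds lie within 0.057 of one another.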


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
