arXiv

When Offline Selectors Cannot Beat the Best Single Model: A Diagnostic Study on edX Dropout Prediction

June 4, 2026 · Tyler Crosse, Alan Nadelsticher Ruvalcaba, Dustin Khang LeDuc, Thomas Trask, Nicholas Lytle, David Joyner

Title: Why Offline Selectors Struggle to Outperform the Top Single Model: A Diagnostic Analysis of Dropout Prediction on edX

Abstract:

In principle, selecting the best predictor for each input should let an ensemble selection method surpass any single model. In practice, selectors trained offline frequently fail to exceed the strongest individual predictor. Before investing further in hyperparameter tuning, it is crucial to distinguish among three possible causes of this underperformance: an ill-suited learning algorithm, a state representation that fails to identify the winning model, or a discrepancy between the label distribution in the training buffer and that of the deployment environment.

This study introduces a three-stage diagnostic framework that isolates these causes using a shared data buffer. Stage 1 establishes a local upper bound on oracle recovery by analyzing label consistency via k-NN. Stage 2 evaluates whether specific offline reinforcement learning and behavioral cloning learners (BC, DQN, and CQL across various penalty weights) can reach this ceiling. Stage 3 ablates the selector state to determine whether richer features would improve performance. Taken together, the stages identify the most viable path forward: tuning the learner, redesigning the state representation, or acquiring new data.
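The Stage 1 idea can be illustrated with a minimal sketch. The abstract does not give the exact procedure, so everything here is an assumption for illustration: the function name `knn_label_consistency`, the leave-one-out protocol, Euclidean distance, and majority voting are hypothetical choices, not the authors' implementation. The score is the fraction of buffer points whose oracle label (the index of the best base model) matches the majority label of their nearest neighbours. A low score suggests that nearby states disagree on which model wins.

```python
import math
from collections import Counter


def knn_label_consistency(states, oracle_labels, k=3):
    """Leave-one-out k-NN agreement between each point's oracle label
    and the majority oracle label of its k nearest neighbours.

    Hypothetical helper: an assumed proxy for a local ceiling on
    oracle recovery, not the paper's exact Stage 1 procedure.
    """
    n = len(states)
    if n <= k:
        raise ValueError("need more points than neighbours")
    hits = 0
    for i in range(n):
        # Distances from point i to every other point, nearest first.
        dists = sorted(
            (math.dist(states[i], states[j]), j) for j in range(n) if j != i
        )
        neighbour_labels = [oracle_labels[j] for _, j in dists[:k]]
        majority, _ = Counter(neighbour_labels).most_common(1)[0]
        hits += majority == oracle_labels[i]
    return hits / n


# Two well-separated clusters, each with a single winning model.
clean_states = [(0, 0), (0, 1), (1, 0), (1, 1),
                (10, 10), (10, 11), (11, 10), (11, 11)]
clean_labels = [0, 0, 0, 0, 1, 1, 1, 1]
clean_score = knn_label_consistency(clean_states, clean_labels, k=3)

# Identical states with conflicting winners: the state cannot tell them apart.
ambiguous_states = [(0, 0)] * 4
ambiguous_labels = [0, 1, 0, 1]
ambiguous_score = knn_label_consistency(ambiguous_states, ambiguous_labels, k=3)
```

In the second toy case the state carries no information about the winning model, which is the kind of local ambiguity that no choice of learner could resolve.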

We apply this diagnostic to the problem of selecting among five dropout-prediction models using edX clickstream data. Averaged over 16 time windows, the oracle improves accuracy by 9.7 points over the best single base model, yet the offline learners (BC, DQN, and CQL) consistently underperform, landing in a lower accuracy band. This gap persists across a tenfold variation in buffer size and on 2,000 held-out test examples.
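The oracle headroom quoted above can be made concrete with a small sketch, assuming a per-example correctness matrix for the base models. The function name `oracle_headroom` and the toy data are hypothetical, and the sketch covers a single window rather than the average over 16 time windows reported in the study. The oracle is counted as correct on an example whenever at least one base model is correct.

```python
def oracle_headroom(correct):
    """Return (best_single_accuracy, oracle_accuracy, gap).

    `correct[m][i]` is truthy when base model m classifies example i
    correctly. Hypothetical helper for illustration only.
    """
    n_examples = len(correct[0])
    if any(len(row) != n_examples for row in correct):
        raise ValueError("every model needs one entry per example")
    # Accuracy of the strongest standalone model.
    best_single = max(sum(map(bool, row)) / n_examples for row in correct)
    # The oracle picks, per example, any model that gets it right.
    oracle = sum(any(col) for col in zip(*correct)) / n_examples
    return best_single, oracle, oracle - best_single


# Toy data: three models, four examples, each example solved by some model.
toy_correct = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
best_single, oracle, gap = oracle_headroom(toy_correct)
```

A large gap only shows that a perfect selector would help. Whether a learned selector can recover it depends on whether the state identifies the winning model.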

The analysis points to local representational ambiguity as the primary bottleneck. CQL narrows the imitation gap, but the narrowing does not translate into deployment gains, which indicates that model conservatism is not the culprit. Regret distributions cluster tightly across all learners, which suggests that tie-breaking mechanisms are not the issue. The learners also converge on similar test accuracies, which rules out buffer-to-deployment label shift as the cause. Further tuning of the offline learner is therefore unlikely to yield improvements; future efforts should instead focus on enhancing the state representation or collecting additional data.
