Conditional Hypothesis Generation for LLM-Based Text Analysis with Researcher-Specified Covariates
Title: Leveraging Researcher-Defined Covariates for Conditional Hypothesis Generation in LLM-Driven Text Analysis
Abstract
A fundamental objective in computational social science is to identify interpretable variations in language associated with specific outcomes, such as instructional quality or political alignment. While recent approaches utilizing large language models (LLMs) have described these differences in natural language, they typically prioritize globally discriminative patterns. In doing so, they often overlook covariates that researchers know, from domain expertise, to influence the data. Ignoring such covariates can lead to the selection of patterns that reflect confounding variables rather than phenomena of substantive interest.
To address this, we present a framework for conditional hypothesis generation that integrates researcher-specified covariates, thereby directing the discovery process toward distinctions that persist within relevant subgroups. This approach confronts two primary challenges: stratum imbalance, where target subgroups are underrepresented, and sign reversal, where the direction of a difference flips across subgroups. We introduce two methods inspired by econometrics to tackle these issues. The first method employs feature–covariate interactions to identify sign reversals, while the second utilizes within-stratum demeaning and inverse-frequency reweighting to balance underrepresented strata. Our synthetic experiments demonstrate that each method surpasses global baselines in its respective targeted context. Furthermore, expert evaluations conducted on two real-world datasets confirm that generating hypotheses with awareness of covariates yields more valuable insights within pertinent subgroups.
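The two econometrics-inspired ideas named above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names are hypothetical, and it assumes each document already has a numeric feature score `x` (for example, an LLM's judgment of whether a candidate hypothesis holds), an outcome `y`, and a stratum label `g` taken from the researcher-specified covariate.

```python
import numpy as np


def per_stratum_slopes(x, y, g):
    """Fit y on stratum intercepts plus feature-by-stratum interactions.

    Returns a dict mapping each stratum to the slope of y on x inside it.
    (Hypothetical helper for illustration only.)
    """
    x, y, g = np.asarray(x, float), np.asarray(y, float), np.asarray(g)
    strata = np.unique(g)
    dummies = (g[:, None] == strata[None, :]).astype(float)  # one column per stratum
    # Columns: stratum intercepts, then x interacted with each stratum dummy.
    design = np.hstack([dummies, dummies * x[:, None]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    k = len(strata)
    return {s: float(coef[k + i]) for i, s in enumerate(strata)}


def has_sign_reversal(slopes, tol=1e-8):
    """True if at least one stratum slope is positive and another is negative."""
    vals = np.array(list(slopes.values()))
    return bool((vals > tol).any() and (vals < -tol).any())


def balanced_within_slope(x, y, g):
    """Slope of y on x after within-stratum demeaning, with each observation
    weighted by 1 / (size of its stratum) so small strata are not swamped."""
    x, y, g = np.asarray(x, float), np.asarray(y, float), np.asarray(g)
    _, inv, counts = np.unique(g, return_inverse=True, return_counts=True)
    x_dm = x - (np.bincount(inv, weights=x) / counts)[inv]  # demean x per stratum
    y_dm = y - (np.bincount(inv, weights=y) / counts)[inv]  # demean y per stratum
    w = 1.0 / counts[inv]                                   # inverse-frequency weights
    return float(np.sum(w * x_dm * y_dm) / np.sum(w * x_dm * x_dm))
```

On a toy input where a four-document stratum has slope +1 and a two-document stratum has slope -1, `has_sign_reversal` flags the flip and `balanced_within_slope` returns 0, because the smaller stratum is weighted up to count equally with the larger one.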
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



