Data Enrichment for Symbolic Regression Using Diffusion Models
Title: Enhancing Symbolic Regression Through Physics-Guided Diffusion-Based Data Enrichment
Abstract:
Symbolic regression (SR) is a powerful tool for scientific discovery, turning observational data into interpretable governing equations. Its robustness deteriorates significantly, however, when spatiotemporal measurements are sparse, noisy, or physically incomplete, conditions that are common in real-world settings. Data enrichment (DE) can address these limitations, but supplementary samples can distort equation discovery if they fail to preserve the physical structure of the target system. Effective DE therefore demands domain knowledge and technical proficiency, which often restricts its practical use.
To address these challenges, this study presents a physics-guided latent diffusion framework that enriches data for downstream SR models. The framework combines a variational autoencoder, a conditional latent diffusion model, and a physics-informed residual corrector. Together, these components generate synthetic fields that complete sparse observations while adhering to the governing physical relations. We assess the approach on three test cases (heat conduction, incompressible Navier-Stokes flow, and a moving single-mass Newtonian gravitational potential), employing GPLearn, DEAP, and PySR as the respective symbolic regression backends.
The findings indicate that physics-corrected enrichment consistently improves equation recovery in sparse data regimes across the tested physical dynamics and SR algorithms. These outcomes suggest that generative enrichment can strengthen equation discovery without requiring extensive additional domain expertise.
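The idea behind a physics-informed residual corrector can be illustrated with a toy example. The sketch below is not the framework's implementation: it assumes steady-state 1D heat conduction (u_xx = 0), stands in for the generative model with a noisy field, and uses plain Jacobi relaxation as a substitute correction step. The names `laplace_residual` and `physics_correct` are hypothetical and chosen only for illustration.

```python
import numpy as np


def laplace_residual(u):
    """Discrete residual of u_xx = 0 at interior nodes (grid spacing folded in)."""
    return u[:-2] - 2.0 * u[1:-1] + u[2:]


def physics_correct(u_generated, observed_mask, observed_values, n_iters=500):
    """Relax a generated field toward u_xx = 0 while pinning observed samples.

    Assumption: Jacobi relaxation stands in for the residual corrector
    described in the abstract; it is not the authors' method.
    """
    u = u_generated.copy()
    u[observed_mask] = observed_values[observed_mask]  # trust the measurements
    free = ~observed_mask
    free[0] = free[-1] = False  # boundary nodes are never updated
    for _ in range(n_iters):
        avg = u.copy()
        avg[1:-1] = 0.5 * (u[:-2] + u[2:])  # Jacobi update from neighbours
        u[free] = avg[free]  # only unobserved interior nodes move
    return u


# Toy setup: a linear steady-state temperature profile observed at 5 of 33 nodes.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 33)
u_true = 1.0 + 2.0 * x
observed_mask = np.zeros(x.size, dtype=bool)
observed_mask[::8] = True
observed_values = np.where(observed_mask, u_true, 0.0)

# Stand-in for a generated sample: the true field plus noise.
u_generated = u_true + 0.1 * rng.standard_normal(x.size)

u_corrected = physics_correct(u_generated, observed_mask, observed_values)
res_before = np.abs(laplace_residual(u_generated)).max()
res_after = np.abs(laplace_residual(u_corrected)).max()
print(f"max residual before: {res_before:.3e}, after: {res_after:.3e}")
```

The corrected field keeps the sparse measurements untouched and drives the physics residual down elsewhere, which is the property an enriched dataset would need before being handed to an SR backend.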
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





