Annotation-Informed Block-Sparse Bayesian Modeling for cis-Expression Prediction
Title: Leveraging Annotation Data for Block-Sparse Bayesian Modeling to Predict cis-Expression
Accurate modeling of local regulatory architecture is the cornerstone of genotype-based cis-expression prediction. To capture this architecture more faithfully, we introduce the block-sparse Bayesian sparse linear mixed model (bsBSLMM). The method extends the existing Bayesian sparse linear mixed model (BSLMM) framework with two innovations: spike-and-slab sparsity defined at the level of linkage disequilibrium (LD) blocks, and a prior for SNP inclusion informed by distance to the transcription start site (TSS).
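The two ingredients named above can be illustrated with a small prior-sampling sketch. It is not the bsBSLMM implementation: the exponential decay of inclusion probability with TSS distance, the averaging of SNP-level priors into a block-level probability, and the names `tss_prior`, `sample_block_sparse_effects`, `base`, `scale_bp`, and `slab_sd` are all assumptions made for illustration, since the abstract does not give the actual parametrization.

```python
import math
import random


def tss_prior(dist_bp, base=0.5, scale_bp=100_000.0):
    """Assumed form: inclusion probability decays exponentially with |distance to TSS|."""
    return base * math.exp(-abs(dist_bp) / scale_bp)


def sample_block_sparse_effects(tss_dist, block_ids, slab_sd=0.1, rng=None):
    """Draw SNP effects from a block-level spike-and-slab prior (illustrative only).

    tss_dist  : signed distance of each SNP to the TSS, in base pairs
    block_ids : LD-block label of each SNP, same length as tss_dist
    """
    if rng is None:
        rng = random.Random(0)

    # Group SNP indices by LD block.
    blocks = {}
    for j, b in enumerate(block_ids):
        blocks.setdefault(b, []).append(j)

    beta = [0.0] * len(tss_dist)
    included = {}
    for b, idx in blocks.items():
        # Assumption: block inclusion probability is the mean SNP-level TSS prior.
        pi_b = sum(tss_prior(tss_dist[j]) for j in idx) / len(idx)
        included[b] = rng.random() < pi_b
        if included[b]:
            # Slab: every SNP in an included block gets a Gaussian effect.
            for j in idx:
                beta[j] = rng.gauss(0.0, slab_sd)
        # Spike: SNPs in excluded blocks keep an effect of exactly zero.
    return beta, included


if __name__ == "__main__":
    # Made-up toy input: five SNPs in three LD blocks around a TSS at position 0.
    dist = [-500_000, -400_000, -1_000, 2_000, 300_000]
    blocks = [0, 0, 1, 1, 2]
    beta, included = sample_block_sparse_effects(dist, blocks)
    print(included)
    print(beta)
```

Under these assumptions, blocks close to the TSS are switched on more often than distal ones, and sparsity is imposed on whole LD blocks rather than on individual SNPs.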
In an evaluation of 23,098 genes from GEUVADIS lymphoblastoid cell lines of European ancestry, bsBSLMM outperformed several established methods, including BSLMM, LASSO, BLUP, TIGAR elastic net, and TIGAR Dirichlet-process regression. Under consistent evaluation criteria, bsBSLMM retained more predictable genes. In a direct comparison with BSLMM, bsBSLMM achieved better held-out prediction performance for the majority of genes that both methods predicted. These improvements were attributed primarily to LD-block sparsity, with additional gains from the TSS-informed prior.
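The comparison described above can be sketched as follows. This is a hedged illustration, not the study's evaluation pipeline: it assumes held-out performance is measured as the squared Pearson correlation between observed and predicted expression, and that a gene counts as "predictable" when this value exceeds a cutoff. The cutoff of 0.01 and the names `r_squared` and `compare_methods` are assumptions, as the abstract does not state the criteria used.

```python
def r_squared(y_true, y_pred):
    """Squared Pearson correlation between observed and predicted expression."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        return 0.0  # constant vector: correlation is undefined, treat as no signal
    return cov * cov / (var_t * var_p)


def compare_methods(r2_a, r2_b, threshold=0.01):
    """Compare two methods given dicts mapping gene -> held-out R^2.

    Returns (genes retained by A, genes retained by B,
             fraction of shared retained genes where A beats B).
    The threshold is a hypothetical cutoff chosen for illustration.
    """
    kept_a = {g for g, r in r2_a.items() if r > threshold}
    kept_b = {g for g, r in r2_b.items() if r > threshold}
    shared = kept_a & kept_b
    wins = sum(r2_a[g] > r2_b[g] for g in shared)
    frac = wins / len(shared) if shared else float("nan")
    return len(kept_a), len(kept_b), frac


if __name__ == "__main__":
    # Made-up numbers purely to exercise the functions; not results from the study.
    toy_a = {"g1": 0.20, "g2": 0.05, "g3": 0.005, "g4": 0.30}
    toy_b = {"g1": 0.10, "g2": 0.08, "g3": 0.02}
    print(compare_methods(toy_a, toy_b))
```

Applying the same threshold and the same held-out samples to both methods is what makes the retained-gene counts and the per-gene win fraction comparable.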
The biological relevance of the variants selected by bsBSLMM was evident in their stronger enrichment within regulatory regions, specifically GM12878 DNase and H3K27ac sites, compared to variants chosen by the standard BSLMM. Furthermore, in transcriptome-wide association study (TWAS) analyses, bsBSLMM not only recovered known inflammatory bowel disease signals, such as those linked to IL23R, but also identified additional genome-wide significant genes that BSLMM failed to detect.
The robustness of these findings was confirmed through independent validation in the Louisiana Osteoporosis Study. This analysis replicated the increased prediction yield across diverse ancestries and uncovered biologically significant bone mineral density pathways in subsequent TWAS and gene set enrichment analyses. Collectively, these results indicate that integrating LD-block structures and biologically grounded SNP priors significantly enhances both cis-expression prediction accuracy and the discovery power of downstream TWAS.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





