LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling
Title: LDARNet: A DNA Adaptive Representation Network Utilizing Learnable Tokenization for Genomic Modeling
Abstract:
Genomic foundation models increasingly mirror the architectures of large language models, yet they still depend predominantly on static tokenization: $k$-mers, Byte Pair Encoding (BPE), or individual nucleotides. These fixed schemes impose arbitrary sequence divisions that can mask biologically significant structural features. To address this, we introduce LDARNet, a hierarchical genomic foundation model with 120 million parameters. The architecture adapts the dynamic chunking mechanism of H-Net, originally designed for autoregressive generation, to masked language modeling. LDARNet combines BiMamba-2 state-space layers with local attention, bidirectional routing, and a ratio-based regularizer so that token boundaries form adaptively and without supervision.
Evaluated on 27 tasks from the Genomic Benchmarks and Nucleotide Transformer suites, LDARNet secured 11 of 18 wins among compact models (those with fewer than 300 million parameters). It also achieved state-of-the-art performance on five histone modification tasks, surpassing models up to 20 times larger. A FLOPs-matched controlled experiment identified learned routing as the primary driver of these improvements: at equal computational cost, learned boundaries outperformed fixed-grid boundaries by as much as 14 percentage points on histone tasks. Furthermore, nucleotide-resolution analysis showed that the boundaries, learned without supervision, align with canonical promoter motifs and splice junctions, offering a biological interpretation of why adaptive tokenization is effective in genomic foundation models.
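The abstract does not spell out how boundaries are scored or how the ratio-based regularizer is defined, so the following is only a minimal sketch under stated assumptions. It assumes an H-Net-style routing rule, in which a position becomes a boundary when its representation differs sharply from the previous one, and an H-Net-style ratio loss that is smallest when roughly one in `target_ratio` positions is a boundary. The function names, the use of raw state vectors in place of learned projections, and the 0.5 threshold are all illustrative choices, not details taken from LDARNet.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def boundary_probabilities(states):
    """Assumed routing rule: p_t = 0.5 * (1 - cos(h_t, h_{t-1})).

    The first position is always treated as a boundary (p_0 = 1).
    A real model would compare learned projections, not raw states.
    """
    probs = [1.0]
    for prev, cur in zip(states, states[1:]):
        probs.append(0.5 * (1.0 - cosine_similarity(cur, prev)))
    return probs


def ratio_regularizer(probs, target_ratio, threshold=0.5):
    """Assumed ratio loss: N/(N-1) * ((N-1) * F * G + (1-F) * (1-G)).

    F is the fraction of positions selected as boundaries (hard decision),
    G is the mean boundary probability, N is the target compression ratio.
    The loss equals 1 when F = G = 1/N and grows as boundaries become
    denser than the target.
    """
    n = float(target_ratio)
    hard = [1.0 if p >= threshold else 0.0 for p in probs]
    f = sum(hard) / len(hard)
    g = sum(probs) / len(probs)
    return n / (n - 1.0) * ((n - 1.0) * f * g + (1.0 - f) * (1.0 - g))


if __name__ == "__main__":
    # Four identical states: only the first position is a boundary.
    smooth = [[1.0, 0.0]] * 4
    # Alternating opposite states: every position is a boundary.
    choppy = [[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
    for name, states in (("smooth", smooth), ("choppy", choppy)):
        probs = boundary_probabilities(states)
        print(name, ratio_regularizer(probs, target_ratio=4))
```

With a target ratio of 4, the smooth sequence yields one boundary in four positions and sits at the minimum of the assumed loss, while the choppy sequence marks every position as a boundary and is penalized. A fixed-grid baseline, by contrast, would place a boundary every fourth position regardless of content.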
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC





