arXiv

Pretraining Language Models on Historical Text

June 3, 2026 · Xiaoxi Luo, Zachary Shinnick, Niclas Griesshaber, Yixuan Wang, Junchi Yu, Freda Shi, Philip Torr, Yao Lu

Title: Training Language Models on Historical Text

Abstract:

This paper presents TypewriterLM, a 7.24-billion-parameter language model trained solely on English texts published before 1913. Developing History LMs requires addressing several challenges: ensuring data quality and accessibility, avoiding temporal leakage, building post-training pipelines that preserve temporal consistency, and establishing robust evaluation methods. To address these challenges, we compile TypewriterCorpus, a 54B-token dataset drawn from varied archival and linguistically annotated sources. The corpus underwent rigorous cleaning and leakage mitigation. We also propose lexically grounded instruction tuning, a post-training approach that keeps model responses strictly anchored in historical source documents. Using this framework, we create two new instruction tuning datasets for historical contexts: History-LIMA and History-SelfInstruct. To assess both capability and temporal consistency, we introduce History-Event, a benchmark suite designed to measure competence and temporal grounding and to detect data leakage. We release TypewriterLM and all related resources publicly to facilitate ongoing research into historical language models.
