Digging Up Citations: FOSSIL, a Dataset and Workflow for Reference Extraction in Law and the Humanities
Title: Unearthing Citations: FOSSIL, a Dataset and Workflow for Reference Extraction in Law and the Humanities
Citation extraction systems are typically optimized for the structured bibliographies found at the end of natural science papers. Scholarship in the humanities and law, however, relies heavily on footnotes, where bibliographic details are mixed with commentary and cross-references and vary considerably across languages and formatting styles. To address the lack of high-quality training data for these formats, we introduce FOSSIL (Footnote-based Open-access SSH Scientific Instance Labels), an openly licensed, multilingual dataset of 96 annotated scholarly articles containing more than 7,600 references embedded in footnotes.
The release also includes three companion resources: PDF-TEI Editor, a collaborative web-based annotation tool; a documented workflow involving seven annotators; and a specialized version of Grobid tailored for footnote-based citations. In our end-to-end evaluation, this specialized pipeline significantly outperforms the default Grobid model, roughly doubling extraction quality by raising micro-F1 from 0.36 to 0.72, a gain driven primarily by better recall. The results nevertheless leave considerable room for improvement, particularly for cross-references and footnotes containing mixed content. This extended abstract describes work in progress; ongoing efforts focus on the segmentation, parsing, and resolution of citations.
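As a reading aid for the micro-F1 figures above, the following minimal sketch shows how a micro-averaged score pools match counts over all documents before computing precision, recall, and F1. The `Counts` class, the `micro_prf` function, and the example counts are hypothetical illustrations, not FOSSIL data or the authors' evaluation code, and the criterion for when an extracted reference "matches" a gold reference is assumed rather than taken from the source.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Per-document match counts (hypothetical illustration values)."""
    tp: int  # extracted references that match a gold reference
    fp: int  # extracted references with no gold match
    fn: int  # gold references the system missed


def micro_prf(docs):
    """Pool counts over all documents, then compute precision, recall, F1."""
    tp = sum(d.tp for d in docs)
    fp = sum(d.fp for d in docs)
    fn = sum(d.fn for d in docs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up counts for two documents; these are not FOSSIL results.
docs = [Counts(tp=8, fp=2, fn=2), Counts(tp=1, fp=0, fn=7)]
precision, recall, f1 = micro_prf(docs)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Because the counts are pooled, a document with many missed references (the second one here) pulls recall down for the whole corpus, which is why a gain in recall can account for most of a micro-F1 improvement.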
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





