arXiv

Digging Up Citations: FOSSIL, a Dataset and Workflow for Reference Extraction in Law and the Humanities

Title: Unearthing Citations: FOSSIL, a Dataset and Workflow for Reference Extraction in Law and the Humanities

Citation extraction systems are typically optimized for the structured bibliographies found at the end of natural science papers. Scholarship in the humanities and law, however, relies heavily on footnotes, where bibliographic details are mixed with commentary and cross-references and vary widely across languages and formatting styles. To address the lack of high-quality training data for these formats, we introduce FOSSIL (Footnote-based Open-access SSH Scientific Instance Labels), an openly licensed, multilingual dataset of 96 annotated scholarly articles containing more than 7,600 references embedded in footnotes.

The release also includes PDF-TEI Editor (a collaborative web-based annotation tool), a documented workflow involving seven annotators, and a specialized version of Grobid tailored for footnote-based citations. In our end-to-end evaluation, the specialized pipeline raises micro-F1 from 0.36 with the default Grobid model to 0.72, roughly doubling extraction quality. The improvement is primarily attributed to better recall. Nevertheless, the results leave considerable room for improvement, particularly for cross-references and footnotes containing mixed content. This extended abstract describes work in progress; ongoing efforts focus on the segmentation, parsing, and resolution of citations.
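
For readers unfamiliar with the metric, the following is a minimal sketch of how a micro-averaged F1 score is computed: true positives, false positives, and false negatives are pooled over all documents before precision and recall are taken. The `Counts` class, the `micro_f1` function, and every number below are hypothetical illustrations; they are not the FOSSIL evaluation code or its data, and the abstract does not specify how a predicted reference is matched to a gold one.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Per-document reference counts (hypothetical structure)."""
    tp: int  # predicted references that match a gold reference
    fp: int  # predicted references with no gold match
    fn: int  # gold references that were not extracted


def micro_f1(per_doc):
    """Pool counts over all documents, then compute precision, recall, F1."""
    tp = sum(c.tp for c in per_doc)
    fp = sum(c.fp for c in per_doc)
    fn = sum(c.fn for c in per_doc)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented counts for two documents under two hypothetical systems.
baseline = [Counts(tp=20, fp=5, fn=80), Counts(tp=10, fp=5, fn=40)]
tuned = [Counts(tp=70, fp=25, fn=30), Counts(tp=35, fp=15, fn=15)]

for name, counts in (("baseline", baseline), ("tuned", tuned)):
    p, r, f = micro_f1(counts)
    print(f"{name}: precision={p:.2f} recall={r:.2f} micro-F1={f:.2f}")
```

In this invented example the gain in F1 comes mostly from recall while precision stays close to its starting value, which is the kind of pattern the abstract describes. Because counts are pooled, documents with many footnote references weigh more heavily in the score than short ones.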


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
