arXiv

LCSHBench: A Multilingual, Consensus-Grounded Benchmark for Library of Congress Subject Heading Assignment

June 4, 2026 · Kwok Leong Tang

Abstract

Automated subject cataloging assigns controlled-vocabulary headings to bibliographic records, yet the Library of Congress Subject Headings (LCSH) currently lack a standard public benchmark. To address this gap, we present LCSHBench, a dataset of 22,346 books in 15 languages, sourced from the openly licensed catalogs of Harvard, Columbia, and Princeton universities. The inclusion criterion is strict: a record is included only if at least two independent cataloging agencies have assigned LCSH headings to it. Each record carries per-catalog provenance and two answer views: the union of the assigned headings and the unanimous subset.
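The inclusion rule and the two answer views can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: each catalog is treated as one cataloging agency, "unanimous" is read as the intersection over the catalogs that assigned any headings, and the function name `consensus_views` and the example record are hypothetical.

```python
from typing import Dict, FrozenSet, Optional, Tuple

Headings = FrozenSet[str]


def consensus_views(
    per_catalog: Dict[str, Headings], min_agencies: int = 2
) -> Optional[Tuple[Headings, Headings]]:
    """Return (union, unanimous) heading views for one work.

    Returns None when fewer than `min_agencies` catalogs assigned headings.
    Assumption: one catalog counts as one independent cataloging agency.
    """
    # Keep only catalogs that actually assigned at least one heading.
    assigned = [headings for headings in per_catalog.values() if headings]
    if len(assigned) < min_agencies:
        return None
    union = frozenset().union(*assigned)            # any catalog assigned it
    unanimous = frozenset.intersection(*assigned)   # every contributing catalog assigned it
    return union, unanimous


# Hypothetical record: two catalogs assigned headings, one assigned none.
record = {
    "harvard": frozenset({"Cataloging", "Subject headings"}),
    "columbia": frozenset({"Cataloging"}),
    "princeton": frozenset(),
}
views = consensus_views(record)

# Only one contributing catalog, so the record would be excluded.
single = consensus_views({"harvard": frozenset({"Cataloging"})})
```

Under these assumptions the example record is kept, with a two-heading union view and a one-heading unanimous view, while the single-catalog record is excluded.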

This design is motivated by a concordance study of 465,187 works cataloged by all three libraries. Libraries frequently diverge in the exact wording of headings (only 39.4% of works have identical heading sets), but they largely converge on the underlying topics: 93.3% of works share a concept-level heading. LCSHBench therefore scores both exact and concept matches, using set and rank metrics stratified by language and heading type, across two tasks: open-vocabulary generation and full-vocabulary retrieval.
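The difference between exact and concept matching can be sketched with set-based precision, recall, and F1. The abstract does not define how a heading maps to a concept, so the rule below (take the main heading before the first `--` subdivision separator, lowercased) is only an illustrative assumption, as are the helper names `concept` and `set_prf` and the example headings.

```python
from typing import Set, Tuple


def concept(heading: str) -> str:
    """Map a heading to a concept key.

    Assumption: the concept is the main heading before the first "--"
    subdivision. The benchmark's actual concept mapping may differ.
    """
    return heading.split("--")[0].strip().lower()


def set_prf(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 between two label sets."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical headings for one work.
gold = {"Libraries--Automation", "Cataloging--Data processing"}
predicted = {"Libraries--History", "Cataloging--Data processing"}

# Exact view: strings must match in full.
exact_scores = set_prf(predicted, gold)

# Concept view: compare the sets of concept keys instead.
concept_scores = set_prf({concept(h) for h in predicted}, {concept(h) for h in gold})
```

In this example the exact view credits one of two headings, while the concept view credits both, which mirrors the gap between exact and concept-level agreement reported in the concordance study.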

In an initial demonstration, a low-rank fine-tuned 300M-parameter on-device embedder improved cross-lingual retrieval. On the development set it surpassed a 3,072-dimensional hosted embedder, reaching an exact recall@200 of 0.659 versus 0.623. However, the language panel shows that these gains are not uniform across languages. Validation on held-out test data and end-to-end confirmation remain priorities for future work.
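For readers unfamiliar with the metric, recall@k for the retrieval task can be sketched as follows. This is the standard per-record definition (the share of gold headings found among the top k retrieved candidates); how the benchmark averages it across records is not stated in the abstract, and the function name and example data are hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of gold headings that appear in the top-k ranked candidates."""
    if not gold:
        return 0.0
    top_k = set(ranked[:k])
    return len(top_k & gold) / len(gold)


# Hypothetical ranking from an embedder over a tiny vocabulary.
ranked = ["a", "b", "c", "d"]
gold = {"b", "d", "z"}

r_at_2 = recall_at_k(ranked, gold, k=2)  # only "b" is in the top 2
r_at_4 = recall_at_k(ranked, gold, k=4)  # "b" and "d" are in the top 4
```

With k = 200 and exact string matching against the gold headings, this corresponds to the "exact recall@200" figure quoted above, up to the averaging scheme.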
