arXiv

LCSHBench: A Multilingual, Consensus-Grounded Benchmark for Library of Congress Subject Heading Assignment

Abstract

Automated subject cataloging involves assigning controlled-vocabulary headings to bibliographic records; however, the Library of Congress Subject Headings (LCSH) framework currently lacks a standard public benchmark. To address this gap, we present LCSHBench, a dataset comprising 22,346 books across 15 languages, sourced from the openly licensed catalogs of Harvard, Columbia, and Princeton universities. Our inclusion criteria are rigorous: a record is added only if at least two independent cataloging agencies have assigned LCSH headings. We provide per-catalog provenance data alongside both union and unanimous answer views.
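The inclusion rule and the two answer views can be illustrated with a minimal sketch. It assumes each record is a mapping from a catalog name to the set of LCSH heading strings that catalog assigned, and it treats each catalog as one cataloging agency, which is a simplification. The function name `build_views`, the record layout, and the reading of "unanimous" as "assigned by every contributing catalog" are hypothetical, not the benchmark's actual schema.

```python
from typing import Dict, Optional, Set


def build_views(per_catalog: Dict[str, Set[str]],
                min_agencies: int = 2) -> Optional[dict]:
    """Return provenance plus union and unanimous views for one record.

    Returns None when fewer than `min_agencies` catalogs assigned any
    heading, mirroring the "at least two agencies" inclusion rule.
    ASSUMPTION: one catalog == one agency; the real benchmark may differ.
    """
    # Keep only catalogs that actually assigned at least one heading.
    assigned = {name: heads for name, heads in per_catalog.items() if heads}
    if len(assigned) < min_agencies:
        return None

    heading_sets = list(assigned.values())
    return {
        "provenance": assigned,                        # who assigned what
        "union": set.union(*heading_sets),             # any catalog
        "unanimous": set.intersection(*heading_sets),  # every contributing catalog
    }


record = {
    "harvard": {"Whales", "Marine mammals"},
    "columbia": {"Whales"},
    "princeton": set(),  # no LCSH headings in this catalog
}
views = build_views(record)
```

In this sketch the union view rewards any heading that some cataloger chose, while the unanimous view keeps only headings on which all contributing catalogs agree.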

A concordance study of 465,187 works cataloged by all three libraries motivates this design. Libraries frequently diverge in exact heading expression (only 39.4% of works carry identical heading sets), yet they largely converge on underlying topics, with 93.3% of works sharing a concept-level heading. Consequently, LCSHBench scores both exact and concept matches, using set and rank metrics stratified by language and heading type, across two tasks: open-vocabulary generation and full-vocabulary retrieval.
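The difference between exact and concept matching can be sketched as follows. The abstract does not define how a concept is derived from a heading, so this example assumes, purely for illustration, that the concept is the main heading before the first `--` subdivision, compared case-insensitively. The helper names `concept_key`, `set_prf`, and `score` are hypothetical.

```python
from typing import Iterable, Tuple


def concept_key(heading: str) -> str:
    # ASSUMPTION: the concept is the main heading before the first "--"
    # subdivision; the benchmark's real concept mapping may differ.
    return heading.split("--")[0].strip().lower()


def set_prf(predicted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 between two heading collections."""
    pred, ref = set(predicted), set(gold)
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def score(predicted: Iterable[str], gold: Iterable[str]) -> dict:
    """Score one record at both the exact-string and concept level."""
    predicted, gold = list(predicted), list(gold)
    return {
        "exact": set_prf(predicted, gold),
        "concept": set_prf(
            {concept_key(h) for h in predicted},
            {concept_key(h) for h in gold},
        ),
    }


result = score(
    predicted=["Whales--Behavior", "Oceanography"],
    gold=["Whales--Migration", "Oceanography"],
)
```

Here the two heading sets disagree on the subdivision of "Whales", so the exact score is penalized while the concept score is not, which is the kind of divergence the concordance figures describe.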

In an initial demonstration, a low-rank fine-tuned 300M-parameter on-device embedder improved cross-lingual retrieval. On the development set it surpassed a 3,072-dimensional hosted embedder, reaching an exact recall@200 of 0.659 versus 0.623. However, the per-language panel shows that these gains are not uniform across languages. Held-out test validation and end-to-end confirmation remain priorities for future work.
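For readers unfamiliar with the metric, recall@k for the retrieval task can be sketched as below, together with a per-language average of the kind a language panel would report. The abstract does not say how scores are aggregated, so averaging per record within each language is an assumption, and the names `recall_at_k` and `recall_by_language` and the record tuples are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of gold headings found among the top-k retrieved headings."""
    if not gold:
        return 0.0
    return len(gold & set(ranked[:k])) / len(gold)


def recall_by_language(
    records: Iterable[Tuple[str, Sequence[str], Set[str]]], k: int
) -> Dict[str, float]:
    """Mean recall@k per language.

    ASSUMPTION: scores are averaged per record within each language;
    the benchmark's actual aggregation is not stated in the abstract.
    """
    per_lang: Dict[str, List[float]] = defaultdict(list)
    for lang, ranked, gold in records:
        per_lang[lang].append(recall_at_k(ranked, gold, k))
    return {lang: sum(vals) / len(vals) for lang, vals in per_lang.items()}


# Toy records: (language code, ranked retrieval output, gold headings).
records = [
    ("en", ["Whales", "Oceanography"], {"Whales"}),
    ("en", ["Birds", "Fishes"], {"Reptiles"}),
    ("de", ["Philosophy"], {"Philosophy"}),
]
panel = recall_by_language(records, k=2)
```

Reporting the metric per language, rather than as a single pooled number, is what makes non-uniform gains visible.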


Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
