arXiv

PubTables-v2: A new large-scale dataset for full-page and multi-page table extraction

June 3, 2026 · Brandon Smock, Valerie Faucon-Morin, Max Sokolov, Libin Liang, Tayyibah Khanam, Amrit Ramesh, Maury Courtland

Title: PubTables-v2: A Scalable New Dataset for Full-Page and Multi-Page Table Extraction

Original: arXiv:2512.10888v3 (announce type: replace). Abstract: Table extraction (TE) is a key challenge in document understanding. Traditional approaches detect tables first, then recognize their structure. Recently, interest has surged in developing methods, such as vision-language models (VLMs), to extract tables directly in their full page or document context. However, a lack of annotated data has made progress difficult to demonstrate. To address this, we create a new large-scale dataset, PubTables-v2. PubTables-v2 unifies TE across various levels of surrounding context and, notably, is the first benchmark for multi-page TE. Our evaluations reveal that while current frontier models strongly outperform ($+0.354\ \textrm{GriTS}_{\textrm{Con}}$) small models on the most complex task (full-document multi-page TE), this gap can be closed or even reversed ($-0.056\ \textrm{GriTS}_{\textrm{Con}}$) on narrower tasks (cropped table extraction) with targeted training. Data is available at https://huggingface.co/datasets/kensho/PubTables-v2. Code and models will be released.

Rewrite:

Table extraction (TE) remains a key challenge in document understanding. Traditional methods work in two steps: they first detect table regions and then recognize each table's structure. Recent attention has shifted toward methods such as vision-language models (VLMs) that extract tables directly in the context of a full page or document. However, a lack of annotated data has made progress on such methods difficult to demonstrate.

To address this gap, we introduce PubTables-v2, a new large-scale dataset. It unifies table extraction across several levels of surrounding context and is the first benchmark for multi-page table extraction. Our evaluations show that current frontier models strongly outperform small models ($+0.354\ \textrm{GriTS}_{\textrm{Con}}$) on the most complex task, full-document multi-page TE, but that this gap can be closed or even reversed ($-0.056\ \textrm{GriTS}_{\textrm{Con}}$) on narrower tasks, such as cropped table extraction, with targeted training. The dataset is available at https://huggingface.co/datasets/kensho/PubTables-v2. Code and models will be released.

Source: arXiv · Generated at: 2026-06-03 00:00:00 UTC