arXiv

Towards Pretraining Text Encoders for TabPFN

Title: Towards Pretraining Text Encoders for TabPFN

Original: arXiv:2606.04876v1 (Announce Type: new)

Abstract: Tabular foundation models, such as TabPFN, achieve strong performance on tabular datasets with numerical and categorical data, but do not natively handle high-cardinality text features. Standard pipelines, therefore, embed text with a language model and compress the resulting vectors with PCA into a small number of scalar features before inputting them into TabPFN. This creates an information bottleneck: most embedding dimensions are discarded, and the compressed representation must then be expanded again by TabPFN's feature encoder. End-to-end alternatives can avoid PCA, but they require large amounts of pretraining data containing text cells and usually perform subpar compared to tabular foundation models that were pretrained on large amounts of synthetic data. Inspired by modality-alignment approaches like LLaVA (vision-to-LLM token projection) and TableGPT-style systems (table-to-LLM token projection), we introduce the TabPFN Text Adapter (text-to-TFM token projection). We freeze both the sentence encoder and TabPFN, and train only a lightweight adapter that maps text embeddings into a short sequence of tokens in TabPFN's embedding space. This design removes the PCA bottleneck, preserves TabPFN's numerical strengths, and is more efficient to train than end-to-end text-tabular pipelines.
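The PCA step in the standard pipeline can be illustrated with a minimal sketch. This is not the authors' code: the helper `pca_compress`, the embedding width of 384, and the choice of 8 components are all assumptions made for illustration, and random vectors stand in for real sentence-encoder output.

```python
import numpy as np


def pca_compress(embeddings: np.ndarray, n_components: int):
    """Project row-wise embeddings onto their top principal components.

    Hypothetical helper for illustration; returns the compressed features
    and the fraction of total variance they retain.
    """
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # SVD of the centered data: the rows of vt are the principal directions.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    compressed = centered @ vt[:n_components].T
    variance = singular_values ** 2
    retained = variance[:n_components].sum() / variance.sum()
    return compressed, float(retained)


rng = np.random.default_rng(0)
# Stand-in for sentence-encoder output: 200 text cells, 384 dimensions (assumed sizes).
text_embeddings = rng.normal(size=(200, 384))
features, retained_variance = pca_compress(text_embeddings, n_components=8)

print(features.shape)  # one row per text cell, 8 scalar features each
print(f"fraction of variance retained: {retained_variance:.3f}")
```

Each text cell ends up as 8 scalars instead of 384, which is the bottleneck the abstract describes. Isotropic random data spreads variance evenly across directions, so the retained fraction printed here should not be read as representative of real text embeddings.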

Rewrite: Title: Advancing Text Encoder Pretraining for TabPFN

Abstract: Tabular foundation models such as TabPFN perform well on datasets with numerical and categorical variables, but they lack native support for high-cardinality text features. Conventional workflows therefore use a language model to generate text embeddings, which are then reduced to a few scalar features via Principal Component Analysis (PCA) before being fed into TabPFN. This introduces an information bottleneck: most embedding dimensions are discarded, and TabPFN’s feature encoder must then expand the compressed representation again. End-to-end methods can bypass PCA, but they require large pretraining datasets that contain text cells, and they generally underperform tabular foundation models pretrained on large amounts of synthetic data. Drawing inspiration from modality-alignment techniques such as LLaVA (which projects vision features to LLM tokens) and TableGPT-style systems (which project table data to LLM tokens), the authors propose the TabPFN Text Adapter, which projects text into the token space of the tabular foundation model. Both the sentence encoder and TabPFN are kept frozen, and only a lightweight adapter is trained; it maps text embeddings to a short sequence of tokens in TabPFN’s embedding space. This approach removes the PCA bottleneck, preserves TabPFN’s numerical strengths, and is more efficient to train than end-to-end text-tabular pipelines.
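The adapter's input and output shapes can be made concrete with a minimal sketch. The abstract does not describe the adapter's architecture, so the single linear layer, the class name `TextAdapter`, and all sizes below (384-dimensional text embeddings, 4 tokens, token width 192) are assumptions for illustration only. The sketch covers the forward pass; in the setup the abstract describes, only the adapter's parameters would receive gradient updates, while the sentence encoder and TabPFN stay fixed.

```python
import numpy as np


class TextAdapter:
    """Hypothetical linear adapter: one text embedding -> n_tokens vectors of width d_model."""

    def __init__(self, d_text: int, d_model: int, n_tokens: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.n_tokens = n_tokens
        self.d_model = d_model
        # These two arrays are the only trainable parameters in this sketch.
        self.weight = rng.normal(scale=d_text ** -0.5, size=(d_text, n_tokens * d_model))
        self.bias = np.zeros(n_tokens * d_model)

    def __call__(self, text_embeddings: np.ndarray) -> np.ndarray:
        # Project every embedding, then split the result into a short token sequence.
        flat = text_embeddings @ self.weight + self.bias
        return flat.reshape(len(text_embeddings), self.n_tokens, self.d_model)


rng = np.random.default_rng(1)
# Stand-in for frozen sentence-encoder output: 200 text cells, 384 dimensions (assumed).
text_embeddings = rng.normal(size=(200, 384))

adapter = TextAdapter(d_text=384, d_model=192, n_tokens=4)
tokens = adapter(text_embeddings)
print(tokens.shape)  # (cells, tokens per cell, token width)
```

Unlike a PCA reduction to a handful of scalars, every input dimension contributes to every output token here, and the projection is learned rather than fixed by the variance structure of the embeddings.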


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
