arXiv

Residual Decoder Adapter: ID-Preserving Tokenizer Adaption for Autoregressive Text Rendering

Title: Residual Decoder Adapter: ID-Preserving Tokenizer Adaption for Autoregressive Text Rendering

Original: arXiv:2606.01911v1 Announce Type: new

Abstract: Visual Autoregressive (AR) models generate images by predicting discrete tokens that are decoded by a visual tokenizer. Despite demonstrating strong overall image generation ability, they still underperform on text rendering, producing blurred strokes and disrupted letter shapes. In this work, we trace this limitation to the visual tokenizer, which struggles to reconstruct fine-grained detail. Improving the tokenizer is straightforward but expensive, as it necessitates retraining both the tokenizer and the AR model. Can we improve the text rendering performance of AR models without retraining the existing tokenizer and AR model? To achieve this, we propose the Residual Decoder Adapter (RDA), which upgrades an existing tokenizer post hoc without changing its token space. Specifically, it refines the decoder output of the visual tokenizer by introducing two novel components: (i) a paired codebook that shares the token distribution with the original one; (ii) a parallel branch that learns the small differences (residuals) between the reconstructed and ground-truth images in pixel space. This residual design allows us to enhance the tokenizer non-invasively while preserving compatibility with prior AR models. RDA substantially improves text rendering. For instance, it raises the OCR accuracy of a fine-tuned Janus-Pro from 24.52% to 58.26% (TextVisionBlend) and from 12.75% to 36.81% (StyledTextSynth) on the competitive TextAtlas benchmark. The code is available at https://github.com/CSU-JPG/RDA
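To put the reported numbers in perspective, the short Python snippet below computes the absolute gain (in percentage points) and the relative factor for the two TextAtlas subsets. The four accuracy values come from the abstract; the `gains` helper and the dictionary layout are illustrative names only.

```python
# OCR accuracy (%) before and after RDA, as reported in the abstract.
reported = {
    "TextVisionBlend": (24.52, 58.26),
    "StyledTextSynth": (12.75, 36.81),
}


def gains(before, after):
    """Return (absolute gain in percentage points, relative factor)."""
    return after - before, after / before


for name, (before, after) in reported.items():
    absolute, factor = gains(before, after)
    print(f"{name}: +{absolute:.2f} points, {factor:.2f}x")
```

Both subsets improve by more than a factor of two, although the absolute accuracy on StyledTextSynth remains well below that on TextVisionBlend.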

Rewrite:

Abstract: Visual Autoregressive (AR) models produce images by predicting discrete tokens, which are then decoded by a visual tokenizer. Although these models show strong general image synthesis ability, they often falter when rendering text, producing blurred strokes and distorted character shapes. This study traces the deficiency to the visual tokenizer, which struggles to reconstruct fine-grained detail. Improving the tokenizer directly is the obvious fix, but it is expensive because both the tokenizer and the AR model must then be retrained. Is it possible to improve the text rendering quality of AR models without retraining the existing tokenizer and AR model? To address this, we introduce the Residual Decoder Adapter (RDA), which upgrades an existing tokenizer post hoc without altering its token space. RDA refines the decoder output of the visual tokenizer through two components: (i) a paired codebook that shares the token distribution of the original codebook, and (ii) a parallel branch that learns the residual, that is, the small pixel-space difference between the reconstructed and ground-truth images. This design upgrades the tokenizer non-invasively and keeps it compatible with existing AR models. RDA delivers substantial gains in text rendering. On the competitive TextAtlas benchmark, applying RDA to a fine-tuned Janus-Pro model raises OCR accuracy from 24.52% to 58.26% on TextVisionBlend and from 12.75% to 36.81% on StyledTextSynth. The code is available at https://github.com/CSU-JPG/RDA
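The residual idea can be illustrated with a deliberately tiny Python sketch. This is not the RDA architecture, whose details the abstract does not give. It is a toy under loud assumptions: the frozen decoder is a lookup table from token ID to one pixel value, and the learned parallel branch is replaced by a per-token mean residual. All names (`BASE_CODEBOOK`, `fit_paired_codebook`, `adapted_decode`) and all data values are hypothetical.

```python
from collections import defaultdict

# Hypothetical frozen tokenizer decoder: token ID -> coarse pixel value.
# Toy stand-in; a real decoder is a neural network over token grids.
BASE_CODEBOOK = {0: 0.0, 1: 0.5, 2: 1.0}


def base_decode(token_ids):
    """Decode with the frozen codebook only (never modified)."""
    return [BASE_CODEBOOK[t] for t in token_ids]


def fit_paired_codebook(samples):
    """Learn one residual per token ID: mean(ground truth - base output).

    `samples` is a list of (token_ids, ground_truth_pixels) pairs.
    The paired codebook is indexed by the same IDs as the frozen one,
    so the token space seen by the AR model does not change.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for token_ids, target in samples:
        for t, recon, gt in zip(token_ids, base_decode(token_ids), target):
            sums[t] += gt - recon
            counts[t] += 1
    return {t: sums[t] / counts[t] for t in sums}


def adapted_decode(token_ids, paired_codebook):
    """Frozen base output plus the residual correction for each token."""
    return [
        base + paired_codebook.get(t, 0.0)
        for t, base in zip(token_ids, base_decode(token_ids))
    ]


def mse(a, b):
    """Mean squared error between two equal-length pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Made-up training pairs: token IDs and the pixels they should decode to.
train = [
    ([0, 1, 2, 1], [0.1, 0.4, 0.9, 0.4]),
    ([2, 0, 1, 2], [0.9, 0.1, 0.4, 0.9]),
]
paired = fit_paired_codebook(train)

tokens, target = [1, 2, 0], [0.4, 0.9, 0.1]
print("base error:   ", mse(base_decode(tokens), target))
print("adapted error:", mse(adapted_decode(tokens, paired), target))
```

The point of the sketch is the interface rather than the model: the token IDs and the frozen codebook are untouched, and the correction is added to the base reconstruction in pixel space, which is why an AR model trained on the original tokens would need no retraining.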


Source: arXiv

Generated at: 2026-06-02 00:00:00 UTC
