A Lightweight Context-Driven Training-Free Network for Scene Text Segmentation and Recognition
Abstract:
Current scene text recognition (STR) systems frequently rely on large end-to-end architectures that demand significant training effort. These models are often too resource-intensive for real-time applications, making deployment infeasible under strict limits on latency, memory, and computational power. To overcome these hurdles, we introduce a novel, plug-and-play framework that operates without training. This approach capitalizes on the capabilities of pre-trained text recognizers while reducing redundant processing.
Our method employs context-aware mechanisms and incorporates an attention-based segmentation module to refine candidate text regions at the pixel level, thereby enhancing the accuracy of subsequent recognition tasks. Rather than executing conventional text detection, which typically involves block-level comparisons between feature maps and the source image, or utilizing pretrained captioners to extract contextual data, our framework generates word predictions directly from the scene's context. Candidate texts are then assessed based on semantic and lexical criteria to determine a final score.
If a prediction meets or exceeds a confidence threshold, it bypasses the computationally heavy end-to-end STR profiling process. This optimization significantly accelerates inference and minimizes unnecessary calculations. Evaluations on public benchmarks indicate that our proposed paradigm delivers performance comparable to state-of-the-art systems while consuming substantially fewer resources.
Our code is available at: https://ritabrata04.github.io/Context-driven-STR/.
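The confidence-gated shortcut described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the semantic and lexical criteria are computed or combined, so everything below is an assumption made for illustration: the lexical criterion is approximated with character-level similarity from Python's `difflib`, the semantic score is taken as a given number in [0, 1], and the weight `alpha`, the threshold `0.8`, and the names `lexical_score`, `combined_score`, and `recognize` are hypothetical rather than taken from the authors' code.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple


def lexical_score(candidate: str, lexicon: List[str]) -> float:
    """Best character-level similarity between the candidate and any lexicon word.

    Assumption: difflib's ratio stands in for the (unspecified) lexical criterion.
    """
    if not lexicon:
        return 0.0
    return max(
        SequenceMatcher(None, candidate.lower(), word.lower()).ratio()
        for word in lexicon
    )


def combined_score(candidate: str, semantic: float,
                   lexicon: List[str], alpha: float = 0.5) -> float:
    """Weighted mix of semantic and lexical scores (alpha is an assumed weight)."""
    return alpha * semantic + (1.0 - alpha) * lexical_score(candidate, lexicon)


def recognize(candidates: Dict[str, float], lexicon: List[str],
              fallback: Callable[[], str], threshold: float = 0.8,
              alpha: float = 0.5) -> Tuple[str, bool]:
    """Return (text, used_shortcut).

    `candidates` maps each context-derived word to its semantic score in [0, 1].
    If the best combined score reaches the threshold, the heavy recognizer
    (`fallback`) is never called; otherwise it is invoked.
    """
    if candidates:
        best = max(
            candidates,
            key=lambda w: combined_score(w, candidates[w], lexicon, alpha),
        )
        if combined_score(best, candidates[best], lexicon, alpha) >= threshold:
            return best, True
    return fallback(), False


# Usage: a confident context prediction skips the expensive recognizer.
text, skipped = recognize(
    {"exit": 0.9, "exil": 0.4},
    lexicon=["exit", "entrance"],
    fallback=lambda: "full-STR-output",
)
```

In this sketch the expensive recognizer is passed in as a callable so that it only runs when no candidate clears the threshold, which mirrors the bypass behaviour the abstract describes without claiming to reproduce the paper's actual scoring.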
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




