Title: T-CLIP: Enabling Thermal Perception for Contrastive Language-Image Pretraining
Abstract: Thermal imaging is a robust alternative to visible-spectrum vision in poor lighting and adverse weather, yet established vision-language models such as CLIP struggle to align thermal imagery with text. This gap stems from a fundamental deficit in thermal perception. Our analysis identifies three primary obstacles: the scarcity of captioned thermal datasets, the inability of conventional Large Language Models (LLMs) to reason about thermal phenomena, and a representational conflict in which global scene context and object-specific heat signatures compete within a single embedding space. To address these obstacles, we present IR-Cap, the first physics-aware pipeline and dataset for thermal captioning, which provides both global and fine-grained thermal descriptions across three public benchmarks. We also propose T-CLIP, a decoupled dual-LoRA framework that adapts CLIP separately for scene-level and object-level thermal understanding. Experiments show that T-CLIP consistently outperforms existing baselines on cross-modal retrieval across all three thermal benchmarks. Finally, we illustrate the model's potential with an exploratory application to text-conditioned thermal image generation.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
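
The abstract describes a decoupled dual-LoRA design but gives no implementation details. The sketch below is only an illustration of the general idea, assuming one frozen linear layer shared by two independent low-rank adapters, one for a scene-level branch and one for an object-level branch. The class names (`LoRAAdapter`, `DualLoRALinear`), the branch labels, the rank, and the scaling are hypothetical choices for this example and are not taken from the paper. It uses plain Python lists rather than a deep-learning library so it runs without dependencies.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRAAdapter:
    """Low-rank update (alpha / r) * B @ A for one branch (hypothetical helper)."""

    def __init__(self, d_in, d_out, r=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        # Common LoRA initialisation: A is small and random, B is zero,
        # so the adapter contributes nothing until it is trained.
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    def delta(self, x):
        return [self.scale * v for v in matvec(self.B, matvec(self.A, x))]


class DualLoRALinear:
    """Frozen linear layer with two decoupled adapters (assumed structure)."""

    def __init__(self, weight, r=2, alpha=4.0):
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen base weights, never updated
        self.adapters = {
            "scene": LoRAAdapter(d_in, d_out, r, alpha, seed=1),
            "object": LoRAAdapter(d_in, d_out, r, alpha, seed=2),
        }

    def forward(self, x, branch):
        # Only the selected branch's adapter is applied, so updating one
        # adapter cannot change the other branch's output.
        base = matvec(self.weight, x)
        update = self.adapters[branch].delta(x)
        return [b + u for b, u in zip(base, update)]


layer = DualLoRALinear([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
x = [0.5, -1.0, 2.0]
scene_out = layer.forward(x, "scene")
object_out = layer.forward(x, "object")
```

Because each branch owns its own `A` and `B` matrices, scene-level and object-level adaptation do not share trainable parameters in this sketch, which is one simple way to keep the two kinds of thermal cues from competing in a single set of adapter weights.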





