arXiv

Vision Transformers and Convolutional Neural Networks for Land Use Scene Classification

June 4, 2026 · Arun D. Kulkarni

Title: Comparative Analysis of Vision Transformers and Convolutional Neural Networks in Land Use Scene Classification

Land Use Scene Classification (LUSC) from remote sensing imagery is central to sustainable resource management, urban development, and environmental monitoring. Convolutional Neural Networks (CNNs) have long led the field because they extract local spatial features well, but Vision Transformers (ViTs) now offer an alternative. ViTs use self-attention to model long-range dependencies between image regions, which can give a fuller picture of global context.
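To make the self-attention idea concrete, the following is a minimal NumPy sketch of single-head scaled dot-product attention over image patches. It is an illustration only, not the model evaluated in the study: the helper names (`patchify`, `self_attention`), the 64x64 random image, the 16-pixel patch size, and the random projection weights are all assumptions chosen for brevity.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    x = image[:rows * patch, :cols * patch]
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    x = x.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(rows * cols, patch * patch * c)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (no masking, no bias)."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Subtract the row maximum for a numerically stable softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Hypothetical input: a random 64x64 RGB "scene" cut into 16x16 patches.
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
tokens = patchify(image, 16)

d_model = 32  # assumed projection width, chosen arbitrarily
w_q, w_k, w_v = (
    rng.normal(scale=0.02, size=(tokens.shape[1], d_model)) for _ in range(3)
)
out, weights = self_attention(tokens, w_q, w_k, w_v)
```

Each row of `weights` spans every patch in the image, so a patch in one corner can draw directly on a patch in the opposite corner within a single layer. A convolution, by contrast, only sees its local receptive field and needs stacked layers to reach the same distance.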

This study compares CNN-based architectures with Vision Transformers on LUSC tasks. Using benchmark datasets such as the UC Merced Land Use and EuroSAT Land Use collections, we assessed representative models, including AlexNet and the Vision Transformer. The analysis focused on key performance metrics: classification accuracy, precision, recall, F1-score, and computational complexity.
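As a rough illustration of how the classification metrics named above are typically derived, the sketch below computes accuracy, precision, recall, and F1-score from a confusion matrix. Macro-averaging across classes is an assumption here, since the summary does not state which averaging scheme the study used, and the function names and toy labels are hypothetical.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns index the predicted class."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

def classification_metrics(y_true, y_pred, num_classes):
    """Accuracy plus macro-averaged precision, recall, and F1 (assumed scheme)."""
    cm = confusion_matrix(np.asarray(y_true), np.asarray(y_pred), num_classes)
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)  # how often each class was predicted
    actual = cm.sum(axis=1)     # how often each class truly occurred
    # Classes with a zero denominator contribute 0 instead of NaN.
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return {
        "accuracy": tp.sum() / cm.sum(),
        "precision": precision.mean(),
        "recall": recall.mean(),
        "f1": f1.mean(),
    }

# Toy labels for three hypothetical land use classes (0, 1, 2).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
metrics = classification_metrics(y_true, y_pred, num_classes=3)
print(metrics)
```

Macro-averaging weights every class equally, which suits roughly class-balanced benchmarks. On imbalanced data a weighted average may be more representative.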

The experimental outcomes reveal distinct advantages for each architecture depending on the data context. CNNs demonstrated robust performance on datasets characterized by strong local textures and limited training samples. In contrast, Vision Transformers excelled at capturing global spatial relationships within complex scenes, provided that ample training data was available. However, ViTs generally demand higher computational resources and larger datasets to reach their full potential. These insights highlight the specific strengths and constraints of both approaches, offering practical guidance for selecting the most suitable model for remote sensing applications.

Source: arXiv Generated at: 2026-06-04 00:00:00 UTC