PSViT: A Methodology for Structurally Pruning Spiking Vision Transformers
Title: PSViT: A Framework for Structurally Pruning Spiking Vision Transformers
Abstract:
Spiking Vision Transformers (SViTs) have emerged as highly promising, energy-efficient alternatives to traditional ViTs, delivering state-of-the-art results in vision-based tasks. However, the substantial model size of these architectures hinders their deployment on embedded platforms with limited resources, highlighting an urgent need for effective model compression. While pruning is a leading compression strategy, current state-of-the-art methods primarily rely on unstructured pruning. This approach necessitates specialized hardware designed to handle sparse data patterns to achieve maximum efficiency, rendering it impractical for scalable implementation.
To overcome these limitations, we introduce PSViT, a novel methodology designed to perform structured pruning on SViT models. This approach enables efficient inference acceleration by leveraging existing, widely adopted computing architectures. PSViT operates through a multi-step process: it begins with uniform channel-wise filter pruning to structurally remove insignificant weights, followed by a sensitivity analysis to assess how pruning individual layers affects both network size and accuracy. Finally, it applies fine-grained channel-wise pruning, guided by the sensitivity results and the specific network architecture.
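The three stages described above can be sketched in a few lines of plain Python. This is only an illustrative toy, not the authors' implementation: the L1-norm importance score, the `l1_mass` proxy that stands in for validation accuracy, the `allocate_ratios` heuristic, and every function name here are assumptions introduced for the example. The sketch also ignores the architecture-specific constraints the abstract mentions, such as keeping the input width of the next layer consistent after channels are removed.

```python
from typing import Callable, Dict, List, Tuple

Matrix = List[List[float]]  # one row per output channel (filter)


def channel_l1_norms(weight: Matrix) -> List[float]:
    """L1 norm of each output channel, used here as an assumed importance score."""
    return [sum(abs(w) for w in row) for row in weight]


def prune_channels(weight: Matrix, ratio: float) -> Tuple[Matrix, List[int]]:
    """Remove the `ratio` fraction of channels with the smallest L1 norm.

    Whole rows are dropped, so the result is a smaller dense matrix
    rather than a sparse one. At least one channel is always kept.
    """
    n = len(weight)
    n_keep = max(1, n - int(n * ratio))
    norms = channel_l1_norms(weight)
    ranked = sorted(range(n), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore the original channel order
    return [weight[i] for i in kept], kept


def layer_sensitivity(
    layers: Dict[str, Matrix],
    evaluate: Callable[[Dict[str, Matrix]], float],
    ratio: float,
) -> Dict[str, float]:
    """Score drop observed when only one layer is pruned at a uniform `ratio`."""
    baseline = evaluate(layers)
    drops = {}
    for name, weight in layers.items():
        trial = dict(layers)  # shallow copy; other layers stay unpruned
        trial[name], _ = prune_channels(weight, ratio)
        drops[name] = baseline - evaluate(trial)
    return drops


def allocate_ratios(
    drops: Dict[str, float], base_ratio: float, max_ratio: float = 0.9
) -> Dict[str, float]:
    """Assumed heuristic: the most sensitive layer gets `base_ratio`,
    less sensitive layers get up to twice that, capped at `max_ratio`."""
    worst = max(drops.values())
    if worst <= 0:
        return {name: base_ratio for name in drops}
    return {
        name: min(max_ratio, base_ratio * (2.0 - max(d, 0.0) / worst))
        for name, d in drops.items()
    }


def l1_mass(layers: Dict[str, Matrix]) -> float:
    """Toy stand-in for validation accuracy: total L1 mass that survives."""
    return sum(sum(channel_l1_norms(w)) for w in layers.values())


# Hypothetical two-layer "model" with four output channels per layer.
layers = {
    "attn": [[1.0, -1.0], [0.1, 0.0], [2.0, 2.0], [0.0, 0.2]],
    "mlp": [[3.0, 3.0], [2.0, -2.0], [0.5, 0.5], [1.0, 1.0]],
}

drops = layer_sensitivity(layers, l1_mass, ratio=0.5)   # stages 1 and 2
ratios = allocate_ratios(drops, base_ratio=0.25)        # stage 3: per-layer ratios
pruned = {name: prune_channels(w, ratios[name])[0] for name, w in layers.items()}
```

In this toy, pruning "mlp" removes far more L1 mass than pruning "attn", so "attn" is assigned the larger pruning ratio. In a real setting the `evaluate` callback would measure accuracy on held-out data.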
Experimental evaluations demonstrate that PSViT achieves a 22.4% reduction in memory usage via a single-shot pruning process. Crucially, it preserves high accuracy, staying within 3 percentage points of the original non-pruned SViT model, which scored 73.3% on ImageNet-1K. Specifically, the pruned model achieved 70.3% accuracy without fine-tuning (a drop of 3.0 points) and 72.8% with fine-tuning (a drop of 0.5 points). These findings underscore PSViT’s potential to facilitate the efficient deployment of SViTs in resource-constrained environments.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



