arXiv

Spike-Aware C++ INT8 Inference for Sparse Spiking Language Models on Commodity CPUs

June 3, 2026 · Ting Liu

Title: Optimizing C++ INT8 Inference for Sparse Spiking Language Models on Standard Processors

Abstract

Unlike dense Transformer architectures, Spiking Language Models (SLMs) inherently generate activation sparsity, a property that conventional runtimes often fail to exploit. This study examines that property through a systems engineering lens. Using the SymbolicLight V1 spike-gated language model family, we develop a C++ inference runtime for commodity CPUs that integrates sparse binary spike states directly into the execution pipeline rather than relying only on post-hoc weight compression.
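To illustrate why binary spike states can be exploited at execution time, the following is a minimal scalar sketch, not the runtime described here. It assumes a column-major FP32 weight matrix and a 0/1 spike vector; the function names `active_indices` and `spike_matvec` are hypothetical. Because each input is either 0 or 1, the matrix-vector product reduces to summing the weight columns of the neurons that fired, with no multiplications and no work for silent neurons.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: indices of the input neurons that fired (spike != 0).
std::vector<std::size_t> active_indices(const std::vector<std::uint8_t>& spikes) {
    std::vector<std::size_t> idx;
    for (std::size_t j = 0; j < spikes.size(); ++j) {
        if (spikes[j] != 0) idx.push_back(j);
    }
    return idx;
}

// Computes y = W * s for a binary spike vector s.
// W is stored column-major (assumption): element (i, j) lives at w[j * rows + i].
// Only the columns of active inputs are read, so cost scales with spike count.
std::vector<float> spike_matvec(const std::vector<float>& w_colmajor,
                                std::size_t rows, std::size_t cols,
                                const std::vector<std::uint8_t>& spikes) {
    assert(w_colmajor.size() == rows * cols);
    assert(spikes.size() == cols);
    std::vector<float> y(rows, 0.0f);
    for (std::size_t j : active_indices(spikes)) {
        const float* col = &w_colmajor[j * rows];
        for (std::size_t i = 0; i < rows; ++i) {
            y[i] += col[i];  // spike value is 1, so the product is just the weight
        }
    }
    return y;
}
```

The column-major layout is chosen here so that each active input touches one contiguous run of memory; a real kernel would vectorize the inner loop.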

The proposed runtime architecture features a manifest-driven weight loader, a hybrid row/column memory layout, and optimized AVX2/FMA kernels. It utilizes per-channel symmetric INT8 quantization alongside integer-domain accumulation to handle spike-conditioned sparse pathways efficiently. Performance evaluations were conducted on an AMD Ryzen 7 5800X processor. An initial scalar FP32 baseline achieved a decoding speed of 9.5 tokens/s. By employing a mixed-layout AVX2 FP32 approach, this throughput increased to 14.7 tokens/s. Further optimization using AVX2 INT8 on the same step-30k model export boosted performance to 19.9 tokens/s, while simultaneously shrinking the weight footprint from 3.49 GB to 1.06 GB.
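The combination of per-channel symmetric INT8 quantization and integer-domain accumulation can be sketched as follows. This is a scalar illustration under stated assumptions, not the AVX2 kernel evaluated above: one scale per output row, a row-major INT8 layout, values clamped to [-127, 127], and a precomputed list of active spike indices. The names `QuantizedMatrix`, `quantize_per_channel`, and `spike_matvec_int8` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: INT8 weights plus one FP32 scale per output row.
struct QuantizedMatrix {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<std::int8_t> q;  // row-major quantized weights
    std::vector<float> scale;    // scale[r] maps INT8 values of row r back to FP32
};

// Symmetric per-channel quantization: scale[r] = max|w_r| / 127, zero-point is 0.
QuantizedMatrix quantize_per_channel(const std::vector<float>& w,
                                     std::size_t rows, std::size_t cols) {
    assert(w.size() == rows * cols);
    QuantizedMatrix m;
    m.rows = rows;
    m.cols = cols;
    m.q.resize(rows * cols);
    m.scale.resize(rows);
    for (std::size_t r = 0; r < rows; ++r) {
        float max_abs = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
            max_abs = std::max(max_abs, std::fabs(w[r * cols + c]));
        }
        const float s = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
        m.scale[r] = s;
        for (std::size_t c = 0; c < cols; ++c) {
            long v = std::lround(w[r * cols + c] / s);
            v = std::min(127L, std::max(-127L, v));
            m.q[r * cols + c] = static_cast<std::int8_t>(v);
        }
    }
    return m;
}

// y = W * s for binary spikes: sum INT8 weights of active inputs in int32,
// then apply the row scale once at the end.
std::vector<float> spike_matvec_int8(const QuantizedMatrix& m,
                                     const std::vector<std::size_t>& active) {
    std::vector<float> y(m.rows, 0.0f);
    for (std::size_t r = 0; r < m.rows; ++r) {
        std::int32_t acc = 0;
        for (std::size_t j : active) {
            assert(j < m.cols);
            acc += m.q[r * m.cols + j];
        }
        y[r] = m.scale[r] * static_cast<float>(acc);
    }
    return y;
}
```

Keeping the accumulator in `int32` means the only floating-point operation per output row is the final rescale, which is the usual motivation for integer-domain accumulation.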

In benchmarks involving an 874M-parameter INT8 export trained for 186k steps, our C++ runtime decoded at 22.63 tokens/s on a single CPU thread. This performance surpasses several comparable models running under llama.cpp: TinyLlama-1.1B Q8_0 achieved 16.31 tokens/s, Falcon3-1B Q8_0 reached 11.26 tokens/s, and Qwen2.5-1.5B Q8_0 managed 9.70 tokens/s. Multithreading significantly enhanced throughput, reaching 47.90 tokens/s with four CPU threads. Additionally, 512-token prefill speeds improved from 29.86 tokens/s on one thread to 94.68 tokens/s across eight threads.
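The scaling behaviour implied by these figures can be checked with a small helper. The sketch below assumes that the single-thread and multi-thread numbers refer to the same model export and workload; the function names `speedup` and `parallel_efficiency` are hypothetical.

```cpp
#include <cassert>

// Throughput ratio of a run over a baseline run (both in tokens/s).
double speedup(double tokens_per_s, double baseline_tokens_per_s) {
    assert(baseline_tokens_per_s > 0.0);
    return tokens_per_s / baseline_tokens_per_s;
}

// Fraction of ideal linear scaling achieved with `threads` threads.
double parallel_efficiency(double tokens_per_s,
                           double single_thread_tokens_per_s,
                           int threads) {
    assert(threads > 0);
    return speedup(tokens_per_s, single_thread_tokens_per_s) / threads;
}
```

With the reported numbers, `speedup(47.90, 22.63)` is about 2.12, roughly 53% of ideal four-thread scaling for decoding, and `speedup(94.68, 29.86)` is about 3.17, roughly 40% of ideal eight-thread scaling for prefill. Against the fastest llama.cpp baseline listed, `speedup(22.63, 16.31)` is about 1.39.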

However, these efficiency gains come at the expense of model quality. The SLM recorded a WikiText-2 perplexity of 24.80, which is higher (worse) than that of the dense baselines in the same evaluation suite. We present these findings as an inference-systems case study of sparse language runtimes. The long-term objective is to support embodied and edge agents, which could benefit from local, low-core-count inference close to sensors and actuators. While spike-aware execution demonstrates clear advantages for CPU throughput and memory footprint in sparse spiking models, several challenges remain unresolved, including model quality refinement, controlled dense training comparisons, embodied-task assessments, and precise CPU energy measurements.
