arXiv

Spike-Aware C++ INT8 Inference for Sparse Spiking Language Models on Commodity CPUs

Abstract

Unlike dense Transformer architectures, Spiking Language Models (SLMs) inherently generate activation sparsity, a characteristic that traditional runtimes often fail to leverage effectively. This study examines this property through a systems engineering lens. Leveraging the SymbolicLight V1 spike-gated language model family, we have developed a C++ inference runtime for commodity CPUs that integrates sparse binary spike states directly into the execution pipeline, moving beyond mere post-hoc weight compression.

The proposed runtime architecture features a manifest-driven weight loader, a hybrid row/column memory layout, and optimized AVX2/FMA kernels. It utilizes per-channel symmetric INT8 quantization alongside integer-domain accumulation to handle spike-conditioned sparse pathways efficiently. Performance evaluations were conducted on an AMD Ryzen 7 5800X processor. An initial scalar FP32 baseline achieved a decoding speed of 9.5 tokens/s. By employing a mixed-layout AVX2 FP32 approach, this throughput increased to 14.7 tokens/s. Further optimization using AVX2 INT8 on the same step-30k model export boosted performance to 19.9 tokens/s, while simultaneously shrinking the weight footprint from 3.49 GB to 1.06 GB.
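
To make the quantization scheme concrete, here is a minimal C++ sketch of per-channel symmetric INT8 quantization with integer-domain accumulation over binary spikes. It is not the paper's runtime. The names `QuantizedMatrix`, `quantize_per_channel`, and `spike_matvec` are hypothetical, and the choice of one scale per output channel, the column-major storage, and the active-index spike representation are assumptions made for illustration. The scalar loops stand in for the AVX2 kernels.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container (not from the paper). Weights are stored
// column-major so that all weights of one input channel are contiguous,
// which suits an input that is a sparse binary spike vector.
struct QuantizedMatrix {
    std::size_t rows = 0;          // output channels
    std::size_t cols = 0;          // input channels
    std::vector<std::int8_t> q;    // q[c * rows + r]
    std::vector<float> scale;      // assumed: one scale per output channel
};

// Per-channel symmetric quantization: scale_r = max|w_r| / 127, zero-point 0.
// `w` is a row-major FP32 matrix of shape rows x cols.
QuantizedMatrix quantize_per_channel(const std::vector<float>& w,
                                     std::size_t rows, std::size_t cols) {
    assert(w.size() == rows * cols);
    QuantizedMatrix m;
    m.rows = rows;
    m.cols = cols;
    m.q.resize(rows * cols);
    m.scale.resize(rows);
    for (std::size_t r = 0; r < rows; ++r) {
        float max_abs = 0.0f;
        for (std::size_t c = 0; c < cols; ++c)
            max_abs = std::max(max_abs, std::fabs(w[r * cols + c]));
        const float s = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
        m.scale[r] = s;
        for (std::size_t c = 0; c < cols; ++c) {
            const long v = std::lround(w[r * cols + c] / s);
            m.q[c * rows + r] =
                static_cast<std::int8_t>(std::clamp(v, -127L, 127L));
        }
    }
    return m;
}

// Spike-gated mat-vec. With binary spikes, y = W * s reduces to summing the
// weight columns of the active inputs, so inactive inputs cost nothing and
// the inner loop is pure integer addition. Scales are applied once at the end.
std::vector<float> spike_matvec(const QuantizedMatrix& m,
                                const std::vector<std::uint32_t>& active) {
    std::vector<std::int32_t> acc(m.rows, 0);
    for (std::uint32_t c : active) {
        assert(c < m.cols);
        const std::int8_t* col = m.q.data() + static_cast<std::size_t>(c) * m.rows;
        for (std::size_t r = 0; r < m.rows; ++r) acc[r] += col[r];
    }
    std::vector<float> y(m.rows);
    for (std::size_t r = 0; r < m.rows; ++r)
        y[r] = m.scale[r] * static_cast<float>(acc[r]);
    return y;
}
```

In this sketch the work per layer scales with the number of active spikes rather than with the full input width, which is the property a spike-aware runtime aims to exploit. A production kernel would presumably vectorize the column additions and choose layouts per layer.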

In benchmarks of an 874M-parameter INT8 export trained for 186k steps, our C++ runtime decoded at 22.63 tokens/s on a single CPU thread. This exceeds the throughput of several comparable models running under llama.cpp: TinyLlama-1.1B Q8_0 at 16.31 tokens/s, Falcon3-1B Q8_0 at 11.26 tokens/s, and Qwen2.5-1.5B Q8_0 at 9.70 tokens/s. With four CPU threads, throughput rose to 47.90 tokens/s. Additionally, 512-token prefill throughput improved from 29.86 tokens/s on one thread to 94.68 tokens/s on eight threads.
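
The relative gains implied by these figures can be computed with two small helpers. This is only an arithmetic sketch: `speedup` and `parallel_efficiency` are hypothetical names, and it assumes the single-thread and multi-thread numbers were measured on the same model and configuration.

```cpp
#include <cassert>

// Ratio of two throughputs, both in tokens/s.
double speedup(double base_tps, double new_tps) {
    assert(base_tps > 0.0);
    return new_tps / base_tps;
}

// Fraction of ideal linear scaling achieved when using `threads` threads.
double parallel_efficiency(double one_thread_tps, double n_thread_tps,
                           int threads) {
    assert(threads > 0);
    return speedup(one_thread_tps, n_thread_tps) / threads;
}
```

Under that assumption, `speedup(22.63, 47.90)` is about 2.12 on four threads (parallel efficiency about 0.53), and `speedup(29.86, 94.68)` is about 3.17 for prefill on eight threads (efficiency about 0.40). Against the llama.cpp figures, single-thread decode is about 1.39x, 2.01x, and 2.33x the throughput of TinyLlama-1.1B, Falcon3-1B, and Qwen2.5-1.5B respectively.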

However, these efficiency gains come at the expense of model quality. The SLM recorded a WikiText-2 perplexity of 24.80, which is worse (higher) than the dense baselines in the same evaluation suite. We present these findings as an inference-systems case study for sparse language runtimes. The long-term objective is to support embodied and edge agents, which could benefit from local, low-core inference close to sensors and actuators. While spike-aware execution shows clear advantages in CPU throughput and memory footprint for sparse spiking models, several challenges remain open, including model quality, controlled comparisons against densely trained models, embodied-task evaluation, and precise CPU energy measurements.


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
