Large Byte Model: Teaching Language Models About Compiled Code
Abstract:
Malware analysis begins with the raw bytes of an executable program. Traditional tools that "lift" these bytes into higher-level representations, such as assembly code, are costly and error-prone. Standard Large Language Models (LLMs), meanwhile, cannot process raw bytes or answer questions about them. To address this limitation, we introduce the first byte-native LLM. By employing a custom byte tokenizer and a vocabulary expansion technique, the model can accurately answer complex questions about malware binaries, with accuracy ranging from 69% for malware family classification to 98% for architecture classification. Our results show that incorporating domain-specific knowledge during training is critical for this use case; off-the-shelf models fail to deliver the necessary accuracy or insight. We have deployed this emerging solution to a select group of analysts to collect feedback for future enhancements.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



