MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments
Abstract
Accurate file-type classification is a foundational component in numerous operational workflows, including forensic carving, packet inspection, storage indexing, and malware triage. However, existing learned systems, such as Google’s Magika, rely on the assumption of whole-file access at a known offset. This limitation causes them to fail when processing the fragmented inputs typical of these tasks, such as chunked uploads, random disk blocks, header-less carved fragments, or individual packet payloads.
To address this, we present MimeLens, a suite of compact, BERT-style encoders. These models are pretrained on byte windows sampled at uniformly random offsets within each file, eliminating any reliance on privileged head-of-file positioning. The framework offers both standard- and short-context variants. MimeLens accepts byte chunks from any location within a file, requires no headers and no fixed input size, and outputs one of 125 MIME labels from libmagic.
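The random-offset sampling idea can be illustrated with a minimal sketch. This is not the authors' training code: the function name sample_window, the 64-byte window size, and the handling of files shorter than the window are all assumptions made for illustration.

```python
import random


def sample_window(data: bytes, window_size: int, rng: random.Random) -> bytes:
    """Return up to window_size bytes starting at a uniformly random offset.

    Hypothetical helper: the name, signature, and short-file behavior are
    assumptions, not taken from the MimeLens release.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if len(data) <= window_size:
        # Assumption: a file shorter than the window is used whole.
        return data
    # Every valid start offset is equally likely, so the head of the file
    # receives no special treatment.
    start = rng.randrange(len(data) - window_size + 1)
    return data[start:start + window_size]


rng = random.Random(0)          # seeded only to make the sketch repeatable
blob = bytes(range(256)) * 4    # stand-in for the contents of a file
chunk = sample_window(blob, 64, rng)
```

Because the start offset is drawn uniformly, most sampled windows fall mid-file and contain no magic bytes or headers, which is the condition the abstract describes for fragmented inputs.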
When benchmarked on the clean heads of complete files, MimeLens outperforms Magika v1.1 by 10.7 percentage points in top-1 accuracy on libmagic-labeled data. It also continues to classify inputs on which Magika fails, such as a single mid-stream UDP packet. On random mid-file disk blocks, MimeLens achieves more than twice the accuracy of both libmagic and Magika.
The primary trade-off is latency: on CPUs, MimeLens is approximately one to two orders of magnitude slower per sample than Magika. However, it reaches performance parity on consumer GPUs or in batch processing modes. All trained checkpoints have been released on Hugging Face under the identifier pattern mjbommar/mimelens-001-*.
Source: arXiv Generated at: 2026-06-04 00:00:00 UTC





