arXiv

MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments

Abstract

Accurate file-type classification underpins many operational workflows, including forensic carving, packet inspection, storage indexing, and malware triage. Existing learned systems such as Google’s Magika, however, assume access to the whole file at a known offset. They therefore fail on the fragmented inputs typical of these tasks: chunked uploads, random disk blocks, header-less carved fragments, and individual packet payloads.
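To make the failure mode concrete, here is a toy sketch of header-dependent detection. It is not how libmagic or Magika work internally; the `SIGNATURES` table and the `detect_by_header` helper are hypothetical names introduced only for illustration. It shows that a detector keyed to bytes at offset 0 has nothing to match once it is handed a chunk from the middle of the same file.

```python
# Toy signature table: a few well-known magic numbers found at offset 0.
# Hypothetical and deliberately tiny; real detectors use far richer rules.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",
}


def detect_by_header(chunk: bytes) -> str:
    """Return a MIME label only if the chunk starts with a known signature."""
    for magic, mime in SIGNATURES.items():
        if chunk.startswith(magic):
            return mime
    return "application/octet-stream"  # generic fallback: type unknown


# Stand-in for a PNG file: the real signature followed by filler bytes.
fake_png = b"\x89PNG\r\n\x1a\n" + bytes(512)

head = fake_png[:64]        # chunk taken at the known offset 0
middle = fake_png[256:320]  # chunk taken from the middle of the file

head_label = detect_by_header(head)      # signature is present
middle_label = detect_by_header(middle)  # signature is absent
```

Here `head_label` is `image/png` while `middle_label` falls back to `application/octet-stream`, even though both chunks come from the same file.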

To address this, we present MimeLens, a suite of compact BERT-style encoders pretrained on byte windows sampled at uniformly random offsets within each file, so that the models never rely on privileged head-of-file position. The suite includes both standard- and short-context variants. MimeLens accepts a byte chunk from any location within a file, requires neither headers nor a fixed input size, and outputs one of 125 libmagic MIME labels.
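The random-offset sampling described above can be sketched as follows. This is a minimal illustration, not the authors' training pipeline: the function name `sample_window`, the fixed `window` length, and the choice to return a short file whole are all assumptions made for the example.

```python
import random


def sample_window(data: bytes, window: int, rng: random.Random) -> tuple[int, bytes]:
    """Return (offset, chunk), with offset uniform over all valid start positions.

    Assumption: files no longer than `window` are returned whole from offset 0.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(data) <= window:
        return 0, data
    # randrange(n) draws uniformly from 0..n-1, so every start position
    # that leaves room for a full window is equally likely.
    offset = rng.randrange(len(data) - window + 1)
    return offset, data[offset:offset + window]


rng = random.Random(0)        # seeded only to make the example reproducible
blob = bytes(range(256)) * 4  # 1024-byte stand-in for a file's contents
offset, chunk = sample_window(blob, 64, rng)
```

Because the offset is uniform, the head of the file is no more likely to be drawn than any other position, so a model trained on such windows cannot depend on seeing a header.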

On the clean heads of complete files, MimeLens outperforms Magika v1.1 by 10.7 percentage points in top-1 accuracy on libmagic-labeled data. It also classifies inputs on which Magika fails, such as the content of a single mid-stream UDP packet. On random mid-file disk blocks, MimeLens is more than twice as accurate as both libmagic and Magika.

The primary trade-off is latency: on CPUs, MimeLens is approximately one to two orders of magnitude slower per sample than Magika, although it reaches performance parity on consumer GPUs or in batch processing. All trained checkpoints are released on Hugging Face as mjbommar/mimelens-001-*.


Source: arXiv. Generated at: 2026-06-04 00:00:00 UTC
