arXiv

Retrieval and competition: how a protein foundation model starts a protein

Abstract:

Protein language models are increasingly used to guide clinical and experimental decisions, yet it remains unclear whether their high-confidence outputs reflect genuine biological recognition or merely the retrieval of statistical defaults. This study investigates the distinction using a near-universal biological rule: proteins begin with methionine. Specifically, we trace the computational mechanism that ESM2-8M uses to produce this prediction. Our analysis shows that the model does not identify methionine directly at the masked site. Instead, it uses a position-specific query, constructed across multiple layers, to retrieve a methionine-biasing signal from a reference representation stored at the beginning-of-sequence token. The final prediction arises from competition between this retrieval signal and context-dependent circuits.

To show how positional information reaches the final readout, we propose a norm-direction decomposition of attention scores within rotary frequency bands. We find that positional encoding acts through coordinated shifts in query norm and angular alignment distributed across these bands. Notably, on sequences whose actual N-terminus is not methionine, a case with significant biological implications, the model still predicts methionine. The methionine prediction is therefore not genuine recognition achieved through an unusual mechanism; it is the output of a positional-prior retrieval circuit that aligns with the statistical average and so fails when biological reality diverges from that average.

Differentiating between statistical retrieval and biological recognition requires granular analysis of individual circuits, frequency bands, and query compositions. This complexity suggests that mechanistic verification will be both essential and difficult for predictions involving higher biological stakes. Even in the case of the most basic biological rule, the model’s output is mediated by a distributed computational circuit rather than direct recognition. This finding implies that as task complexity increases, the correlation between model confidence and underlying biological evidence will become even more obscured.
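
The norm-direction decomposition mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula, so what follows is one natural reading rather than the authors' implementation: a dot-product attention logit between a rotated query q and key k is split, per rotary frequency band, into a query norm, a key norm, and the cosine of the angle between the two 2-D band vectors. The function names `band_decomposition` and `score_from_bands` are hypothetical, and pairing component i with component i + d/2 (the "rotate-half" layout) is an assumption about how bands are laid out.

```python
import math


def band_decomposition(q, k):
    """Split a dot-product logit into per-band (q_norm, k_norm, cos) terms.

    q, k: equal-length sequences of floats with even length d, assumed to be
    already rotated by the positional encoding.
    Assumption: band i is the 2-D pair (x[i], x[i + d/2]).
    """
    if len(q) != len(k) or len(q) % 2 != 0:
        raise ValueError("q and k must have the same even length")
    half = len(q) // 2
    bands = []
    for i in range(half):
        qx, qy = q[i], q[i + half]
        kx, ky = k[i], k[i + half]
        q_norm = math.hypot(qx, qy)
        k_norm = math.hypot(kx, ky)
        dot = qx * kx + qy * ky
        # The cosine is undefined for a zero-length band vector; report 0.0.
        cos = dot / (q_norm * k_norm) if q_norm > 0 and k_norm > 0 else 0.0
        bands.append((q_norm, k_norm, cos))
    return bands


def score_from_bands(bands):
    """Recombine the per-band terms into the unscaled attention logit."""
    return sum(q_norm * k_norm * cos for q_norm, k_norm, cos in bands)
```

Under this reading, a change in a band's query norm and a change in its angular alignment are separate contributions to the same logit: scaling q changes only the norm terms, while rotating a band changes only its cosine.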


Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
