Global News Digest

arXiv

Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space

Abstract:

This paper introduces Echo, a conceptual audio framework centered on a single 25-million-parameter Vision Transformer (ViT) encoder. Following pretraining via a Joint-Embedding Predictive Architecture (JEPA) objective, the model undergoes stage-specific specialization to encode speaker identity, phonetic information, and dynamic source routing within a unified 512-dimensional latent space. Notably, this architecture requires no per-task fine-tuning during deployment. Task-specific lightweight heads are employed to manage diarization (utilizing ArcFace and VBx) and dynamic source separation (achieved through null-target K-set prediction).
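For readers unfamiliar with the ArcFace head named above, the following is a minimal sketch of the additive angular margin it is generally built on. It is not taken from the paper: the function names, the `margin` and `scale` defaults, and the toy two-dimensional vectors are all assumptions chosen for illustration, and production implementations add safeguards for angles near pi that are omitted here.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosines."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def arcface_logits(embedding, class_centers, label, margin=0.5, scale=30.0):
    """Return scaled cosine logits with an angular margin on the true class.

    margin and scale are illustrative defaults, not values from the paper.
    """
    unit_embedding = l2_normalize(embedding)
    logits = []
    for index, center in enumerate(class_centers):
        unit_center = l2_normalize(center)
        cosine = sum(a * b for a, b in zip(unit_embedding, unit_center))
        cosine = max(-1.0, min(1.0, cosine))  # guard acos against rounding
        if index == label:
            # Penalize the true class by widening its angle by `margin`.
            cosine = math.cos(math.acos(cosine) + margin)
        logits.append(scale * cosine)
    return logits


# Toy example: a 2-D embedding aligned with class 0.
example_logits = arcface_logits([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], label=0)
```

The margin lowers the true-class logit during training, which pushes embeddings of the same speaker closer together in angle than a plain softmax would require.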

Evaluations on synthetic VoxCeleb2 mixtures, where the number of speakers (K) is unknown, demonstrate that the canonical stack achieves a blind Diarization Error Rate (DER) of 15.00%, a Permutation Invariant Training (PIT) separation accuracy of 97.80%, and a latent Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) improvement of +9.52 dB. Furthermore, it exhibits a +53.50-point gap in speaker/content factorization on a held-out k-Nearest Neighbors (k-NN) probe. The primary contribution of Echo is not the establishment of new state-of-the-art (SOTA) benchmarks for individual tasks, but rather the successful joint coexistence of three distinct tasks within a single encoder of this specific footprint. We detail the design process incrementally, highlight unsuccessful experimental paths, and identify the structural limitations imposed by the Vector Quantization (VQ) bottleneck on end-to-end Automatic Speech Recognition (ASR), which currently constrains this proof-of-concept.
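As background for the separation figures quoted above, here is a minimal sketch of the standard SI-SDR definition, its improvement over an unprocessed mixture, and a brute-force permutation-invariant matching of estimates to references. The abstract does not say how its "latent" SI-SDR is computed, so this sketch assumes plain equal-length numeric sequences; all function names and the toy data are hypothetical.

```python
import math
from itertools import permutations


def si_sdr(estimate, target, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    # Project the estimate onto the target to remove any gain mismatch.
    gain = dot / (target_energy + eps)
    projection = [gain * t for t in target]
    residual = [e - p for e, p in zip(estimate, projection)]
    signal_energy = sum(p * p for p in projection)
    residual_energy = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_energy + eps) / (residual_energy + eps))


def si_sdr_improvement(estimate, mixture, target):
    """Gain in SI-SDR of the estimate over the unprocessed mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)


def best_permutation(estimates, targets):
    """Try every ordering; estimates[perm[i]] is paired with targets[i]."""
    if len(estimates) != len(targets):
        raise ValueError("need one estimate per target")
    best_perm, best_score = None, -math.inf
    for perm in permutations(range(len(estimates))):
        total = sum(si_sdr(estimates[p], targets[i]) for i, p in enumerate(perm))
        mean_score = total / len(targets)
        if mean_score > best_score:
            best_perm, best_score = perm, mean_score
    return best_perm, best_score


# Toy data: the two estimates come out in swapped order with different gains.
toy_targets = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
toy_estimates = [[0.1, 2.0, 0.0, -2.0], [0.5, 0.0, 0.5, 0.1]]
toy_perm, toy_score = best_permutation(toy_estimates, toy_targets)
```

The exhaustive search costs K! evaluations, which is acceptable for a handful of speakers; larger K is usually handled with an assignment solver instead.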


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC

Related Articles

Schroders Renewable Unit Targets AI Assets as Power Demand Soars
Bloomberg

Schroders’ renewable unit targets AI infrastructure, pivoting to meet soaring energy demand from artificial intelligence...

State Street's Paglia on SBI Group Partnership, ETFs
Bloomberg

State Street's Paglia discusses the SBI Group partnership and ETFs.

Nvidia Boss Says Workers Should Be Paid ‘as Much as Possible’
Bloomberg

Nvidia CEO Jensen Huang advocates for paying workers “as much as possible,” emphasizing maximum compensation. This stanc...

TSE Talking With Regulator For Easing ETF Listing Rules
Bloomberg

The Tokyo Stock Exchange is in talks with regulators about easing ETF listing rules. This aims to simplify market access an...

S&P DJI CEO on Japan Markets, Mega IPOs
Bloomberg

S&P DJI CEO discusses Japan's financial markets and major IPOs.