Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space
Abstract:
This paper introduces Echo, a conceptual audio framework centered on a single 25-million-parameter Vision Transformer (ViT) encoder. Following pretraining via a Joint-Embedding Predictive Architecture (JEPA) objective, the model undergoes stage-specific specialization to encode speaker identity, phonetic information, and dynamic source routing within a unified 512-dimensional latent space. Notably, this architecture requires no per-task fine-tuning during deployment. Task-specific lightweight heads are employed to manage diarization (utilizing ArcFace and VBx) and dynamic source separation (achieved through null-target K-set prediction).
Evaluations on synthetic VoxCeleb2 mixtures, where the number of speakers (K) is unknown, demonstrate that the canonical stack achieves a blind Diarization Error Rate (DER) of 15.00%, a Permutation Invariant Training (PIT) separation accuracy of 97.80%, and a latent Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) improvement of +9.52 dB. Furthermore, it exhibits a +53.50-point gap in speaker/content factorization on a held-out k-Nearest Neighbors (k-NN) probe. The primary contribution of Echo is not the establishment of new state-of-the-art (SOTA) benchmarks for individual tasks, but rather the successful joint coexistence of three distinct tasks within a single encoder of this specific footprint. We detail the design process incrementally, highlight unsuccessful experimental paths, and identify the structural limitations imposed by the Vector Quantization (VQ) bottleneck on end-to-end Automatic Speech Recognition (ASR), which currently constrains this proof-of-concept.
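For readers unfamiliar with the SI-SDR improvement metric quoted above, the following is a minimal sketch of the standard scale-invariant SDR definition applied to generic 1-D vectors. It is not the authors' evaluation code: the function names `si_sdr` and `si_sdr_improvement`, the `eps` stabilizer, and the use of raw NumPy vectors (rather than the paper's 512-dimensional latents) are all assumptions made for illustration.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D vectors (standard definition)."""
    estimate = np.asarray(estimate, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Remove the mean so a constant offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target; alpha is the optimal scaling factor.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target
    # Everything not explained by the scaled target counts as distortion.
    distortion = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(distortion, distortion) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_improvement(estimate, mixture, target):
    """Gain in dB of the separated estimate over the unprocessed mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)


# Illustrative usage with synthetic data (hypothetical, not from the paper).
rng = np.random.default_rng(0)
target = rng.standard_normal(1000)
interferer = rng.standard_normal(1000)
mixture = target + interferer            # unprocessed two-source mixture
estimate = target + 0.1 * interferer     # a cleaner (but imperfect) estimate
improvement = si_sdr_improvement(estimate, mixture, target)
```

A positive improvement means the estimate is closer to the target than the raw mixture was, independent of any overall gain applied to the estimate.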
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




