arXiv

Multilingual Long-Form Speech Instruction Following: KIT's Submission to IWSLT 2026

June 4, 2026 · Enes Yavuz Ugan, Maike Züfle, Yuka Ko, Supriti Sinhamahapatra, Fabian Retkowski, Seymanur Akti, Jan Niehues, Alexander Waibel

Title: KIT’s Approach to Multilingual Long-Form Speech Instruction Following for IWSLT 2026

Abstract: The emergence of Large Language Models has shifted the paradigm from single-task and token-based multi-task architectures to instruction-driven systems. These newer models implicitly deduce the target language and specific task requirements directly from natural language prompts. This evolution is evident in the IWSLT Instruction Following Track, which this year expanded its scope by introducing novel challenges, including an unforeseen "surprise" task designed to test robustness and prevent overfitting to previously seen examples. This paper details KIT’s entry into the unconstrained Long and Short Instruction Following tracks. Our methodology employs a comprehensive data augmentation strategy that transforms short-form corpora into long-form training sets. This process involves concatenating segments, generating labels via LLMs, and applying cross-lingual translation, ultimately producing a dataset exceeding one million instances spanning four languages and six distinct tasks. Additionally, we demonstrate that while likelihood-based re-ranking is highly successful for Automatic Speech Recognition (ASR), it systematically harms performance on semantic tasks. This degradation occurs because the model spuriously favors candidates derived from segmented audio processing rather than holistic long-form inference. We resolve this issue by integrating likelihood scores with Minimum Bayes Risk decoding.
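To make the last point concrete, here is a minimal sketch of one way likelihood scores could be combined with Minimum Bayes Risk decoding when re-ranking a candidate list. It is not the paper's implementation: the abstract does not say how the two scores are integrated, so the linear interpolation, the `weight` parameter, the min-max normalization, and the token-overlap F1 utility are all assumptions made for illustration, as are the function names.

```python
from collections import Counter


def token_f1(hyp: str, ref: str) -> float:
    """Token-overlap F1, used here as a stand-in utility (an assumption)."""
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return float(h == r)
    overlap = sum((Counter(h) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(h), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def mbr_scores(cands: list[str]) -> list[float]:
    """Expected utility of each candidate against all other candidates."""
    n = len(cands)
    if n < 2:
        return [0.0] * n
    return [
        sum(token_f1(c, o) for j, o in enumerate(cands) if j != i) / (n - 1)
        for i, c in enumerate(cands)
    ]


def min_max(values: list[float]) -> list[float]:
    """Rescale to [0, 1] so the two score types are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def rerank(cands: list[str], logprobs: list[float], weight: float = 0.5) -> str:
    """Pick the candidate maximizing weight * likelihood + (1 - weight) * MBR.

    weight=1.0 is pure likelihood re-ranking; weight=0.0 is pure MBR.
    The interpolation scheme is hypothetical, not taken from the paper.
    """
    if not cands or len(cands) != len(logprobs):
        raise ValueError("need one log-probability per candidate")
    likelihood = min_max(logprobs)
    utility = min_max(mbr_scores(cands))
    combined = [
        weight * l + (1 - weight) * u for l, u in zip(likelihood, utility)
    ]
    best = max(range(len(cands)), key=combined.__getitem__)
    return cands[best]


candidates = [
    "the talk covers speech translation",
    "the talk covers speech translation systems",
    "talk",
]
log_likelihoods = [-4.0, -5.0, -1.0]  # made-up scores for illustration
choice = rerank(candidates, log_likelihoods, weight=0.5)
```

In this toy list the short candidate has the highest likelihood, so pure likelihood re-ranking (`weight=1.0`) would select it, while the consensus term pulls the choice back toward the hypotheses that agree with each other. This only illustrates the general mechanism; the bias the abstract describes, toward candidates from segmented audio processing, is a different failure case.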

Source: arXiv Generated at: 2026-06-04 00:00:00 UTC