arXiv

CoughSense: Five-Class Respiratory Disease Classification via Whisper Encoder Fine-Tuning and Dual-Encoder Cross-Attention Fusion with Balanced Contrastive Learning

June 3, 2026 · Nikhil Vincent

Abstract

Automated cough analysis offers a viable route to affordable respiratory screening, but current research is largely confined to binary COVID-19 detection. To build a practical diagnostic tool that distinguishes among multiple respiratory ailments from a single smartphone recording, we introduce CoughSense. The system classifies each cough sample into one of five classes: healthy, COVID-19, asthma or other respiratory conditions, bronchitis, and pneumonia.

Our approach uses a dataset of 18,301 recordings drawn from four public repositories: Coswara, CoughVID, Virufy, and the West China Hospital Pediatric Cough Dataset. We employ the OpenAI Whisper encoder as the backbone for disease classification. A key innovation in our architecture is active-frame QKV attention pooling, which restricts attention pooling to the first 200 of the 1,500 tokens produced by Whisper’s encoder. This mitigates the "silence-dilution" problem: a three-second cough occupies only 150 tokens, or 10% of the sequence, within Whisper’s 30-second input window, so pooling over all 1,500 tokens is dominated by silence.
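The token arithmetic and the pooling idea can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: it assumes a single learned query vector with separate key and value projections (the names `query`, `w_k`, `w_v`, and `active_frame_pool` are hypothetical), it uses random numbers in place of real Whisper encoder outputs, and it assumes the cough sits at the start of the 30-second window.

```python
import math
import random

ENCODER_TOKENS = 1500   # Whisper encoder output length for its 30-second window
WINDOW_SECONDS = 30.0
ACTIVE_FRAMES = 200     # pooling window stated in the abstract


def tokens_for(seconds):
    """Number of encoder tokens covered by `seconds` of audio (50 per second)."""
    return round(seconds * ENCODER_TOKENS / WINDOW_SECONDS)


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def active_frame_pool(tokens, query, w_k, w_v, active=ACTIVE_FRAMES):
    """Attention-pool only the first `active` tokens; later tokens are ignored."""
    kept = tokens[:active]
    keys = [matvec(w_k, t) for t in kept]
    values = [matvec(w_v, t) for t in kept]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    pooled = [sum(w * v[j] for w, v in zip(weights, values))
              for j in range(len(values[0]))]
    return pooled, weights


# Toy demo: random vectors stand in for real encoder outputs (assumption).
rng = random.Random(0)
DIM = 4
tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(ENCODER_TOKENS)]
query = [rng.gauss(0, 1) for _ in range(DIM)]
w_k = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]
w_v = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]

pooled, weights = active_frame_pool(tokens, query, w_k, w_v)
cough_tokens = tokens_for(3.0)
print(cough_tokens, cough_tokens / ENCODER_TOKENS, cough_tokens / ACTIVE_FRAMES)
```

Under these assumptions, a three-second cough covers 10% of the full 1,500-token sequence but 75% of the 200-token window, which is the intuition behind restricting the pooling range.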

To address severe class imbalance (ratios as high as 19:1) and domain shift across the four datasets, the training protocol combines several techniques: WeightedRandomSampler, SpecAugment, a supervised contrastive auxiliary loss, FiLM symptom conditioning, and gradient-reversal domain adaptation. We additionally apply Balanced Mixup with forced minority pairing to further stabilize training.
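Two of the imbalance-handling ideas can be illustrated with a small standard-library sketch. The abstract does not define the exact procedures, so everything here is an assumption: the weights are simple inverse class frequencies (the kind of per-sample weights one could pass to PyTorch's `WeightedRandomSampler`), "forced minority pairing" is interpreted as always drawing the Mixup partner from a minority class, and the class counts, the `threshold` rule, and the `alpha` value are hypothetical.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-sample weight 1 / class count, so every class has equal total weight."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


def minority_classes(labels, threshold=0.5):
    """Classes with fewer samples than `threshold` times the mean class size."""
    counts = Counter(labels)
    mean_size = sum(counts.values()) / len(counts)
    return {c for c, n in counts.items() if n < threshold * mean_size}


def balanced_mixup(features, labels, rng, alpha=0.4):
    """Mix every sample with a partner forced to come from a minority class."""
    minority = minority_classes(labels)
    pool = [i for i, y in enumerate(labels) if y in minority]
    mixed = []
    for x_i, y_i in zip(features, labels):
        j = rng.choice(pool)                  # partner index, always minority
        lam = rng.betavariate(alpha, alpha)   # mixing coefficient in [0, 1]
        x = [lam * a + (1 - lam) * b for a, b in zip(x_i, features[j])]
        target = Counter()                    # soft label over class names
        target[y_i] += lam
        target[labels[j]] += 1 - lam
        mixed.append((x, dict(target)))
    return mixed


# Hypothetical toy counts with a 19:1 ratio between largest and smallest class.
toy_counts = {"healthy": 38, "COVID-19": 20, "asthma": 10,
              "bronchitis": 4, "pneumonia": 2}
labels = [c for c, n in toy_counts.items() for _ in range(n)]
features = [[float(i), 1.0] for i in range(len(labels))]

rng = random.Random(0)
weights = inverse_frequency_weights(labels)
drawn = rng.choices(range(len(labels)), weights=weights, k=5000)
drawn_counts = Counter(labels[i] for i in drawn)   # roughly 1000 per class
mixed = balanced_mixup(features, labels, rng)
print(drawn_counts)
```

With inverse-frequency weights, each class is drawn about equally often despite the skewed counts, and every mixed example carries some probability mass on a minority class.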

The final architecture is a dual-encoder model that fuses the Whisper encoder with the OPERA-CT respiratory foundation model via cross-attention. The lightweight CoughSense variant, based on Whisper-tiny with 8.6 million parameters, achieved a balanced accuracy of 82.3% in five-fold cross-validation, with a macro-F1 score of 0.817 and an AUC of 0.941. These results outperformed an ImageNet-pretrained EfficientNet-B2 by 11.1 percentage points and a ViT trained from scratch by 29.6 percentage points. All five classes reached a recall above 74%, and four exceeded 80%. The dual-encoder configuration improved performance further, reaching a balanced accuracy of 85.4%. Ablation studies identified active-frame pooling as the most impactful single component, contributing a 5.1-point gain, which suggests it may be broadly useful for short-audio tasks that use Whisper as a backbone.
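For readers less familiar with the reported metrics, the sketch below computes per-class recall, balanced accuracy (the mean of per-class recalls), and macro-F1 (the unweighted mean of per-class F1 scores) from a confusion matrix. The matrix is a made-up toy example, not data from the paper, and the function names are illustrative.

```python
CLASSES = ["healthy", "COVID-19", "asthma/other", "bronchitis", "pneumonia"]

# Hypothetical confusion matrix: rows are true classes, columns are predictions.
TOY_CM = [
    [9, 1, 0, 0, 0],
    [1, 8, 1, 0, 0],
    [0, 1, 8, 1, 0],
    [0, 0, 1, 8, 1],
    [0, 0, 0, 2, 8],
]


def per_class_recall(cm):
    """Recall for class i is the diagonal count divided by the row total."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]


def balanced_accuracy(cm):
    """Mean of per-class recalls, so every class counts equally."""
    recalls = per_class_recall(cm)
    return sum(recalls) / len(recalls)


def macro_f1(cm):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    n = len(cm)
    scores = []
    for i in range(n):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp
        fp = sum(cm[r][i] for r in range(n)) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


for name, recall in zip(CLASSES, per_class_recall(TOY_CM)):
    print(f"{name}: recall {recall:.2f}")
print(f"balanced accuracy {balanced_accuracy(TOY_CM):.3f}")
print(f"macro-F1 {macro_f1(TOY_CM):.3f}")
```

Because both metrics average over classes rather than over samples, a model cannot score well by favoring the majority class, which is why they suit a dataset with heavy class imbalance.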
