Cognitive-Linguistic Indicators of Depression in Online Communities: Analysed by DistilBERT and Holographic Reduced Representation
Title: Leveraging Cognitive-Linguistic Markers of Depression in Online Discourse: A Hybrid DistilBERT and Holographic Reduced Representation Approach
Abstract:
This research explores the efficacy of integrating transformer-derived embeddings with cognitively informed linguistic attributes to enhance the automated identification of depression within online textual data. Grounded in Beck’s Cognitive Theory of Depression, the study quantifies specific cognitive distortions as measurable features, such as the prevalence of first-person pronouns, the usage of absolutist terminology, and the presence of negative sentiment. These features were extracted from Reddit posts sourced from both depression-focused and general control communities.
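As a minimal sketch of how such features might be quantified, the Python snippet below computes the per-token rate of first-person pronouns, absolutist words, and negative words in a post. The word lists and the function name `cognitive_features` are illustrative assumptions; the abstract does not state which lexicons or sentiment tool the study used.

```python
import re

# Illustrative word lists (assumptions: the source does not give its lexicons).
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
ABSOLUTIST = {"always", "never", "nothing", "everything", "completely", "totally"}
NEGATIVE = {"sad", "hopeless", "worthless", "tired", "alone", "empty", "hate"}


def cognitive_features(text):
    """Return the share of tokens that fall in each word list."""
    # Lowercase and keep alphabetic tokens (contractions such as "i'm" stay whole).
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens) or 1  # avoid division by zero on empty posts
    return {
        "first_person": sum(t in FIRST_PERSON for t in tokens) / n,
        "absolutist": sum(t in ABSOLUTIST for t in tokens) / n,
        "negative": sum(t in NEGATIVE for t in tokens) / n,
    }


example = cognitive_features("I always feel empty")
```

Normalising by token count keeps long and short posts comparable; a real pipeline would likely replace the toy negative-word list with a proper sentiment scorer.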
The analysis uses a subset of the Kaggle Reddit Suicide and Depression Detection dataset to evaluate two distinct classification frameworks. The first serves as a baseline, employing Naive Bayes on TF-IDF features. The second is a novel hybrid architecture that combines DistilBERT sentence embeddings with Holographic Reduced Representation (HRR) vectors, which encode the aforementioned cognitive-linguistic features, before passing them through a Logistic Regression classifier.
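The HRR step can be sketched roughly as follows, under stated assumptions. In HRR, a role vector and a filler vector are bound by circular convolution, and several bindings are superposed by addition. The abstract does not say how the scalar features are mapped to vectors, so the scheme here (scaling each role-filler binding by its feature value), the dimensionality `DIM = 64`, and all function names are hypothetical. The sentence embedding is a zero-filled placeholder of DistilBERT's hidden size (768) rather than real model output.

```python
import math
import random

DIM = 64         # HRR dimensionality: an assumed value, not given in the source
EMBED_DIM = 768  # DistilBERT hidden size


def random_vector(rng, dim=DIM):
    """Draw a vector with i.i.d. N(0, 1/dim) elements, the usual HRR choice."""
    sd = 1.0 / math.sqrt(dim)
    return [rng.gauss(0.0, sd) for _ in range(dim)]


def circular_convolution(a, b):
    """Bind two vectors: c[k] = sum_j a[j] * b[(k - j) mod n]."""
    n = len(a)
    return [sum(a[j] * b[(k - j) % n] for j in range(n)) for k in range(n)]


def encode_features(features, roles, fillers):
    """Superpose role-filler bindings, each scaled by its feature value.

    Scaling by the feature value is an assumption made for this sketch.
    """
    trace = [0.0] * DIM
    for name, value in features.items():
        bound = circular_convolution(roles[name], fillers[name])
        trace = [t + value * b for t, b in zip(trace, bound)]
    return trace


rng = random.Random(0)  # fixed seed so role and filler vectors are reproducible
names = ["first_person", "absolutist", "negative"]
roles = {name: random_vector(rng) for name in names}
fillers = {name: random_vector(rng) for name in names}

features = {"first_person": 0.25, "absolutist": 0.25, "negative": 0.25}
trace = encode_features(features, roles, fillers)

# Placeholder standing in for a DistilBERT sentence embedding.
sentence_embedding = [0.0] * EMBED_DIM

# Concatenate both views; this vector would feed the Logistic Regression classifier.
hybrid_input = sentence_embedding + trace
```

The pure-Python convolution is quadratic in `DIM` and is written for readability; an FFT-based implementation would be the practical choice at larger dimensions.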
The hybrid model outperformed the baseline on every reported metric. It attained a macro F1 score of 0.94, compared with 0.80 for the TF-IDF baseline. Under 5-fold cross-validation, the hybrid approach improved the F1 score from 0.83 to 0.92 and the Area Under the Curve (AUC) from 0.958 to 0.981.
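For readers unfamiliar with the headline metric, macro F1 is the unweighted mean of the per-class F1 scores, so the depression and control classes count equally regardless of their sizes. The sketch below shows the computation on made-up labels; the function name `macro_f1` and the toy data are illustrative and are not drawn from the study.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, with F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (1 = depression-related post, 0 = control); not data from the study.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
score = macro_f1(y_true, y_pred)  # class 1 F1 = 2/3, class 0 F1 = 0.8, mean ~ 0.733
```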
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





