A multi-scale transformer framework with consistency and dual-view for depression detection

  • Dongfang HAN
  • Guo-Xing XIANG
  • Jingyu ZHU
  • Yuanyuan LIAO
  • Jihong ZHU
  • Askar HAMDULLA
  • Turdi TOHTI*
  • *Corresponding author for this work

Research output: Journal Publications › Journal Article (refereed) › peer-review

Abstract

Automatic depression detection (ADD) from multimodal interviews offers unobtrusive mental-health screening, yet current systems often miss faint behavioral cues, suffer under noisy inputs, and rely heavily on annotated data. We propose MSCDV, a transformer-based framework where three tightly integrated mechanisms form a unified learning loop. A multi-scale encoder captures local facial, vocal, and lexical nuances while maintaining global context; a perturbation-based consistency loss enforces alignment between clean and noised views, improving robustness; and a dual-view stepwise schedule presents complementary modality subsets in alternation, enhancing generalization under data scarcity. Because all components are co-optimized, improvements in one propagate through the system, producing features that are fine-grained, noise-aware, and data-efficient. On the DAIC-WOZ and E-DAIC benchmarks, MSCDV achieves mean F1-scores of 0.87 and 0.82, respectively, outperforming a wide range of state-of-the-art baselines without requiring additional supervision. Extensive ablations confirm that removing any single component results in performance degradation, underscoring their synergistic contribution to robust, real-world ADD.
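As a rough illustration of two ideas named in the abstract, the sketch below shows one common way to write a perturbation-based consistency term (the squared gap between predictions on a clean input and on a noise-perturbed copy) and a simple alternating schedule over modality subsets. It is not the MSCDV implementation: the logistic stand-in model, the Gaussian noise, the squared-error form, and every name here (`predict`, `add_noise`, `consistency_loss`, `dual_view_schedule`) are assumptions made for illustration only.

```python
import math
import random


def predict(weights, features):
    """Hypothetical stand-in model: logistic score of a weighted sum."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def add_noise(features, sigma, rng):
    """Build a perturbed view by adding zero-mean Gaussian noise (assumed noise model)."""
    return [x + rng.gauss(0.0, sigma) for x in features]


def consistency_loss(weights, batch, sigma, rng):
    """Mean squared gap between predictions on clean and noised views."""
    gaps = []
    for features in batch:
        clean = predict(weights, features)
        noised = predict(weights, add_noise(features, sigma, rng))
        gaps.append((clean - noised) ** 2)
    return sum(gaps) / len(gaps)


def dual_view_schedule(step, views):
    """Alternate between modality subsets, one subset per training step."""
    return views[step % len(views)]


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [0.3, -0.7]
    batch = [[0.2, -0.1], [1.0, 0.5]]
    # Hypothetical modality subsets; the paper's actual views are not given here.
    views = [("audio", "text"), ("video", "text")]
    for step in range(4):
        loss = consistency_loss(weights, batch, sigma=0.5, rng=rng)
        print(step, dual_view_schedule(step, views), round(loss, 6))
```

In a training loop of this kind, the consistency term would typically be added to the supervised loss with a weighting factor, so that predictions are encouraged to stay stable under input noise without needing extra labels.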
Original language: English
Article number: 108461
Number of pages: 13
Journal: Biomedical Signal Processing and Control
Volume: 112
Early online date: 18 Aug 2025
DOIs
Publication status: Published - Feb 2026

Bibliographical note

Publisher Copyright:
© 2025 Elsevier Ltd

Funding

This work was supported in part by the National Natural Science Foundation of China Project under Grant [62166042]; in part by the Natural Science Foundation of Xinjiang Uygur Autonomous Region of China under Grant [2021D01C076]; in part by the Innovation Program for Doctoral Students of Xinjiang University, China under Grant [XJU2024BS095]; and in part by the Tianshan Talents Cultivation Program – Leading Talents for Scientific and Technological Innovation, China under Grant [2024TSYCLJ0002].

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 3 - Good Health and Well-being

Keywords

  • Affective computing
  • Automatic depression detection (ADD)
  • Deep learning
  • Multimodal fusion (MMF)
  • Transformer
