Abstract
Automatic depression detection (ADD) from multimodal interviews offers unobtrusive mental-health screening, yet current systems often miss faint behavioral cues, suffer under noisy inputs, and rely heavily on annotated data. We propose MSCDV, a transformer-based framework in which three tightly integrated mechanisms form a unified learning loop. A multi-scale encoder captures local facial, vocal, and lexical nuances while maintaining global context; a perturbation-based consistency loss enforces alignment between clean and noise-perturbed views, improving robustness; and a dual-view stepwise schedule presents complementary modality subsets in alternation, enhancing generalization under data scarcity. Because all components are co-optimized, improvements in one propagate through the system, producing features that are fine-grained, noise-aware, and data-efficient. On the DAIC-WOZ and E-DAIC benchmarks, MSCDV achieves mean F1-scores of 0.87 and 0.82, respectively, outperforming a wide range of state-of-the-art baselines without requiring additional supervision. Extensive ablations confirm that removing any single component degrades performance, underscoring their synergistic contribution to robust, real-world ADD.
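To make the consistency and dual-view ideas concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state the form of the perturbation, the distance used by the consistency loss, or the modality subsets, so the additive Gaussian noise, the mean squared error, the toy linear encoder, and the names `toy_encoder`, `perturb`, `consistency_loss`, and `active_view` are all assumptions chosen for illustration.

```python
import random

def toy_encoder(features, weights):
    """Hypothetical stand-in for the encoder: a single linear projection."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def perturb(features, sigma, rng):
    """Assumed perturbation: additive zero-mean Gaussian noise."""
    return [x + rng.gauss(0.0, sigma) for x in features]

def consistency_loss(clean_repr, noisy_repr):
    """Assumed distance: mean squared error between the two representations."""
    return sum((c - n) ** 2 for c, n in zip(clean_repr, noisy_repr)) / len(clean_repr)

def active_view(step, views):
    """Assumed schedule: cycle through the modality subsets step by step."""
    return views[step % len(views)]

rng = random.Random(0)
weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.4]]  # toy parameters, not learned
features = [1.0, 2.0, 3.0]                      # toy input features

# Encode a clean view and a noise-perturbed view of the same input,
# then penalize the disagreement between the two representations.
clean = toy_encoder(features, weights)
noisy = toy_encoder(perturb(features, sigma=0.1, rng=rng), weights)
loss = consistency_loss(clean, noisy)

# Hypothetical complementary modality subsets presented in alternation.
views = [("audio", "text"), ("video", "text")]
schedule = [active_view(step, views) for step in range(4)]
```

In a training loop, a term like `loss` would typically be added to the supervised objective with a weighting factor, and `active_view` would select which modality subset the model sees at each step; both choices are illustrative here.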
| Original language | English |
|---|---|
| Article number | 108461 |
| Number of pages | 13 |
| Journal | Biomedical Signal Processing and Control |
| Volume | 112 |
| Early online date | 18 Aug 2025 |
| Publication status | Published - Feb 2026 |
Bibliographical note
Publisher Copyright: © 2025 Elsevier Ltd
Funding
This work was supported in part by the National Natural Science Foundation of China Project under Grant [62166042]; in part by the Natural Science Foundation of Xinjiang Uygur Autonomous Region of China under Grant [2021D01C076]; in part by the Innovation Program for Doctoral Students of Xinjiang University, China under Grant [XJU2024BS095]; and in part by the Tianshan Talents Cultivation Program – Leading Talents for Scientific and Technological Innovation, China under Grant [2024TSYCLJ0002].
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 3 Good Health and Well-being
Keywords
- Affective computing
- Automatic depression detection (ADD)
- Deep learning
- Multimodal fusion (MMF)
- Transformer
Fingerprint
Dive into the research topics of 'A multi-scale transformer framework with consistency and dual-view for depression detection'. Together they form a unique fingerprint.