WSTC: Task-adaptive medical vision–language model with semantic tokens and dynamic alignment

  • Xiaolan GAO
  • Jiaorao WANG
  • Dan YANG*
  • *Corresponding author for this work

Research output: Journal Publications, Journal Article (refereed), peer-review

Abstract

Recent medical vision–language models enable zero- and few-shot transfer, yet still depend on handcrafted prompts and task-specific heads. To address these limitations, we introduce Weakly Semantic-aware Task Conditioning (WSTC), a lightweight, plug-and-play framework that endows frozen vision–language models with strong adaptability across classification, retrieval, and segmentation. WSTC contains two modules: (i) the Weakly Semantic-Aware Module, which distills task cues from image patches into image-derived semantic tokens without text-side prompt engineering or prompt tuning, and (ii) the Task-Conditioned Dynamic Alignment Module, which dynamically generates projection matrices to align vision–language embeddings under task guidance. Built on a frozen UniMedCLIP backbone, WSTC adapts to new tasks without tuning the backbone and requires a moderate budget of trainable parameters for adaptation. In both zero-shot and few-shot regimes, our method, UniMedCLIP augmented with WSTC, consistently outperforms strong frozen vision–language models (CLIP, MedCLIP, BioViL, PubMedCLIP, UniMedCLIP) and further surpasses adapter baselines such as Tip-Adapter and Meta-Adapter. These results highlight WSTC as a practical solution for scalable medical vision–language models under extreme data scarcity.
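
The abstract does not specify how the two modules are implemented, so the following is only a rough illustrative sketch, not the authors' method. It assumes (hypothetically) that semantic tokens are pooled from patch features by cross-attention with learned queries, and that the dynamic projection is a low-rank residual matrix produced by a small hypernetwork from the token summary. All function names, shapes, and the random arrays standing in for frozen-backbone features are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def semantic_tokens(patches, queries):
    """Assumed mechanism: K query vectors attend over N patch features."""
    d = patches.shape[-1]
    attn = softmax(queries @ patches.T / np.sqrt(d), axis=-1)  # (K, N)
    return attn @ patches                                      # (K, d)

def dynamic_projection(tokens, w_hyper, rank):
    """Assumed hypernetwork: token summary -> low-rank residual projection."""
    d = tokens.shape[-1]
    cond = tokens.mean(axis=0)                                 # (d,) task summary
    a, b = np.tanh(cond @ w_hyper).reshape(2, rank, d)         # two (rank, d) factors
    return np.eye(d) + a.T @ b / rank                          # (d, d), near identity

# Stand-ins for frozen-backbone outputs (random here, purely illustrative).
n_patches, dim, n_tokens, rank, n_classes = 49, 32, 4, 2, 3
patches = rng.normal(size=(n_patches, dim))
text_emb = l2_normalize(rng.normal(size=(n_classes, dim)))

# The only trainable parts in this sketch: the queries and hypernetwork weights.
queries = rng.normal(scale=0.1, size=(n_tokens, dim))
w_hyper = rng.normal(scale=0.02, size=(dim, 2 * rank * dim))

tokens = semantic_tokens(patches, queries)
proj = dynamic_projection(tokens, w_hyper, rank)

# Align the pooled image embedding, then score it against class text embeddings.
image_emb = l2_normalize(patches.mean(axis=0) @ proj)
probs = softmax(text_emb @ image_emb)
```

In this sketch the backbone features are never updated; only `queries` and `w_hyper` would be trained, which is one way a frozen model could be adapted within a small parameter budget.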

Original language: English
Article number: 133008
Journal: Neurocomputing
Volume: 676
Early online date: 13 Feb 2026
Publication status: E-pub ahead of print - 13 Feb 2026

Bibliographical note

Publisher Copyright:
© 2026

Funding

This work was supported by the Hunan Provincial Department of Education Outstanding Youth Project (25B0451) and Fundamental Research Funds for the Central Universities (Grant No. 2025SMECP05). The implementation will be released upon acceptance.

Keywords

  • Conditional alignment
  • Medical vision–language models
  • Parameter-efficient adaptation
  • Task conditioning
  • Zero-/few-shot learning
