Abstract
Recent medical vision–language models enable zero- and few-shot transfer, yet they still depend on handcrafted prompts and task-specific heads. To address these limitations, we introduce Weakly Semantic-aware Task Conditioning (WSTC), a lightweight, plug-and-play framework that endows frozen vision–language models with strong adaptability across classification, retrieval, and segmentation. WSTC contains two modules: (i) the Weakly Semantic-Aware Module, which distills task cues from image patches into image-derived semantic tokens, without text-side prompt engineering or prompt tuning, and (ii) the Task-Conditioned Dynamic Alignment Module, which dynamically generates projection matrices to align vision–language embeddings under task guidance. Built on a frozen UniMedCLIP backbone, WSTC adapts to new tasks without tuning the backbone and requires only a moderate trainable parameter budget for adaptation. Across both zero-shot and few-shot regimes, our method, UniMedCLIP augmented with WSTC, consistently outperforms strong frozen vision–language models (CLIP, MedCLIP, BioViL, PubMedCLIP, UniMedCLIP) and further surpasses adapter baselines such as Tip-Adapter and Meta-Adapter. These results highlight WSTC as a practical solution for scalable medical vision–language models under extreme data scarcity.
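The two modules named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, which has not been released; every name here (`semantic_tokens`, `dynamic_projection`, `align_scores`, `w_hyper`) is hypothetical, and the specific choices (attention pooling over patches, an identity-plus-residual projection generated by a single linear map) are assumptions made only to show the general idea of deriving tokens from image patches and using them to generate an alignment projection.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def semantic_tokens(patches, queries):
    """Pool N patch embeddings (N, d) into K tokens (K, d).

    Assumption: K query vectors attend over the patches; the abstract
    does not specify how the tokens are actually distilled.
    """
    d = patches.shape[-1]
    attn = softmax(queries @ patches.T / np.sqrt(d), axis=-1)  # (K, N)
    return attn @ patches                                      # (K, d)

def dynamic_projection(tokens, w_hyper):
    """Generate a (d, d) projection matrix from the tokens.

    Assumption: the mean token is mapped by a linear hypernetwork
    w_hyper of shape (d, d*d) to a residual added to the identity.
    """
    d = tokens.shape[-1]
    task_vec = tokens.mean(axis=0)              # (d,)
    delta = (task_vec @ w_hyper).reshape(d, d)  # generated residual
    return np.eye(d) + delta

def align_scores(image_emb, text_embs, proj):
    """Cosine similarity between the projected image and each text embedding."""
    img = proj @ image_emb
    img = img / np.linalg.norm(img)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return txt @ img                            # (num_texts,)

# Toy usage with random stand-ins for frozen-backbone outputs.
rng = np.random.default_rng(0)
d = 8
patches = rng.normal(size=(16, d))   # patch embeddings from a frozen image encoder
queries = rng.normal(size=(4, d))    # trainable in a real setup
w_hyper = 0.01 * rng.normal(size=(d, d * d))
tokens = semantic_tokens(patches, queries)
proj = dynamic_projection(tokens, w_hyper)
scores = align_scores(patches.mean(axis=0), rng.normal(size=(3, d)), proj)
```

In this sketch only `queries` and `w_hyper` would be trained, which mirrors the abstract's point that the backbone stays frozen; the actual parameterization in WSTC may differ.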
| Original language | English |
|---|---|
| Article number | 133008 |
| Journal | Neurocomputing |
| Volume | 676 |
| Early online date | 13 Feb 2026 |
| Publication status | E-pub ahead of print - 13 Feb 2026 |
Bibliographical note
Publisher Copyright: © 2026
Funding
This work was supported by the Hunan Provincial Department of Education Outstanding Youth Project (25B0451) and the Fundamental Research Funds for the Central Universities (Grant No. 2025SMECP05). The implementation will be released upon acceptance.
Keywords
- Conditional alignment
- Medical vision–language models
- Parameter-efficient adaptation
- Task conditioning
- Zero-/few-shot learning
Fingerprint
Dive into the research topics of 'WSTC: Task-adaptive medical vision–language model with semantic tokens and dynamic alignment'. Together they form a unique fingerprint.