TY - JOUR
T1 - Efficiently Integrate Large Language Models with visual perception: A survey from the training paradigm perspective
AU - MA, Xiaorui
AU - XIE, Haoran
AU - QIN, S. Joe
N1 - Publisher Copyright: © 2025
PY - 2026/1
Y1 - 2026/1
AB - Integrating Large Language Models (LLMs) with visual modalities has become a central focus in multimodal AI. However, the high computational cost associated with Vision Large Language Models (VLLMs) limits their accessibility, restricting broader use across research communities and real-world deployments. Based on a comprehensive review of 36 high-quality image-text VLLMs, this survey categorizes vision integration into three training paradigms, each employing distinct approaches to improve parameter efficiency. Single-stage Tuning combines pretraining with few-shot learning and achieves strong generalization using minimal labeled data by training only the Modality Integrator (MI). Two-stage Tuning enhances performance through instruction tuning, multi-task learning, or reinforcement learning while improving efficiency via selective MI training, reparameterization modules, and lightweight LLMs. Direct Adaptation skips pretraining and directly finetunes the model on vision-language tasks, achieving efficiency by embedding lightweight MIs into frozen LLMs. These training paradigms have enabled practical applications in areas such as visual assistance, mobile device deployment, medical analysis, agricultural monitoring, and autonomous driving under resource constraints. Despite these advances, each paradigm faces distinct limitations: Single-stage Tuning struggles with few-shot transfer, Two-stage Tuning remains computationally expensive, and Direct Adaptation shows limited generalization ability. Correspondingly, future progress will require more effective pretraining strategies for better few-shot transfer in Single-stage Tuning, optimized use of lightweight LLMs in Two-stage Tuning, and broader adoption of instruction tuning in Direct Adaptation to improve generalization under resource constraints.
KW - Multimodal
KW - Large Language Model
KW - Vision-language model
KW - Parameter-efficient learning
KW - Instruction tuning
KW - Reinforcement learning
UR - https://www.scopus.com/pages/publications/105009512034
U2 - 10.1016/j.inffus.2025.103419
DO - 10.1016/j.inffus.2025.103419
M3 - Journal Article (refereed)
SN - 1566-2535
VL - 125
JO - Information Fusion
JF - Information Fusion
M1 - 103419
ER -