
Efficiently Integrate Large Language Models with visual perception: A survey from the training paradigm perspective

Research output: Journal Publications › Journal Article (refereed) › peer-review

Abstract

Integrating Large Language Models (LLMs) with visual modalities has become a central focus in multimodal AI. However, the high computational cost associated with Vision Large Language Models (VLLMs) limits their accessibility, restricting broader use across research communities and real-world deployments. Based on a comprehensive review of 36 high-quality image-text VLLMs, this survey categorizes vision integration into three training paradigms, each employing distinct approaches to improve parameter efficiency. Single-stage Tuning combines pretraining with few-shot learning and achieves strong generalization using minimal labeled data by training only the Modality Integrator (MI). Two-stage Tuning enhances performance through instruction tuning, multi-task learning, or reinforcement learning while improving efficiency via selective MI training, reparameterization modules, and lightweight LLMs. Direct Adaptation skips pretraining and directly finetunes the model on vision-language tasks, achieving efficiency by embedding lightweight MIs into frozen LLMs. These training paradigms have enabled practical applications in areas such as visual assistance, mobile device deployment, medical analysis, agricultural monitoring, and autonomous driving under resource constraints. Despite these advances, each paradigm faces distinct limitations: Single-stage Tuning struggles with few-shot transfer, Two-stage Tuning remains computationally expensive, and Direct Adaptation shows limited generalization ability. Correspondingly, future progress will require more effective pretraining strategies for better few-shot transfer in Single-stage Tuning, optimized use of lightweight LLMs in Two-stage Tuning, and broader adoption of instruction tuning in Direct Adaptation to improve generalization under resource constraints.
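
To make the parameter-efficiency idea in the abstract concrete, the sketch below illustrates the pattern shared across the surveyed paradigms: a small trainable Modality Integrator (MI) bridges a frozen vision encoder and a frozen LLM, so only the MI's parameters are updated. This is a minimal, hypothetical sketch; the module names, dimensions, two-layer MLP projector, and placeholder backbones are assumptions for illustration, not the survey's reference implementation.

```python
import torch
import torch.nn as nn

class ModalityIntegrator(nn.Module):
    """Lightweight projector mapping vision features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(vision_feats)

def freeze(module: nn.Module) -> None:
    # Disable gradients so the backbone stays fixed during training.
    for p in module.parameters():
        p.requires_grad = False

# Toy stand-ins for pretrained backbones (assumed dimensions, illustrative only).
vision_encoder = nn.Linear(768, 768)    # placeholder for e.g. a ViT image encoder
llm_backbone = nn.Linear(4096, 4096)    # placeholder for a decoder-only LLM
freeze(vision_encoder)
freeze(llm_backbone)

# Only the Modality Integrator receives gradient updates.
mi = ModalityIntegrator(vision_dim=768, llm_dim=4096)
optimizer = torch.optim.AdamW(mi.parameters(), lr=1e-4)

trainable = sum(p.numel() for p in mi.parameters() if p.requires_grad)
frozen = sum(p.numel() for m in (vision_encoder, llm_backbone) for p in m.parameters())
print(f"trainable (MI) params: {trainable}, frozen backbone params: {frozen}")
```

In all three paradigms discussed in the survey, this kind of frozen-backbone setup is what keeps the trainable parameter count small; the paradigms differ mainly in when the MI is trained (with or without a pretraining stage) and in what additional components, such as reparameterization modules or lightweight LLMs, are introduced.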
Original language: English
Article number: 103419
Journal: Information Fusion
Volume: 125
Early online date: 24 Jun 2025
DOIs
Publication status: Published - Jan 2026

Bibliographical note

Publisher Copyright:
© 2025

Funding

The research described in this article has been supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (R1015-23); the Faculty Research Grant (SDS24A8) and the Direct Grant (DR25E8) of Lingnan University, Hong Kong; and the 2023 Nanjing International/Hong Kong, Macao, and Taiwan Science and Technology Cooperation Program (Joint Research) (202308010).

Keywords

  • Multimodal
  • Large Language Model
  • Vision-language model
  • Parameter-efficient learning
  • Instruction tuning
  • Reinforcement learning
