Text data augmentation for large language models: a comprehensive survey of methods, challenges, and opportunities

Research output: Journal Publications › Journal Article (refereed) › peer-review

Abstract

Pre-trained language models of increasing size and complexity have demonstrated superior performance in many applications, but they usually require large training datasets to be trained adequately. Insufficient training data can lead a model to overfit and fail to cope with complex tasks. Large language models (LLMs) trained on extensive corpora have strong text generation capabilities, which can improve both the quality and quantity of data and therefore play a crucial role in data augmentation. Specifically, task-specific prompt templates guide LLMs in generating the required content for personalised tasks. More recently, promising retrieval-based techniques have further enhanced the performance of LLMs in data augmentation by introducing external knowledge, enabling them to produce more grounded data. This survey provides an in-depth analysis of data augmentation with LLMs, classifying the techniques into Simple Augmentation, Prompt-based Augmentation, Retrieval-based Augmentation, and Hybrid Augmentation. In addition, we conduct extensive experiments across the four families of techniques, systematically compare and analyse their performance, and distil key insights. We then connect data augmentation with three critical optimisation techniques. Finally, we discuss existing challenges and future opportunities that could further improve data augmentation. This survey offers researchers and practitioners working with the text modality avenues to address data scarcity and improve data quality, helping scholars understand the evolution of text data augmentation from traditional methods to human-like generation and agent-based search in the era of LLMs.
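To make the surveyed categories concrete, the following is a minimal, self-contained sketch contrasting Simple Augmentation (here, a random adjacent-word swap) with Prompt-based Augmentation (filling a task-specific template to guide an LLM). The template wording, function names, and the stubbed-out model call are illustrative assumptions, not the survey's own implementation.

```python
import random

# Hypothetical paraphrase template: in prompt-based augmentation, a
# task-specific template guides the LLM to generate a label-preserving rewrite.
PARAPHRASE_TEMPLATE = (
    "Rewrite the following sentence while preserving its meaning and label.\n"
    "Label: {label}\n"
    "Sentence: {text}\n"
    "Rewrite:"
)

def build_augmentation_prompt(text: str, label: str) -> str:
    """Fill the template with one labelled seed example.

    The returned string would be sent to an LLM; the model call itself is
    omitted here because any provider API could be substituted.
    """
    return PARAPHRASE_TEMPLATE.format(label=label, text=text)

def simple_augment(text: str, rng: random.Random) -> str:
    """Simple-augmentation baseline: swap two adjacent words at random.

    Such rule-based perturbations need no model, but can distort meaning,
    which motivates the LLM-based alternatives the survey reviews.
    """
    words = text.split()
    if len(words) < 2:
        return text
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)
```

The simple variant only permutes the original tokens, whereas the prompt-based variant delegates rewriting to the model, trading determinism for fluency and diversity.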
Original language: English
Article number: 35
Number of pages: 66
Journal: Artificial Intelligence Review
Volume: 59
Issue number: 1
Early online date: 11 Dec 2025
DOIs
Publication status: Published - Jan 2026

Bibliographical note

Publisher Copyright:
© The Author(s) 2025.

Funding

This work was supported by the Research Impact Fund by the Research Grants Council of Hong Kong (Project No. 130272); a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (R1015-23); and the Faculty Research Grants (SDS24A8 and SDS24A19) and the Direct Grant (DR25E8) of Lingnan University, Hong Kong.

Keywords

  • Data augmentation
  • Large language models
  • Text processing
  • Natural language processing
