Abstract
Pre-trained language models of increasing size and complexity have demonstrated superior performance in many applications, but they usually require large training datasets to be adequately trained. Insufficient training data can lead the model to overfit and fail to cope with complex tasks. Large language models (LLMs) trained on extensive corpora have prominent text generation capabilities, which improve both the quality and quantity of data and play a crucial role in data augmentation. Specifically, distinctive prompt templates are provided for personalised tasks to guide LLMs in generating the required content. Recently, promising retrieval-based techniques have further enhanced the performance of LLMs in data augmentation by introducing external knowledge, enabling them to produce more grounded data. This survey provides an in-depth analysis of data augmentation in LLMs, classifying the techniques into Simple Augmentation, Prompt-based Augmentation, Retrieval-based Augmentation, and Hybrid Augmentation. Additionally, we conduct extensive experiments across the four techniques, systematically compare and analyse their performance, and provide key insights. Following this, we connect data augmentation with three critical optimisation techniques. Finally, we introduce existing challenges and future opportunities that could further improve data augmentation. This survey offers researchers and practitioners of the text modality avenues to address data scarcity and improve data quality, helping scholars understand the evolution of text data augmentation from traditional methods to human-like generation and agent search in the era of LLMs.
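The abstract contrasts rule-based "simple augmentation" with LLM-driven "prompt-based augmentation". The following is a minimal sketch of that distinction, not code from the survey: `random_deletion` and `build_augmentation_prompt` are hypothetical names, and the LLM call itself is deliberately omitted.

```python
import random

# Simple augmentation (rule-based): randomly delete words from a sentence.
def random_deletion(text: str, p: float = 0.2, seed: int = 0) -> str:
    """Drop each word with probability p, always keeping at least one word."""
    words = text.split()
    if not words:
        return text
    rng = random.Random(seed)  # seeded for reproducibility
    kept = [w for w in words if rng.random() > p]
    return " ".join(kept or [words[0]])

# Prompt-based augmentation: a task-specific template that would be sent
# to an LLM to generate label-preserving paraphrases of a training example.
PARAPHRASE_TEMPLATE = (
    "Rewrite the following sentence in {n} different ways, "
    "preserving its meaning and its label ({label}):\n{text}"
)

def build_augmentation_prompt(text: str, label: str, n: int = 3) -> str:
    return PARAPHRASE_TEMPLATE.format(n=n, label=label, text=text)
```

The retrieval-based and hybrid variants discussed in the survey would extend the second pattern by inserting retrieved external passages into the template before generation.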
| Original language | English |
|---|---|
| Article number | 35 |
| Number of pages | 66 |
| Journal | Artificial Intelligence Review |
| Volume | 59 |
| Issue number | 1 |
| Early online date | 11 Dec 2025 |
| DOIs | |
| Publication status | Published - Jan 2026 |
Bibliographical note
Publisher Copyright: © The Author(s) 2025.
Funding
This work was supported by the Research Impact Fund of the Research Grants Council of Hong Kong (Project No. 130272); a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (R1015-23); and the Faculty Research Grants (SDS24A8 and SDS24A19) and the Direct Grant (DR25E8) of Lingnan University, Hong Kong.
Keywords
- Data augmentation
- Large language models
- Text processing
- Natural language processing
Fingerprint
Dive into the research topics of 'Text data augmentation for large language models: a comprehensive survey of methods, challenges, and opportunities'. Together they form a unique fingerprint.
Projects
4 Active
- An Integrated Fake Financial News Detection Framework: Knowledge Graph, Large Language Models, Uncertainty Modeling, and Contrastive Learning
  XIE, H. (PI)
  1/07/25 → 30/06/27
  Project: Grant Research
- Automatic Weight Learning at Data-level and Task-level for Multitask Learning with the Application for Implicit Sentiment Analysis
  XIE, H. (PI)
  1/01/25 → 31/12/26
  Project: Grant Research
- Pretraining Language Model for Financial News Analysis
  XIE, H. (PI)
  1/01/25 → 31/12/26
  Project: Grant Research