Cognitive Stylometry: A Computational Study of Defamiliarization in Modern Chinese

Research output: Journal Publications › Journal Article (refereed) › peer-review

Abstract

Autoregressive language models generate text by predicting the next word from the preceding context. The regularities internalized from specific training data make this mechanism a useful proxy for historically situated readerly expectations, reflecting what earlier linguistic communities would find probable or meaningful. In this article, I pre-train a GPT model (223M parameters) on a broad corpus of Chinese texts (FineWeb Edu V2.1) and fine-tune it on the collected writings of Mao Zedong (1893–1976) to simulate the evolving linguistic landscape of post-1949 China. Locating the token sequences with the sharpest drops in perplexity—a measure of the model's surprise—allows me to identify the core phraseology of "Maospeak," the militant language style that developed from Mao's writings and pronouncements. A comparative analysis of modern Chinese fiction reveals how literature becomes unfamiliar to the fine-tuned model, generating perplexity spikes of increasing magnitude. The findings suggest a mechanism of attentional control: whereas propaganda backgrounds meaning through repetition (cognitive overfitting), literature foregrounds it through deviation (non-anomalous surprise). By visualizing token sequences as perplexity landscapes with peaks and valleys, the article reconceives style as a probabilistic phenomenon and showcases the potential of "cognitive stylometry" for literary theory and close reading.
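The perplexity measure at the heart of the abstract can be sketched in a few lines: perplexity over a token window is the exponential of the mean negative log-probability the model assigns to those tokens, and "spikes" are windows whose perplexity rises well above the sequence baseline. The sketch below is a minimal, model-free illustration of that arithmetic; the window size, the spike threshold, and the function names are hypothetical choices, not the article's actual implementation.

```python
import math

def perplexity(logprobs):
    """Perplexity of a token window: exp of the mean negative log-probability.
    `logprobs` holds the model's log-probability for each token (natural log)."""
    return math.exp(-sum(logprobs) / len(logprobs))

def surprisal_spikes(logprobs, window=3, threshold=2.0):
    """Flag sliding windows whose perplexity exceeds `threshold` times the
    whole-sequence baseline (a hypothetical criterion for a 'spike').
    Returns (start_index, window_perplexity) pairs."""
    baseline = perplexity(logprobs)
    spikes = []
    for i in range(len(logprobs) - window + 1):
        p = perplexity(logprobs[i:i + window])
        if p > threshold * baseline:
            spikes.append((i, p))
    return spikes

# Toy sequence: familiar tokens (p = 0.5) surrounding a surprising stretch (p = 0.01).
logprobs = [math.log(0.5)] * 6 + [math.log(0.01)] * 3 + [math.log(0.5)] * 6
print(surprisal_spikes(logprobs))  # windows overlapping the improbable stretch
```

On this toy input, only windows dominated by the improbable tokens clear the threshold, which is the intuition behind reading a text as a "perplexity landscape" of peaks and valleys.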
Original language: English
Number of pages: 17
Journal: Computational Humanities Research
Early online date: 5 Dec 2025
Publication status: E-pub ahead of print - 5 Dec 2025

Keywords

  • large language model
  • predictive coding
  • perplexity
  • Chinese literature
  • information theory
