Chinese Word Embedding Learning with Limited Data

Shurui CHEN, Yufu CHEN, Yuyin LU, Yanghui RAO, Haoran XIE, Qing LI

Research output: Book Chapters | Papers in Conference Proceedings > Conference paper (refereed) > Research > peer-review

Abstract

With the increasing demand for high-quality Chinese word embeddings in natural language processing, Chinese word embedding learning has attracted wide attention in recent years. Most existing research has focused on capturing word semantics from large-scale datasets. However, these methods struggle to produce effective word embeddings from the limited data available in some specific fields. Observing that Chinese fine-grained structures carry rich semantic information, we develop a model that fully fuses these structures as auxiliary information for word embedding learning. The proposed model views word context information as a combination of words, characters, pronunciations, and components. In addition, it adds the semantic relationship between pronunciations and components as a constraint to exploit the auxiliary information comprehensively. Based on the decomposition of the shifted positive pointwise mutual information matrix, our model can effectively generate Chinese word embeddings from small-scale data. The results of word analogy, word similarity, and named entity recognition experiments conducted on two public datasets show the effectiveness of our proposed model in capturing Chinese word semantics with limited data.
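
The matrix-factorization step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' model: it covers only a plain word-by-context co-occurrence matrix and leaves out the character, pronunciation, and component information and the pronunciation-component constraint. The function names, the toy counts, the shift value, and the square-root weighting of singular values are all assumptions chosen for illustration.

```python
import numpy as np


def sppmi_matrix(cooc, shift_k=1.0):
    """Shifted positive PMI: max(PMI(w, c) - log(shift_k), 0)."""
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)   # word marginals
    col = cooc.sum(axis=0, keepdims=True)   # context marginals
    # Zero counts give log(0) = -inf; empty rows/columns give nan.
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc * total / (row * col))
    sppmi = np.maximum(pmi - np.log(shift_k), 0.0)
    sppmi[~np.isfinite(sppmi)] = 0.0        # clear any remaining nan entries
    return sppmi


def embed(sppmi, dim):
    """Truncated SVD of the SPPMI matrix; rows are word vectors."""
    u, s, _ = np.linalg.svd(sppmi, full_matrices=False)
    # Symmetric weighting by sqrt of singular values (one common choice).
    return u[:, :dim] * np.sqrt(s[:dim])


# Hypothetical toy co-occurrence counts for a four-word vocabulary.
counts = np.array([
    [0, 4, 1, 0],
    [4, 0, 1, 0],
    [1, 1, 0, 3],
    [0, 0, 3, 0],
])
sppmi = sppmi_matrix(counts, shift_k=1.0)
vectors = embed(sppmi, dim=2)
```

A larger `shift_k` zeroes out more weakly associated pairs, which makes the matrix sparser before it is factorized.
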
Original language: English
Title of host publication: Web and Big Data: 5th International Joint Conference, APWeb-WAIM 2021, Guangzhou, China, August 23–25, 2021, Proceedings, Part I
Editors: Leong HOU U, Marc SPANIOL, Yasushi SAKURAI, Junying CHEN
Publisher: Springer, Cham
Pages: 211-226
Number of pages: 16
ISBN (Electronic): 9783030858964
ISBN (Print): 9783030858957
Publication status: E-pub ahead of print - 19 Aug 2021

Publication series

Name: Lecture Notes in Computer Science
Publisher: Springer
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Bibliographical note

We are grateful to the reviewers for their valuable comments. This work has been supported by the National Natural Science Foundation of China (61972426) and the Guangdong Basic and Applied Basic Research Foundation (2020A1515010536).

Keywords

  • Chinese word embedding
  • Matrix factorization
  • Chinese fine-grained information
