Learning Chinese word embeddings from semantic and phonetic components

Fu Lee WANG, Yuyin LU, Gary CHENG*, Haoran XIE, Yanghui RAO

*Corresponding author for this work

Research output: Journal Publications › Journal Article (refereed) › peer-review

1 Citation (Scopus)

Abstract

As an important task in Asian language information processing, Chinese word embedding learning has attracted much attention recently. Based on either Skip-gram or CBOW, several methods have been proposed to exploit Chinese characters and sub-character components for learning Chinese word embeddings. Chinese characters combine meaning, structure, and phonetic information (pinyin). However, previous work covers only the first two aspects and cannot effectively capture the distinct semantics of characters. To address this issue, we develop a pinyin-enhanced Skip-gram model named rsp2vec, as well as a radical and pinyin-enhanced Chinese word embedding (rPCWE) learning model based on CBOW. In our models, the phonetic information and semantic components of Chinese characters are encoded into embeddings simultaneously. Evaluations on word analogy reasoning, word relevance, text classification, named entity recognition, and case studies validate the effectiveness of our models.
Original language: English
Pages (from-to): 42805-42820
Number of pages: 16
Journal: Multimedia Tools and Applications
Volume: 81
Issue number: 29
Early online date: 10 Aug 2022
Publication status: Published - Dec 2022

Bibliographical note

Publisher Copyright:
© 2022, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.

Funding

The work described in this paper was fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (UGC/FDS16/E01/19), the One-off Special Fund from Central and Faculty Fund in Support of Research from 2019/20 to 2021/22 (MIT02/19-20), the Research Cluster Fund (RG 78/2019-2020R), the Interdisciplinary Research Scheme of the Dean’s Research Fund 2019-20 (FLASS/DRF/IDS-2) of The Education University of Hong Kong, and the Lam Woo Research Fund (LWI20011) of Lingnan University, Hong Kong.

Keywords

  • Chinese word embedding
  • Phonetic information
  • Semantic components
