With the increasing demand for high-quality Chinese word embeddings in natural language processing, Chinese word embedding learning has attracted wide attention in recent years. Most existing research has focused on capturing word semantics from large-scale datasets. However, these methods struggle to learn effective word embeddings from the limited data available in certain specific fields. Observing the rich semantic information carried by Chinese fine-grained structures, we develop a model that fully fuses these structures as auxiliary information for word embedding learning. The proposed model views the context of a word as a combination of words, characters, pronunciations, and components. In addition, it imposes the semantic relationship between pronunciations and components as a constraint to exploit the auxiliary information comprehensively. Based on the decomposition of a shifted positive pointwise mutual information (SPPMI) matrix, our model can effectively generate Chinese word embeddings from small-scale data. Results on word analogy, word similarity, and named entity recognition tasks over two public datasets show the effectiveness of the proposed model for capturing Chinese word semantics with limited data.
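To illustrate the factorization idea the abstract refers to, the following is a minimal sketch of SPPMI-based embedding learning in the style of Levy and Goldberg: build a co-occurrence matrix, shift and clip its PMI values, then take a truncated SVD. The toy corpus, window scheme, and hyperparameters (`k`, `d`) are illustrative assumptions, not the paper's actual setup; the paper additionally fuses character, pronunciation, and component contexts, which this sketch omits.

```python
import numpy as np
from itertools import combinations

# Toy corpus; the paper's model would use word/character/pronunciation/
# component contexts instead of plain word-word pairs.
corpus = [
    ["machine", "learning", "embedding"],
    ["word", "embedding", "learning"],
    ["word", "machine", "learning"],
]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within each sentence (whole-sentence window).
cooc = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for w, c in combinations(sent, 2):
        cooc[idx[w], idx[c]] += 1
        cooc[idx[c], idx[w]] += 1

total = cooc.sum()
word_counts = cooc.sum(axis=1)

# Shifted positive PMI: max(PMI - log k, 0), where k plays the role of
# the negative-sampling constant.
k = 1
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(cooc * total / np.outer(word_counts, word_counts))
sppmi = np.maximum(pmi - np.log(k), 0)
sppmi[~np.isfinite(sppmi)] = 0.0  # zero co-occurrences give -inf PMI

# Low-rank decomposition via truncated SVD; embeddings = U_d * sqrt(S_d).
d = 2
U, S, _ = np.linalg.svd(sppmi)
embeddings = U[:, :d] * np.sqrt(S[:d])
print(embeddings.shape)  # one d-dimensional vector per vocabulary word
```

The square-root weighting of the singular values is the common symmetric choice for splitting the factorization between word and context embeddings.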
|Title of host publication||Web and Big Data: 5th International Joint Conference, APWeb-WAIM 2021, Guangzhou, China, August 23–25, 2021, Proceedings, Part I|
|Editors||Leong Hou U, Marc Spaniol, Yasushi Sakurai, Junying Chen|
|Number of pages||16|
|Publication status||E-pub ahead of print - 19 Aug 2021|
|Name||Lecture Notes in Computer Science|
We are grateful to the reviewers for their valuable comments. This work has been supported by the National Natural Science Foundation of China (61972426) and Guangdong Basic and Applied Basic Research Foundation (2020A1515010536).
- Chinese word embedding
- Matrix factorization
- Chinese fine-grained information