With the increasing demand for high-quality Chinese word embeddings in natural language processing, Chinese word embedding learning has attracted wide attention in recent years. Most existing research has focused on capturing word semantics from large-scale datasets. However, these methods struggle to obtain effective word embeddings from the limited data available in some specific fields. Observing the rich semantic information carried by Chinese fine-grained structures, we develop a model that fully fuses these structures as auxiliary information for word embedding learning. The proposed model views word context information as a combination of word, character, pronunciation, and component. In addition, it adds the semantic relationship between pronunciations and components as a constraint to exploit the auxiliary information comprehensively. Based on the decomposition of the shifted positive pointwise mutual information matrix, our model can effectively generate Chinese word embeddings from small-scale data. The results of word analogy, word similarity, and named entity recognition experiments conducted on two public datasets show the effectiveness of the proposed model in capturing Chinese word semantics with limited data.
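As a rough illustration of the matrix-factorization step mentioned in the abstract, the sketch below builds a shifted positive pointwise mutual information (SPPMI) matrix from a word-context co-occurrence count matrix and factorizes it with a truncated SVD. This is only the generic word-level technique, not the proposed model: the fusion of character, pronunciation, and component contexts and the pronunciation-component constraint are not reproduced here. The function names, the toy count matrix, the shift value, and the choice of SVD as the decomposition are all assumptions made for illustration.

```python
import numpy as np


def sppmi_matrix(counts: np.ndarray, shift_k: float = 1.0) -> np.ndarray:
    """Return max(PMI(w, c) - log(shift_k), 0) for a co-occurrence count matrix.

    `counts[i, j]` is how often word i appears with context j (hypothetical input).
    """
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)   # word marginals
    col = counts.sum(axis=0, keepdims=True)   # context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row * col))
    # Pairs that never co-occur get -inf so that they are clipped to 0 below.
    pmi = np.where(counts > 0, pmi, -np.inf)
    return np.maximum(pmi - np.log(shift_k), 0.0)


def embed(sppmi: np.ndarray, dim: int) -> np.ndarray:
    """Factorize the SPPMI matrix with a truncated SVD (one possible decomposition)."""
    u, s, _ = np.linalg.svd(sppmi, full_matrices=False)
    # Scale the left singular vectors by the square root of the singular values.
    return u[:, :dim] * np.sqrt(s[:dim])


# Toy 4-word x 4-context count matrix, invented purely for illustration.
toy_counts = np.array(
    [
        [0, 4, 1, 0],
        [4, 0, 0, 1],
        [1, 0, 0, 5],
        [0, 1, 5, 0],
    ],
    dtype=float,
)

toy_sppmi = sppmi_matrix(toy_counts, shift_k=1.0)
toy_vectors = embed(toy_sppmi, dim=2)
```

Each row of `toy_vectors` is a 2-dimensional embedding for one word. Raising `shift_k` above 1 subtracts a larger constant from every PMI value, so more weak associations are clipped to zero.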
|Title of host publication|Web and Big Data: 5th International Joint Conference, APWeb-WAIM 2021, Guangzhou, China, August 23–25, 2021, Proceedings, Part I|
|Editors|Leong HOU U, Marc SPANIOL, Yasushi SAKURAI, Junying CHEN|
|Publisher|Springer Science and Business Media Deutschland GmbH|
|Publication status|Published - 2021|
|Event|5th International Joint Conference on Asia-Pacific Web and Web-Age Information Management, APWeb-WAIM 2021 - Guangzhou, China. Duration: 23 Aug 2021 → 25 Aug 2021|
|Series|Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)|
Bibliographical note: We are grateful to the reviewers for their valuable comments. This work has been supported by the National Natural Science Foundation of China (61972426) and Guangdong Basic and Applied Basic Research Foundation (2020A1515010536).
© 2021, Springer Nature Switzerland AG.
- Chinese word embedding
- Matrix factorization
- Chinese fine-grained information