Abstract
Cantonese is a low-resource language [5] that has been spoken in Southeastern China for hundreds of years and has over 85 million native speakers worldwide, yet it is poorly supported by the mainstream language models behind existing translation platforms such as Baidu, Google and Bing. This paper presents a large parallel corpus of 130 thousand Cantonese and Written Chinese sentence pairs. The data are used to train a translation model with the translation pipeline of the Hugging Face Transformers architecture, a dominant architecture for natural language processing today [18]. Performance is evaluated with the BLEU score and manual assessment. The translation results achieve a BLEU score of 41.35 and a chrF++ score of 44.88 on the entire validation set. The model also works reasonably well on long sentences of over 20 Chinese characters, achieving a BLEU score of 48.61 and a chrF++ score of 39.87 on long sentences. These results are comparable with the existing Baidu Fanyi and Bing Translate. We also establish a Cantonese sentence evaluation metric under which professional translators classify the quality of the source Cantonese sentences. Comparing the BLEU and chrF++ scores with the corresponding evaluation scores, we find that the better the quality of the source sentence, the higher the BLEU and chrF++ scores. Finally, we show that our corpus enables Cantonese translation capability in the Chinese BART pre-trained model.
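As a rough illustration of the workflow summarised above, the sketch below loads a Chinese BART checkpoint with the Hugging Face Transformers library, decodes Written Chinese from Cantonese input, and scores the output with sacrebleu's BLEU and chrF++ implementations. This is a minimal sketch, not the authors' code: the checkpoint name `fnlp/bart-base-chinese`, the example sentence pair, and the decoding settings are illustrative assumptions, and in the paper the Chinese BART model is first fine-tuned on the 130k-pair Cantonese / Written Chinese corpus.

```python
# Hypothetical sketch: run a Chinese BART seq2seq model on Cantonese input and
# score the output with BLEU and chrF++ via sacrebleu. The checkpoint name and
# the example sentences are illustrative assumptions, not taken from the paper.
import sacrebleu
from transformers import BartForConditionalGeneration, BertTokenizer

MODEL_NAME = "fnlp/bart-base-chinese"  # assumed checkpoint; swap in the fine-tuned model
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BartForConditionalGeneration.from_pretrained(MODEL_NAME)

def translate(sentences, num_beams=4, max_length=128):
    """Beam-decode a batch of Cantonese sentences into Written Chinese."""
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**batch, num_beams=num_beams, max_length=max_length)
    # BertTokenizer decodes with spaces between characters; strip them for Chinese text.
    return [tokenizer.decode(o, skip_special_tokens=True).replace(" ", "") for o in outputs]

cantonese_sources = ["佢哋聽日唔得閒。"]    # illustrative Cantonese source
written_references = ["他們明天沒有空。"]  # illustrative Written Chinese reference
hypotheses = translate(cantonese_sources)

# Corpus-level BLEU (with sacrebleu's Chinese tokenizer) and chrF++ (word_order=2),
# the two automatic metrics reported in the abstract.
bleu = sacrebleu.corpus_bleu(hypotheses, [written_references], tokenize="zh")
chrf_pp = sacrebleu.corpus_chrf(hypotheses, [written_references], word_order=2)
print(f"BLEU = {bleu.score:.2f}, chrF++ = {chrf_pp.score:.2f}")
```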
Original language | English |
---|---|
Title of host publication | NLPIR '23: Proceedings of the 2023 7th International Conference on Natural Language Processing and Information Retrieval |
Publisher | Association for Computing Machinery |
Pages | 77-84 |
Number of pages | 8 |
ISBN (Electronic) | 9798400709227 |
Publication status | Published - Dec 2023 |
Externally published | Yes |
Event | 7th International Conference on Natural Language Processing and Information Retrieval, NLPIR 2023 - Hybrid, Seoul, Korea, Republic of. Duration: 15 Dec 2023 → 17 Dec 2023 |
Conference
Conference | 7th International Conference on Natural Language Processing and Information Retrieval, NLPIR 2023 |
---|---|
Country/Territory | Korea, Republic of |
City | Hybrid, Seoul |
Period | 15/12/23 → 17/12/23 |
Bibliographical note
This work was supported in part by Hong Kong Metropolitan University R&D Funding R5097. We would like to thank Miss So Hei Yu for her hard work on translation.
Keywords
- Cantonese
- neural networks
- translation
- Written Chinese