Cantonese to Written Chinese Translation via HuggingFace Translation Pipeline

Raptor Yick Kan KWOK, Siu Kei AY YEUNG, Zongxi LI, Kevin HUNG

Research output: Book Chapters | Papers in Conference Proceedings; Conference paper (refereed); Research; peer-review

Abstract

Cantonese is a low-resource language [5] that has been used in Southeastern China for hundreds of years and has over 85 million native speakers worldwide, yet it is poorly supported by the mainstream language models behind existing translation platforms such as Baidu, Google and Bing. This paper presents a large parallel corpus of 130 thousand Cantonese and Written Chinese pairs. The data are used to train a translation model using the translation pipeline of the Hugging Face Transformers architecture, a dominant architecture for natural language processing nowadays [18]. Performance is evaluated with the BLEU score and manual assessment. The translation results achieve a BLEU score of 41.35 and a chrF++ score of 44.88 on the entire validation set. The model also works reasonably well on long sentences of over 20 Chinese characters, achieving a BLEU score of 48.61 and a chrF++ score of 39.87. These results are comparable with those of the existing Baidu Fanyi and Bing Translate. We also establish a Cantonese sentence evaluation metric with which professional translators classify the quality of the source Cantonese sentence. We then compare the BLEU and chrF++ scores with the corresponding evaluation scores and find that the better the quality of the source sentence, the higher the BLEU and chrF++ scores. Lastly, we demonstrate that our corpus enables the Cantonese translation capability of the Chinese BART pre-trained model.
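As an illustration of the kind of scoring the abstract describes, the following is a minimal sketch of corpus-level BLEU over character tokens, plus a helper that selects "long" sentences of over 20 Chinese characters. The abstract does not state which scoring tool, tokenization, or smoothing the authors used, so the character-level tokenization, the single reference per sentence, the lack of smoothing, and the names `char_ngrams`, `corpus_bleu`, and `is_long` are all assumptions made here for illustration. It is not expected to reproduce the reported scores.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return a Counter of character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return Counter(tuple(chars[i:i + n]) for i in range(len(chars) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100), character tokens, one reference each.

    Assumption: no smoothing, so any n-gram order with zero matches
    gives a score of 0.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += sum(1 for c in hyp if not c.isspace())
        ref_len += sum(1 for c in ref if not c.isspace())
        for n in range(1, max_n + 1):
            h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
            matches[n - 1] += sum((h & r).values())  # clipped n-gram matches
            totals[n - 1] += sum(h.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


def is_long(sentence, threshold=20):
    """True if the sentence has more than `threshold` CJK Unified Ideographs."""
    return sum(1 for c in sentence if "\u4e00" <= c <= "\u9fff") > threshold
```

A score for the long-sentence subset could then be obtained by keeping only the pairs whose source sentence satisfies `is_long` and passing them to `corpus_bleu`. For reportable numbers, an established scoring tool would be a safer choice than a hand-written metric.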

Original language: English
Title of host publication: NLPIR '23: Proceedings of the 2023 7th International Conference on Natural Language Processing and Information Retrieval
Publisher: Association for Computing Machinery
Pages: 77-84
Number of pages: 8
ISBN (Electronic): 9798400709227
DOIs
Publication status: Published - Dec 2023
Externally published: Yes
Event: 7th International Conference on Natural Language Processing and Information Retrieval, NLPIR 2023 - Hybrid, Seoul, Korea, Republic of
Duration: 15 Dec 2023 to 17 Dec 2023

Conference

Conference: 7th International Conference on Natural Language Processing and Information Retrieval, NLPIR 2023
Country/Territory: Korea, Republic of
City: Hybrid, Seoul
Period: 15/12/23 to 17/12/23

Bibliographical note

This work was supported in part by Hong Kong Metropolitan University R&D Funding R5097. We would like to thank Miss So Hei Yu for her hard work on translation.

Keywords

  • Cantonese
  • neural networks
  • translation
  • Written Chinese
