2kenize: Tying Subword Sequences for Chinese Script Conversion

Pranav A, Isabelle Augenstein

3 Apr 2020


Abstract

Simplified Chinese to Traditional Chinese script conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches have poor performance because they do not take into account that a simplified Chinese character can correspond to multiple traditional characters. Here, we propose a novel model that can disambiguate between mappings and convert between the two scripts. The model is based on subword segmentation, two language models, as well as a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese Character Conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to preprocess text for topic classification. An error analysis reveals that our method’s particular strengths are in dealing with code mixing and named entities.
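The one-to-many ambiguity mentioned in the abstract can be illustrated with a minimal sketch. This is not the 2kenize model: it does no subword segmentation and uses a toy bigram table in place of real language models. The mapping table `S2T`, the counts in `BIGRAM_COUNTS`, and the function names are all hypothetical and chosen only for illustration.

```python
from itertools import product

# Toy simplified-to-traditional table (assumption: real tables are far larger).
# The simplified character "发" corresponds to both "發" and "髮".
S2T = {
    "头": ["頭"],
    "发": ["發", "髮"],
    "现": ["現"],
}

# Hypothetical bigram counts standing in for a Traditional Chinese language model.
BIGRAM_COUNTS = {
    ("頭", "髮"): 50,  # "hair"
    ("發", "現"): 80,  # "discover"
}


def candidates(simplified: str) -> list[str]:
    """Enumerate every traditional string the mapping table allows."""
    options = [S2T.get(ch, [ch]) for ch in simplified]
    return ["".join(chars) for chars in product(*options)]


def score(text: str) -> int:
    """Sum the toy bigram counts over adjacent character pairs."""
    return sum(BIGRAM_COUNTS.get(pair, 0) for pair in zip(text, text[1:]))


def convert(simplified: str) -> str:
    """Pick the candidate that the toy bigram table scores highest."""
    return max(candidates(simplified), key=score)


print(convert("头发"))
print(convert("发现"))
```

The same simplified character resolves to different traditional characters depending on its neighbour, which a plain one-to-one lookup table cannot capture. Exhaustive enumeration as done here grows exponentially with the number of ambiguous characters, so it is only suitable for short illustrative inputs.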

Type

Conference paper

Publication

In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020)

Date

April 2020