Title: Multi-Modal Multi-Granularity Tokenizer for Chu Bamboo Slip Scripts

Sep 02, 2024

Yingfa Chen, Chenlong Hu, Cong Feng, Chenyang Song, Shi Yu, Xu Han, Zhiyuan Liu, Maosong Sun

Abstract: This study presents a multi-modal multi-granularity tokenizer specifically designed for analyzing ancient Chinese scripts, focusing on the Chu bamboo slip (CBS) script used during the Spring and Autumn and Warring States period (771-256 BCE) in ancient China. Ancient Chinese scripts have a complex hierarchical structure in which a single character may be a combination of multiple sub-characters. Our tokenizer therefore first applies character detection to locate character boundaries, and then conducts character recognition at both the character and sub-character levels. Moreover, to support the academic community, we have assembled the first large-scale dataset of CBSs, with over 100K annotated character image scans. On the part-of-speech tagging task built on our dataset, our tokenizer gives a 5.5% relative improvement in F1-score over mainstream sub-word tokenizers. Our work not only aids further investigation of this specific script but also has the potential to advance research on other forms of ancient Chinese scripts.

* 12 pages, 3 figures
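
The two-stage flow described in the abstract (detect character boundaries, then recognize each character at both the character and sub-character level) can be pictured with a minimal sketch. This is not the authors' implementation: the names `Token`, `tokenize_slip`, `to_sequence`, the box layout, and the toy detector and recognizers are all hypothetical, and modern characters stand in for CBS glyphs purely for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# Hypothetical box layout: (x, y, width, height). The abstract does not specify one.
Box = Tuple[int, int, int, int]


@dataclass
class Token:
    box: Box                   # where the character was detected
    character: str             # character-level label
    sub_characters: List[str]  # sub-character (component) labels


def tokenize_slip(
    image: Any,
    detect: Callable[[Any], List[Box]],
    recognize_char: Callable[[Any, Box], str],
    recognize_parts: Callable[[Any, Box], List[str]],
) -> List[Token]:
    """Detect character boxes first, then label each box at both granularities."""
    return [
        Token(box, recognize_char(image, box), recognize_parts(image, box))
        for box in detect(image)
    ]


def to_sequence(tokens: List[Token], granularity: str) -> List[str]:
    """Flatten tokens into a label sequence at the requested granularity."""
    if granularity == "character":
        return [t.character for t in tokens]
    if granularity == "sub-character":
        return [part for t in tokens for part in t.sub_characters]
    raise ValueError(f"unknown granularity: {granularity!r}")


# Toy stand-in for a scanned slip: a lookup table instead of real pixels.
# Assumption for illustration only; a real system would use trained models.
TOY_SLIP = {
    (0, 0, 10, 10): ("好", ["女", "子"]),
    (0, 12, 10, 10): ("明", ["日", "月"]),
}

tokens = tokenize_slip(
    TOY_SLIP,
    detect=lambda image: list(image.keys()),
    recognize_char=lambda image, box: image[box][0],
    recognize_parts=lambda image, box: list(image[box][1]),
)

print(to_sequence(tokens, "character"))      # ['好', '明']
print(to_sequence(tokens, "sub-character"))  # ['女', '子', '日', '月']
```

Keeping both label sets on each token lets a downstream model, such as a part-of-speech tagger, choose the granularity it consumes without rerunning detection.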
