Kei Uchiumi

Joint Optimization of Tokenization and Downstream Model

May 26, 2021

Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki

Abstract: Because traditional tokenizers are isolated from the downstream task and model, they cannot produce a tokenization suited to that task and model, although recent studies suggest that an appropriate tokenization improves performance. In this paper, we propose a novel method that finds an appropriate tokenization for a given downstream model by jointly optimizing a tokenizer and the model. The only requirement of the proposed method is that the tokenizer be trained with loss values computed by the downstream model, so the method can be applied to any NLP task. Moreover, the method can be used as post-processing to search for an appropriate tokenization for an already trained model. The method is therefore applicable in a wide range of settings. We evaluated whether our method improves performance on text classification in three languages and on machine translation in eight language pairs. Experimental results show that the proposed method improves performance by determining appropriate tokenizations.
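The abstract states only that the tokenizer is trained with loss values from the downstream model; it does not say how. As a rough illustration of one way such a signal could be formed, the sketch below weights the downstream loss of each candidate tokenization by that candidate's normalized probability and computes the gradient with respect to the candidate scores. The candidate-weighting scheme and the function names (`tokenization_weights`, `expected_loss`, `grad_wrt_log_probs`) are assumptions made for illustration, not the authors' confirmed implementation.

```python
import math


def tokenization_weights(log_probs):
    """Normalize candidate-tokenization log-probabilities with a softmax."""
    m = max(log_probs)  # subtract the max for numerical stability
    exps = [math.exp(lp - m) for lp in log_probs]
    total = sum(exps)
    return [e / total for e in exps]


def expected_loss(log_probs, losses):
    """Downstream loss averaged over candidates, weighted by probability."""
    weights = tokenization_weights(log_probs)
    return sum(w * l for w, l in zip(weights, losses))


def grad_wrt_log_probs(log_probs, losses):
    """Gradient of the expected loss w.r.t. each candidate's log-probability.

    For a softmax-weighted sum this is w_i * (loss_i - expected_loss):
    candidates with above-average loss get a positive gradient, so gradient
    descent lowers their probability (hypothetical training signal).
    """
    weights = tokenization_weights(log_probs)
    exp_loss = sum(w * l for w, l in zip(weights, losses))
    return [w * (l - exp_loss) for w, l in zip(weights, losses)]


if __name__ == "__main__":
    # Three hypothetical tokenizations of one sentence, with made-up scores.
    log_probs = [math.log(0.5), math.log(0.25), math.log(0.25)]
    losses = [1.0, 2.0, 4.0]  # made-up downstream losses per candidate
    print(expected_loss(log_probs, losses))
    print(grad_wrt_log_probs(log_probs, losses))
```

Under this assumed scheme, a gradient-descent step on the tokenizer's scores shifts probability toward tokenizations that give the downstream model a lower loss, which matches the abstract's description of a tokenizer driven only by downstream loss values.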

* Accepted at ACL-IJCNLP 2021 Findings

How LSTM Encodes Syntax: Exploring Context Vectors and Semi-Quantization on Natural Text

Oct 01, 2020

Chihiro Shibata, Kei Uchiumi, Daichi Mochihashi

Abstract: The Long Short-Term Memory recurrent neural network (LSTM) is widely used and known to capture informative long-term syntactic dependencies. However, how such information is reflected in its internal vectors for natural text has not yet been sufficiently investigated. We analyze these vectors by learning a language model in which syntactic structures are implicitly given. We empirically show that the context-update vectors, i.e., outputs of internal gates, are approximately quantized to binary or ternary values, which helps the language model count the depth of nesting accurately, as Suzgun et al. (2019) recently showed for synthetic Dyck languages. For some dimensions of the context vector, we show that the activations are highly correlated with the depth of phrase structures such as VP and NP. Moreover, with $L_1$ regularization, we find that whether a word is inside a phrase structure can be predicted accurately from a small number of components of the context vector. Even when learning from raw text, the context vectors still correlate well with the phrase structures. Finally, we show that natural clusters of function words and of the parts of speech that trigger phrases are represented in a small but principal subspace of the context-update vector of the LSTM.
