Yui Oka

Wavelet-based Positional Representation for Long Context

Feb 04, 2025

Yui Oka, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito

Abstract: In the realm of large-scale language models, a significant challenge arises when extrapolating sequences beyond the maximum allowable length. This is because the model's position embedding mechanisms are limited to positions encountered during training, thus preventing effective representation of positions in longer sequences. We analyzed conventional position encoding methods for long contexts and found the following characteristics. (1) When the representation dimension is regarded as the time axis, Rotary Position Embedding (RoPE) can be interpreted as a restricted wavelet transform using Haar-like wavelets. However, because it uses only a fixed scale parameter, it does not fully exploit the advantages of wavelet transforms, which capture the fine movements of non-stationary signals using multiple scales (window sizes). This limitation could explain why RoPE performs poorly in extrapolation. (2) Previous research as well as our own analysis indicates that Attention with Linear Biases (ALiBi) functions similarly to windowed attention, using windows of varying sizes. However, it has limitations in capturing deep dependencies because it restricts the receptive field of the model. From these insights, we propose a new position representation method that captures multiple scales (i.e., window sizes) by leveraging wavelet transforms without limiting the model's attention field. Experimental results show that this new method improves the performance of the model in both short and long contexts. In particular, our method allows extrapolation of position information without limiting the model's attention field.

* Accepted to ICLR 2025. 28 pages, 11 figures
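
The abstract's observation that ALiBi behaves like windowed attention with windows of varying sizes can be illustrated with a small sketch. It uses the commonly cited ALiBi formulation (a per-head linear penalty on query-key distance, with geometric head slopes), not the wavelet-based method the paper proposes. The helper `effective_window`, the 95% mass threshold, and the assumption that all content scores are zero are illustrative choices that do not come from the paper.

```python
import math

def alibi_slopes(num_heads):
    """Geometric head slopes 2^(-8/n), 2^(-16/n), ... (usual recipe for power-of-two head counts)."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(slope, query_pos):
    """Bias added to the logits of one causal query: -slope * (query_pos - key_pos)."""
    return [-slope * (query_pos - key_pos) for key_pos in range(query_pos + 1)]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def effective_window(slope, query_pos, mass=0.95):
    """Hypothetical helper: number of most-recent keys that hold `mass` of the
    attention weight, assuming every content score is zero so only the bias matters."""
    weights = softmax(alibi_bias(slope, query_pos))
    covered = 0.0
    # Walk from the nearest key (distance 0) outwards.
    for width, w in enumerate(reversed(weights), start=1):
        covered += w
        if covered >= mass:
            return width
    return len(weights)

slopes = alibi_slopes(8)
windows = [effective_window(s, query_pos=1023) for s in slopes]
for s, w in zip(slopes, windows):
    print(f"slope={s:.5f}  effective window={w} tokens")
```

Under these assumptions, heads with steep slopes concentrate their attention on a handful of recent tokens, while heads with shallow slopes spread it over hundreds. That is the sense in which the linear bias acts like a set of soft windows of different sizes, and why it narrows the receptive field of the steep-slope heads.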

Using Perturbed Length-aware Positional Encoding for Non-autoregressive Neural Machine Translation

Jul 29, 2021

Yui Oka, Katsuhito Sudoh, Satoshi Nakamura

Abstract: Non-autoregressive neural machine translation (NAT) usually employs sequence-level knowledge distillation using autoregressive neural machine translation (AT) as its teacher model. However, a NAT model often outputs shorter sentences than an AT model. In this work, we propose sequence-level knowledge distillation (SKD) using perturbed length-aware positional encoding and apply it to a student model, the Levenshtein Transformer. Our method outperformed a standard Levenshtein Transformer by up to 2.5 bilingual evaluation understudy (BLEU) points on WMT14 German-to-English translation. The resulting NAT model also produced longer sentences than the baseline NAT models.

* 5 pages, 1 figure. Will be presented at ACL SRW 2021
