Jan Buys

A Systematic Analysis of Subwords and Cross-Lingual Transfer in Multilingual Translation

Mar 29, 2024

Francois Meyer, Jan Buys

Abstract: Multilingual modelling can improve machine translation for low-resource languages, partly through shared subword representations. This paper studies the role of subword segmentation in cross-lingual transfer. We systematically compare the efficacy of several subword methods in promoting synergy and preventing interference across different linguistic typologies. Our findings show that subword regularisation boosts synergy in multilingual modelling, whereas BPE more effectively facilitates transfer during cross-lingual fine-tuning. Notably, our results suggest that differences in orthographic word boundary conventions (the morphological granularity of written words) may impede cross-lingual transfer more significantly than linguistic unrelatedness. Our study confirms that decisions around subword modelling can be key to optimising the benefits of multilingual modelling.

Triples-to-isiXhosa (T2X): Addressing the Challenges of Low-Resource Agglutinative Data-to-Text Generation

Mar 12, 2024

Francois Meyer, Jan Buys

Abstract: Most data-to-text datasets are for English, so the difficulties of modelling data-to-text for low-resource languages are largely unexplored. In this paper we tackle data-to-text for isiXhosa, which is low-resource and agglutinative. We introduce Triples-to-isiXhosa (T2X), a new dataset based on a subset of WebNLG, which presents a new linguistic context that shifts modelling demands to subword-driven techniques. We also develop an evaluation framework for T2X that measures how accurately generated text describes the data. This enables future users of T2X to go beyond surface-level metrics in evaluation. On the modelling side we explore two classes of methods: dedicated data-to-text models trained from scratch and pretrained language models (PLMs). We propose a new dedicated architecture aimed at agglutinative data-to-text, the Subword Segmental Pointer Generator (SSPG). It jointly learns to segment words and copy entities, and outperforms existing dedicated models for 2 agglutinative languages (isiXhosa and Finnish). We investigate pretrained solutions for T2X, which reveals that standard PLMs come up short. Fine-tuning machine translation models emerges as the best method overall. These findings underscore the distinct challenge presented by T2X: neither well-established data-to-text architectures nor customary pretrained methodologies prove optimal. We conclude with a qualitative analysis of generation errors and an ablation study.

Multipath parsing in the brain

Jan 31, 2024

Berta Franzluebbers, Donald Dunagan, Miloš Stanojević, Jan Buys, John T. Hale

Abstract: Humans understand sentences word-by-word, in the order that they hear them. This incrementality entails resolving temporary ambiguities about syntactic relationships. We investigate how humans process these syntactic ambiguities by correlating predictions from incremental generative dependency parsers with timecourse data from people undergoing functional neuroimaging while listening to an audiobook. In particular, we compare competing hypotheses regarding the number of developing syntactic analyses in play during word-by-word comprehension: one vs more than one. This comparison involves evaluating syntactic surprisal from a state-of-the-art dependency parser with LLM-adapted encodings against an existing fMRI dataset. In both English and Chinese data, we find evidence for multipath parsing. Brain regions associated with this multipath effect include bilateral superior temporal gyrus.

* 15 pages

Subword Segmental Machine Translation: Unifying Segmentation and Target Sentence Generation

May 11, 2023

Francois Meyer, Jan Buys

Abstract: Subword segmenters like BPE operate as a preprocessing step in neural machine translation and other (conditional) language models. They are applied to datasets before training, so translation or text generation quality relies on the quality of segmentations. We propose a departure from this paradigm, called subword segmental machine translation (SSMT). SSMT unifies subword segmentation and MT in a single trainable model. It learns to segment target sentence words while jointly learning to generate target sentences. To use SSMT during inference we propose dynamic decoding, a text generation algorithm that adapts segmentations as it generates translations. Experiments across 6 translation directions show that SSMT improves chrF scores for morphologically rich agglutinative languages. Gains are strongest in the very low-resource scenario. SSMT also learns subwords that are closer to morphemes compared to baselines and proves more robust on a test set constructed for evaluating morphological compositional generalisation.

Subword Segmental Language Modelling for Nguni Languages

Oct 12, 2022

Francois Meyer, Jan Buys

Abstract: Subwords have become the standard units of text in NLP, enabling efficient open-vocabulary models. With algorithms like byte-pair encoding (BPE), subword segmentation is viewed as a preprocessing step applied to the corpus before training. This can lead to sub-optimal segmentations for low-resource languages with complex morphologies. We propose a subword segmental language model (SSLM) that learns how to segment words while being trained for autoregressive language modelling. By unifying subword segmentation and language modelling, our model learns subwords that optimise LM performance. We train our model on the 4 Nguni languages of South Africa. These are low-resource agglutinative languages, so subword information is critical. As an LM, SSLM outperforms existing approaches such as BPE-based models on average across the 4 languages. Furthermore, it outperforms standard subword segmenters on unsupervised morphological segmentation. We also train our model as a word-level sequence model, resulting in an unsupervised morphological segmenter that outperforms existing methods by a large margin for all 4 languages. Our results show that learning subword segmentation is an effective alternative to existing subword segmenters, enabling the model to discover morpheme-like subwords that improve its LM capabilities.
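
For readers unfamiliar with the baseline this abstract contrasts against, the following is a minimal sketch of how BPE merges are typically learned as a preprocessing step. It is not the SSLM model and is not taken from the paper: the function names (`learn_bpe`, `merge_pair`, `segment`), the toy word counts, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merges from a {word: count} dictionary."""
    # Start from characters: each word is a tuple of single-character symbols.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # (an arbitrary choice made here only to keep the sketch deterministic).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


if __name__ == "__main__":
    toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}  # hypothetical data
    merges = learn_bpe(toy_counts, num_merges=2)
    print(merges)
    print(segment("newest", merges))
```

Because the merge list is fixed before any model sees the data, the segmentation cannot adapt to the downstream objective, which is the limitation the abstract describes.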

Low-Resource Language Modelling of South African Languages

Apr 01, 2021

Stuart Mesham, Luc Hayward, Jared Shapiro, Jan Buys

Abstract: Language models are the foundation of current neural network-based models for natural language understanding and generation. However, research on the intrinsic performance of language models on African languages has been extremely limited, which is made more challenging by the lack of large or standardised training and evaluation sets that exist for English and other high-resource languages. In this paper, we evaluate the performance of open-vocabulary language models on low-resource South African languages, using byte-pair encoding to handle the rich morphology of these languages. We evaluate different variants of n-gram models, feedforward neural networks, recurrent neural networks (RNNs), and Transformers on small-scale datasets. Overall, well-regularized RNNs give the best performance across two isiZulu and one Sepedi datasets. Multilingual training further improves performance on these datasets. We hope that this research will open new avenues for research into multilingual and low-resource language modelling for African languages.

* AfricaNLP workshop at EACL 2021

Canonical and Surface Morphological Segmentation for Nguni Languages

Apr 01, 2021

Tumi Moeng, Sheldon Reay, Aaron Daniels, Jan Buys

Abstract: Morphological segmentation involves decomposing words into morphemes, the smallest meaning-bearing units of language. This is an important NLP task for morphologically-rich agglutinative languages such as the Southern African Nguni language group. In this paper, we investigate supervised and unsupervised models for two variants of morphological segmentation: canonical and surface segmentation. We train sequence-to-sequence models for canonical segmentation, where the underlying morphemes may not be equal to the surface form of the word, and Conditional Random Fields (CRF) for surface segmentation. Transformers outperform LSTMs with attention on canonical segmentation, obtaining an average F1 score of 72.5% across 4 languages. Feature-based CRFs outperform bidirectional LSTM-CRFs to obtain an average of 97.1% F1 on surface segmentation. In the unsupervised setting, an entropy-based approach using a character-level LSTM language model fails to outperform a Morfessor baseline, while on some of the languages neither approach performs much better than a random baseline. We hope that the high performance of the supervised segmentation models will help to facilitate the development of better NLP tools for Nguni languages.

* AfricaNLP workshop at EACL 2021

BottleSum: Unsupervised and Self-supervised Sentence Summarization using the Information Bottleneck Principle

Sep 20, 2019

Peter West, Ari Holtzman, Jan Buys, Yejin Choi

Abstract: The principle of the Information Bottleneck (Tishby et al. 1999) is to produce a summary of information X optimized to predict some other relevant information Y. In this paper, we propose a novel approach to unsupervised sentence summarization by mapping the Information Bottleneck principle to a conditional language modelling objective: given a sentence, our approach seeks a compressed sentence that can best predict the next sentence. Our iterative algorithm under the Information Bottleneck objective searches gradually shorter subsequences of the given sentence while maximizing the probability of the next sentence conditioned on the summary. Using only pretrained language models with no direct supervision, our approach can efficiently perform extractive sentence summarization over a large corpus. Building on our unsupervised extractive summarization (BottleSumEx), we then present a new approach to self-supervised abstractive summarization (BottleSumSelf), where a transformer-based language model is trained on the output summaries of our unsupervised method. Empirical results demonstrate that our extractive method outperforms other unsupervised models on multiple automatic metrics. In addition, we find that our self-supervised abstractive model outperforms unsupervised baselines (including our own) by human evaluation along multiple attributes.

Neural Text Generation from Rich Semantic Representations

Apr 25, 2019

Valerie Hajdik, Jan Buys, Michael W. Goodman, Emily M. Bender

Abstract: We propose neural models to generate high-quality text from structured representations based on Minimal Recursion Semantics (MRS). MRS is a rich semantic representation that encodes more precise semantic detail than other representations such as Abstract Meaning Representation (AMR). We show that a sequence-to-sequence model that maps a linearization of Dependency MRS, a graph-based representation of MRS, to English text can achieve a BLEU score of 66.11 when trained on gold data. The performance can be improved further using a high-precision, broad-coverage grammar-based parser to generate a large silver training corpus, achieving a final BLEU score of 77.17 on the full test set, and 83.37 on the subset of test data most closely matching the silver data domain. Our results suggest that MRS-based representations are a good choice for applications that need both structured semantics and the ability to produce natural language text as output.

* NAACL 2019

The Curious Case of Neural Text Degeneration

Apr 22, 2019

Ari Holtzman, Jan Buys, Maxwell Forbes, Yejin Choi

Abstract: Despite considerable advancements with deep neural language models, the enigma of neural text degeneration persists when these models are tested as text generators. The counter-intuitive empirical observation is that even though the use of likelihood as training objective leads to high quality models for a broad range of language understanding tasks, using likelihood as a decoding objective leads to text that is bland and strangely repetitive. In this paper, we reveal surprising distributional differences between human text and machine text. In addition, we find that decoding strategies alone can dramatically affect the quality of machine text, even when generated from exactly the same neural language model. Our findings motivate Nucleus Sampling, a simple but effective method to draw the best out of neural generation. By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better demonstrates the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.

* 9 pages
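
The sampling step described in the abstract above can be sketched roughly as follows: keep the smallest set of highest-probability tokens whose cumulative mass reaches a threshold, renormalise, and sample from that set. This is an illustrative sketch over a plain list of probabilities, not the authors' implementation; the function names, the `top_p` parameter name, and the default value of 0.9 are assumptions.

```python
import random


def nucleus_filter(probs, top_p=0.9):
    """Return the renormalised nucleus as {token_index: probability}.

    `probs` is a next-token distribution given as a list of probabilities.
    """
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break  # the low-probability tail beyond this point is dropped
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def nucleus_sample(probs, top_p=0.9, rng=random):
    """Draw one token index from the truncated, renormalised distribution."""
    nucleus = nucleus_filter(probs, top_p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token distribution
    print(nucleus_filter(toy_probs, top_p=0.7))
    print(nucleus_sample(toy_probs, top_p=0.7))
```

The size of the kept set varies with the shape of the distribution: a peaked distribution yields a small nucleus, a flat one yields a large nucleus, which is what makes the truncation dynamic rather than a fixed top-k cut.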