Pavel Rychlý

LombardoGraphia: Automatic Classification of Lombard Orthography Variants

Mar 30, 2026

Edoardo Signoroni, Pavel Rychlý

Abstract: Lombard, an under-resourced language variety spoken by approximately 3.8 million people in Northern Italy and Southern Switzerland, lacks a unified orthographic standard. Multiple orthographic systems exist, creating challenges for NLP resource development and model training. This paper presents the first study of automatic Lombard orthography classification, together with LombardoGraphia, a curated corpus of 11,186 Lombard Wikipedia samples tagged across 9 orthographic variants, and models for automatic orthography classification. We curate the dataset by processing and filtering raw Wikipedia content to ensure the text is suitable for orthographic analysis. We train 24 traditional and neural classification models with various features and encoding levels. Our best models achieve 96.06% overall accuracy and 85.78% average class accuracy, though performance on minority classes remains challenging due to data imbalance. Our work provides crucial infrastructure for building variety-aware NLP resources for Lombard.

* To be published at LREC 2026
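
The abstract reports two figures, overall accuracy and average class accuracy, and notes that minority classes suffer from data imbalance. The sketch below illustrates why the two numbers can diverge. It assumes that "average class accuracy" means the unweighted mean of per-class accuracy (per-class recall); the abstract does not define the metric, so this reading is an assumption. The labels `A` and `B` and the toy predictions are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def overall_accuracy(gold, pred):
    """Fraction of all samples whose predicted label matches the gold label."""
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)


def average_class_accuracy(gold, pred):
    """Unweighted mean of per-class accuracy (assumed definition)."""
    totals = defaultdict(int)  # number of gold samples per class
    hits = defaultdict(int)    # correctly predicted samples per class
    for g, p in zip(gold, pred):
        totals[g] += 1
        hits[g] += int(g == p)
    per_class = [hits[c] / totals[c] for c in totals]
    return sum(per_class) / len(per_class)


# Hypothetical imbalanced toy data: 8 samples of class "A", 2 of class "B".
gold = ["A"] * 8 + ["B"] * 2
# The classifier gets every "A" right but only one of the two "B" samples.
pred = ["A"] * 8 + ["A", "B"]

print(overall_accuracy(gold, pred))
print(average_class_accuracy(gold, pred))
```

On this toy data the majority class dominates the overall score, while the class-averaged score weights the small class equally, so a single error on the minority class lowers it much more.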

A survey of neural-network-based methods utilising comparable data for finding translation equivalents

Oct 19, 2024

Michaela Denisová, Pavel Rychlý

Abstract: The importance of inducing bilingual dictionary components in many natural language processing (NLP) applications is indisputable. However, the dictionary compilation process requires extensive work and combines two disciplines, NLP and lexicography, while the former often omits the latter. In this paper, we present the most common approaches from NLP that endeavour to automatically induce one of the essential dictionary components, translation equivalents, and focus on the neural-network-based methods using comparable data. We analyse them from a lexicographic perspective, since lexicographic viewpoints are crucial for improving the described methods. Moreover, we identify the methods that integrate these viewpoints and can be further exploited in various applications that require them. This survey encourages a connection between the NLP and lexicography fields, as the NLP field can benefit from lexicographic insights, and it serves as helpful and inspiring material for further research in the context of neural-network-based methods utilising comparable data.

Evaluation of Automatically Constructed Word Meaning Explanations

Feb 27, 2023

Marie Stará, Pavel Rychlý, Aleš Horák

Abstract: Preparing exact and comprehensive word meaning explanations is one of the key steps in the process of monolingual dictionary writing. In standard methodology, the explanations need an expert lexicographer who spends a substantial amount of time checking the consistency between the descriptive text and corpus evidence. In the following text, we present a new tool that derives explanations automatically based on collective information from very large corpora, particularly on word sketches. We also propose a quantitative evaluation of the constructed explanations, concentrating on explanations of nouns. The methodology is to a certain extent language independent; however, the presented verification is limited to Czech and English. We show that the presented approach makes it possible to create explanations that contain data useful for understanding the word meaning in approximately 90% of cases. However, in many cases, the result requires post-editing to remove redundant information.

* Logically Speaking: A Festschrift for Marie Duzi, pp. 99-112, College Publications, UK, 2022, ISBN 978-1-84890-419-4
* preprint of a chapter published by College Publications at https://www.collegepublications.co.uk/tributes/?00049
