Marko Robnik-Šikonja

Solving Word-Sense Disambiguation and Word-Sense Induction with Dictionary Examples

Mar 06, 2025

Tadej Škvorc, Marko Robnik-Šikonja

Abstract:Many less-resourced languages struggle with a lack of large, task-specific datasets that are required for solving relevant tasks with modern transformer-based large language models (LLMs). On the other hand, many linguistic resources, such as dictionaries, are rarely used in this context despite their large information contents. We show how LLMs can be used to extend existing language resources in less-resourced languages for two important tasks: word-sense disambiguation (WSD) and word-sense induction (WSI). We approach the two tasks through the related but much more accessible word-in-context (WiC) task where, given a pair of sentences and a target word, a classification model is tasked with predicting whether the sense of a given word differs between sentences. We demonstrate that a well-trained model for this task can distinguish between different word senses and can be adapted to solve the WSD and WSI tasks. The advantage of using the WiC task, instead of directly predicting senses, is that the WiC task does not need pre-constructed sense inventories with a sufficient number of examples for each sense, which are rarely available in less-resourced languages. We show that sentence pairs for the WiC task can be successfully generated from dictionary examples using LLMs. The resulting prediction models outperform existing models on WiC, WSD, and WSI tasks. We demonstrate our methodology on the Slovene language, where a monolingual dictionary is available, but word-sense resources are tiny.

* 12 pages, 1 figure
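
The abstract does not spell out how a WiC model is adapted to WSD. One plausible reading can be sketched as follows: score a new sentence against each sense's dictionary examples and pick the sense with the highest average "same sense" score. Everything here is an illustrative assumption rather than the authors' method. `toy_same_sense_score` is a hypothetical stand-in for a trained WiC classifier that uses plain context-word overlap, and the English example data is invented.

```python
from statistics import mean
from typing import Callable

# A scorer takes (target word, sentence A, sentence B) and returns a
# score in [0, 1] that the word has the same sense in both sentences.
Scorer = Callable[[str, str, str], float]


def toy_same_sense_score(word: str, sent_a: str, sent_b: str) -> float:
    """Hypothetical stand-in for a trained WiC model: context-word overlap."""
    ctx_a = set(sent_a.lower().split()) - {word}
    ctx_b = set(sent_b.lower().split()) - {word}
    if not ctx_a or not ctx_b:
        return 0.0
    return len(ctx_a & ctx_b) / len(ctx_a | ctx_b)


def disambiguate(
    word: str,
    sentence: str,
    sense_examples: dict[str, list[str]],
    scorer: Scorer = toy_same_sense_score,
) -> str:
    """Return the sense whose dictionary examples best match the sentence."""
    avg_score = {
        sense: mean(scorer(word, sentence, example) for example in examples)
        for sense, examples in sense_examples.items()
        if examples  # skip senses without any dictionary example
    }
    return max(avg_score, key=avg_score.get)


# Invented dictionary examples for the English word "bank".
SENSE_EXAMPLES = {
    "river": ["the bank of the river was muddy"],
    "finance": ["the bank approved the loan"],
}

predicted = disambiguate("bank", "she sat on the bank of the river", SENSE_EXAMPLES)
```

In a real setting the scorer would be replaced by a fine-tuned transformer classifier; the surrounding selection logic would stay the same.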

Neural spell-checker: Beyond words with synthetic data generation

Oct 30, 2024

Matej Klemen, Martin Božič, Špela Arhar Holdt, Marko Robnik-Šikonja

Abstract:Spell-checkers are valuable tools that enhance communication by identifying misspelled words in written texts. Recent improvements in deep learning, and in particular in large language models, have opened new opportunities to improve traditional spell-checkers with new functionalities that not only assess spelling correctness but also the suitability of a word for a given context. In our work, we present and compare two new spell-checkers and evaluate them on synthetic, learner, and more general-domain Slovene datasets. The first spell-checker is a traditional, fast, word-based approach, based on a morphological lexicon with a significantly larger word list compared to existing spell-checkers. The second approach uses a language model trained on a large corpus with synthetically inserted errors. We present the training data construction strategies, which turn out to be a crucial component of neural spell-checkers. Further, the proposed neural model significantly outperforms all existing spell-checkers for Slovene in both precision and recall.

* Camera-ready version. Accepted to TSD 2024

Sarcasm Detection in a Less-Resourced Language

Oct 16, 2024

Lazar Đoković, Marko Robnik-Šikonja

Abstract:The sarcasm detection task in natural language processing aims to classify whether an utterance is sarcastic or not. It is related to sentiment analysis since sarcasm often inverts surface sentiment. Because sarcastic sentences are highly dependent on context and are often accompanied by various non-verbal cues, the task is challenging. Most related work focuses on high-resourced languages like English. To build a sarcasm detection dataset for a less-resourced language, such as Slovenian, we leverage two modern techniques: a medium-sized transformer model specialized for machine translation, and a very large generative language model. We explore the viability of translated datasets and how the size of a pretrained transformer affects its ability to detect sarcasm. We train ensembles of detection models and evaluate the models' performance. The results show that larger models generally outperform smaller ones and that ensembling can slightly improve sarcasm detection performance. Our best ensemble approach achieves an $\text{F}_1$-score of 0.765, which is close to annotators' agreement in the source language.

* Proceedings of the 27th International Multiconference INFORMATION SOCIETY - IS 2024, Volume A, 2024, pages 19-22
* 4 pages, published in the Slovenian Conference on Artificial Intelligence

Generative Model for Less-Resourced Language with 1 billion parameters

Oct 09, 2024

Domen Vreš, Martin Božič, Aljaž Potočnik, Tomaž Martinčič, Marko Robnik-Šikonja

Abstract:Large language models (LLMs) are a basic infrastructure for modern natural language processing. Many commercial and open-source LLMs exist for English, e.g., ChatGPT, Llama, Falcon, and Mistral. As these models are trained on mostly English texts, their fluency and knowledge of low-resource languages and societies are superficial. We present the development of large generative language models for a less-resourced language. GaMS 1B - Generative Model for Slovene with 1 billion parameters was created by continuing pretraining of the existing English OPT model. We developed a new tokenizer adapted to the Slovene, Croatian, and English languages and used the embedding initialization methods FOCUS and WECHSEL to transfer the embeddings from the English OPT model. We evaluate our models on several classification datasets from the Slovene suite of benchmarks and on the generative sentence simplification task SENTA. We used only few-shot in-context learning with our models, which are not yet instruction-tuned. For classification tasks, in this mode, the generative models lag behind the existing Slovene BERT-type models fine-tuned for specific tasks. On the sentence simplification task, the GaMS models achieve comparable or better performance than the GPT-3.5-Turbo model.

Retrieval-augmented code completion for local projects using large language models

Aug 09, 2024

Marko Hostnik, Marko Robnik-Šikonja

Abstract:The use of large language models (LLMs) is becoming increasingly widespread among software developers. However, privacy concerns and computational requirements make commercial solutions and the use of LLMs problematic. In this work, we focus on using LLMs with around 160 million parameters that are suitable for local execution and augmentation with retrieval from local projects. We train two models based on the transformer architecture, the generative model GPT-2 and the retrieval-adapted RETRO model, on open-source Python files, and empirically evaluate and compare them, confirming the benefits of vector embedding based retrieval. Further, we improve our models' performance with in-context retrieval-augmented generation, which retrieves code snippets based on the Jaccard similarity of tokens. We evaluate in-context retrieval-augmented generation on larger models and conclude that, despite its simplicity, the approach is more suitable than using the RETRO architecture. We highlight the key role of proper tokenization in achieving the full potential of LLMs in code completion.

* 28 pages, 14 figures
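
Retrieval by Jaccard similarity of tokens, as mentioned in the abstract, can be illustrated with a minimal sketch. The regex tokenizer, the function names, and the way the retrieved snippet is prepended to the prompt are assumptions made for illustration; the paper may tokenize and assemble context differently.

```python
import re


def tokenize(code: str) -> set[str]:
    """Split code into a set of identifier and number tokens (a simplification)."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+", code))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve(query: str, snippets: list[str], k: int = 1) -> list[str]:
    """Return the k snippets whose token sets are most similar to the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        snippets,
        key=lambda snippet: jaccard(query_tokens, tokenize(snippet)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, snippets: list[str], k: int = 1) -> str:
    """Prepend retrieved snippets to the unfinished code as in-context examples."""
    context = "\n\n".join(retrieve(query, snippets, k))
    return context + "\n\n" + query


# Hypothetical snippets from a local project.
PROJECT_SNIPPETS = [
    "def load_config(path):\n    return json.load(open(path))",
    "def add(a, b):\n    return a + b",
]

best = retrieve("cfg = load_config(path)", PROJECT_SNIPPETS, k=1)
```

The resulting prompt would then be passed to the completion model. No embedding index is needed, which is part of what makes the approach simple to run locally.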

Measuring Catastrophic Forgetting in Cross-Lingual Transfer Paradigms: Exploring Tuning Strategies

Sep 12, 2023

Boshko Koloski, Blaž Škrlj, Marko Robnik-Šikonja, Senja Pollak

Abstract:Cross-lingual transfer is a promising technique to solve tasks in less-resourced languages. In this empirical study, we compare two fine-tuning approaches combined with zero-shot and full-shot learning approaches for large language models in a cross-lingual setting. As fine-tuning strategies, we compare parameter-efficient adapter methods with fine-tuning of all parameters. As cross-lingual transfer strategies, we compare intermediate training (IT), which uses each language sequentially, and cross-lingual validation (CLV), which uses a target language already in the validation phase of fine-tuning. We assess the success of transfer and the extent of catastrophic forgetting in a source language due to cross-lingual transfer, i.e., how much previously acquired knowledge is lost when we learn new information in a different language. The results on two different classification problems, hate speech detection and product reviews, each containing datasets in several languages, show that the IT cross-lingual strategy outperforms CLV for the target language. Our findings indicate that, in the majority of cases, the CLV strategy demonstrates superior retention of knowledge in the base language (English) compared to the IT strategy, when evaluating catastrophic forgetting in multiple cross-lingual transfers.

One model to rule them all: ranking Slovene summarizers

Jun 20, 2023

Aleš Žagar, Marko Robnik-Šikonja

Abstract:Text summarization is an essential task in natural language processing, and researchers have developed various approaches over the years, ranging from rule-based systems to neural networks. However, there is no single model or approach that performs well on every type of text. We propose a system that recommends the most suitable summarization model for a given text. The proposed system employs a fully connected neural network that analyzes the input content and predicts which summarizer should score the best in terms of ROUGE score for a given input. The meta-model selects among four different summarization models, developed for the Slovene language, using different properties of the input, in particular its Doc2Vec document representation. The four Slovene summarization models deal with different challenges associated with text summarization in a less-resourced language. We evaluate the proposed SloMetaSum model performance automatically and parts of it manually. The results show that the system successfully automates the step of manually selecting the best model.

Detection of depression on social networks using transformers and ensembles

May 09, 2023

Ilija Tavchioski, Marko Robnik-Šikonja, Senja Pollak

Abstract:As the impact of technology on our lives increases, we witness increased use of social media, which has become an essential tool not only for communication but also for sharing our thoughts and feelings with the community. This can also be observed for people with mental health disorders such as depression, who use social media to express their thoughts and ask for help. This opens the possibility of automatically processing social media posts and detecting signs of depression. We build several classifiers based on large pre-trained language models for depression detection from social media posts. Besides fine-tuning BERT, RoBERTa, BERTweet, and MentalBERT, we also construct two types of ensembles. We analyze the performance of our models on two datasets of posts from the social platforms Reddit and Twitter, and also investigate the performance of transfer learning across the two datasets. The results show that transformer ensembles improve over the single transformer-based classifiers.
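
The abstract does not name the two ensemble types. As a hedged illustration only, two common ways of combining classifier outputs are soft voting (averaging class probabilities) and majority voting (counting each model's top class); the sketch below assumes these two and uses invented probabilities for three hypothetical models.

```python
from collections import Counter


def soft_vote(prob_lists: list[list[float]]) -> int:
    """Average class probabilities over models and return the best class index."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)


def majority_vote(prob_lists: list[list[float]]) -> int:
    """Let each model vote for its top class and return the most common vote."""
    votes = Counter(max(range(len(p)), key=p.__getitem__) for p in prob_lists)
    return votes.most_common(1)[0][0]


# Invented outputs of three models for one post:
# index 0 = "not depressed", index 1 = "depressed".
MODEL_PROBS = [
    [0.90, 0.10],  # a confident model
    [0.40, 0.60],
    [0.45, 0.55],
]

soft_label = soft_vote(MODEL_PROBS)
majority_label = majority_vote(MODEL_PROBS)
```

On this input the two schemes disagree: one confident model dominates the averaged probabilities, while two weakly confident models win the majority vote. This is why the choice of ensemble type can matter.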

Feature construction using explanations of individual predictions

Jan 23, 2023

Boštjan Vouk, Matej Guid, Marko Robnik-Šikonja

Abstract:Feature construction can contribute to comprehensibility and performance of machine learning models. Unfortunately, it usually requires exhaustive search in the attribute space or time-consuming human involvement to generate meaningful features. We propose a novel heuristic approach for reducing the search space based on aggregation of instance-based explanations of predictive models. The proposed Explainable Feature Construction (EFC) methodology identifies groups of co-occurring attributes exposed by popular explanation methods, such as IME and SHAP. We empirically show that reducing the search to these groups significantly reduces the time of feature construction using logical, relational, Cartesian, numerical, and threshold num-of-N and X-of-N constructive operators. An analysis on 10 transparent synthetic datasets shows that EFC effectively identifies informative groups of attributes and constructs relevant features. Using 30 real-world classification datasets, we show significant improvements in classification accuracy for several classifiers and demonstrate the feasibility of the proposed feature construction even for large datasets. Finally, EFC generated interpretable features on a real-world problem from the financial industry, which were confirmed by a domain expert.

* Engineering Applications of Artificial Intelligence 120 (2023) 105823
* 54 pages, 10 figures, 22 tables

Unified Question Answering in Slovene

Nov 16, 2022

Katja Logar, Marko Robnik-Šikonja

Abstract:Question answering is one of the most challenging tasks in language understanding. Most approaches are developed for English, while less-resourced languages are much less researched. We adapt a successful English question-answering approach, called UnifiedQA, to the less-resourced Slovene language. Our adaptation uses the encoder-decoder transformer SloT5 and mT5 models to handle four question-answering formats: yes/no, multiple-choice, abstractive, and extractive. We use existing Slovene adaptations of four datasets, and machine translate the MCTest dataset. We show that a general model can answer questions in different formats at least as well as specialized models. The results are further improved using cross-lingual transfer from English. While we produce state-of-the-art results for Slovene, the performance still lags behind English.

* 4 pages, published in Proceedings of the 25th International Multiconference INFORMATION SOCIETY - IS 2022, Volume A - Slovenian Conference on Artificial Intelligence SCAI 2022, Ljubljana, 2022, pp. 23-26
