Peter Rupnik

Mići Princ -- A Little Boy Teaching Speech Technologies the Chakavian Dialect

Feb 03, 2026
Via arXiv

The Growing Gains and Pains of Iterative Web Corpora Crawling: Insights from South Slavic CLASSLA-web 2.0 Corpora

Jan 16, 2026
Via arXiv

State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting?

Nov 11, 2025
Via arXiv

Identifying Primary Stress Across Related Languages and Dialects with Transformer-based Speech Encoder Models

May 30, 2025
Via arXiv

Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining

Apr 08, 2024
Via arXiv

Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages

Mar 13, 2024
Via arXiv

The ParlaSent multilingual training dataset for sentiment identification in parliamentary proceedings

Sep 18, 2023
Via arXiv

The ParlaSent-BCS dataset of sentiment-annotated parliamentary debates from Bosnia-Herzegovina, Croatia, and Serbia

Jun 02, 2022
Via arXiv

The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild

Jan 11, 2022
Via arXiv