Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Gerasimos Spanakis

You Are What You Train: Effects of Data Composition on Training Context-aware Machine Translation Models

Sep 17, 2025

Paweł Mąka, Yusuf Can Semerci, Jan Scholtes, Gerasimos Spanakis

Abstract:Achieving human-level translations requires leveraging context to ensure coherence and handle complex phenomena like pronoun disambiguation. Sparsity of contextually rich examples in the standard training data has been hypothesized as the reason for the difficulty of context utilization. In this work, we systematically validate this claim in both single- and multilingual settings by constructing training datasets with a controlled proportions of contextually relevant examples. We demonstrate a strong association between training data sparsity and model performance confirming sparsity as a key bottleneck. Importantly, we reveal that improvements in one contextual phenomenon do no generalize to others. While we observe some cross-lingual transfer, it is not significantly higher between languages within the same sub-family. Finally, we propose and empirically evaluate two training strategies designed to leverage the available data. These strategies improve context utilization, resulting in accuracy gains of up to 6 and 8 percentage points on the ctxPro evaluation in single- and multilingual settings respectively.

* EMNLP 2025 main conference

Via

Access Paper or Ask Questions

Are Companies Taking AI Risks Seriously? A Systematic Analysis of Companies' AI Risk Disclosures in SEC 10-K forms

Aug 26, 2025

Lucas G. Uberti-Bona Marin, Bram Rijsbosch, Gerasimos Spanakis, Konrad Kollnig

Abstract:As Artificial Intelligence becomes increasingly central to corporate strategies, concerns over its risks are growing too. In response, regulators are pushing for greater transparency in how companies identify, report and mitigate AI-related risks. In the US, the Securities and Exchange Commission (SEC) repeatedly warned companies to provide their investors with more accurate disclosures of AI-related risks; recent enforcement and litigation against companies' misleading AI claims reinforce these warnings. In the EU, new laws - like the AI Act and Digital Services Act - introduced additional rules on AI risk reporting and mitigation. Given these developments, it is essential to examine if and how companies report AI-related risks to the public. This study presents the first large-scale systematic analysis of AI risk disclosures in SEC 10-K filings, which require public companies to report material risks to their company. We analyse over 30,000 filings from more than 7,000 companies over the past five years, combining quantitative and qualitative analysis. Our findings reveal a sharp increase in the companies that mention AI risk, up from 4% in 2020 to over 43% in the most recent 2024 filings. While legal and competitive AI risks are the most frequently mentioned, we also find growing attention to societal AI risks, such as cyberattacks, fraud, and technical limitations of AI systems. However, many disclosures remain generic or lack details on mitigation strategies, echoing concerns raised recently by the SEC about the quality of AI-related risk reporting. To support future research, we publicly release a web-based tool for easily extracting and analysing keyword-based disclosures across SEC filings.

* To be published in the ECML PKDD SoGood (Data Science for Social Good) workshop proceedings

Via

Access Paper or Ask Questions

Dutch CrowS-Pairs: Adapting a Challenge Dataset for Measuring Social Biases in Language Models for Dutch

Jul 22, 2025

Elza Strazda, Gerasimos Spanakis

Abstract:Warning: This paper contains explicit statements of offensive stereotypes which might be upsetting. Language models are prone to exhibiting biases, further amplifying unfair and harmful stereotypes. Given the fast-growing popularity and wide application of these models, it is necessary to ensure safe and fair language models. As of recent considerable attention has been paid to measuring bias in language models, yet the majority of studies have focused only on English language. A Dutch version of the US-specific CrowS-Pairs dataset for measuring bias in Dutch language models is introduced. The resulting dataset consists of 1463 sentence pairs that cover bias in 9 categories, such as Sexual orientation, Gender and Disability. The sentence pairs are composed of contrasting sentences, where one of the sentences concerns disadvantaged groups and the other advantaged groups. Using the Dutch CrowS-Pairs dataset, we show that various language models, BERTje, RobBERT, multilingual BERT, GEITje and Mistral-7B exhibit substantial bias across the various bias categories. Using the English and French versions of the CrowS-Pairs dataset, bias was evaluated in English (BERT and RoBERTa) and French (FlauBERT and CamemBERT) language models, and it was shown that English models exhibit the most bias, whereas Dutch models the least amount of bias. Additionally, results also indicate that assigning a persona to a language model changes the level of bias it exhibits. These findings highlight the variability of bias across languages and contexts, suggesting that cultural and linguistic factors play a significant role in shaping model biases.

* 10 pages, accepted at RANLP 2025 data and code here: https://github.com/jerryspan/Dutch-CrowS-Pairs

Via

Access Paper or Ask Questions

Computational Studies in Influencer Marketing: A Systematic Literature Review

Jun 17, 2025

Haoyang Gui, Thales Bertaglia, Catalina Goanta, Gerasimos Spanakis

Abstract:Influencer marketing has become a crucial feature of digital marketing strategies. Despite its rapid growth and algorithmic relevance, the field of computational studies in influencer marketing remains fragmented, especially with limited systematic reviews covering the computational methodologies employed. This makes overarching scientific measurements in the influencer economy very scarce, to the detriment of interested stakeholders outside of platforms themselves, such as regulators, but also researchers from other fields. This paper aims to provide an overview of the state of the art of computational studies in influencer marketing by conducting a systematic literature review (SLR) based on the PRISMA model. The paper analyses 69 studies to identify key research themes, methodologies, and future directions in this research field. The review identifies four major research themes: Influencer identification and characterisation, Advertising strategies and engagement, Sponsored content analysis and discovery, and Fairness. Methodologically, the studies are categorised into machine learning-based techniques (e.g., classification, clustering) and non-machine-learning-based techniques (e.g., statistical analysis, network analysis). Key findings reveal a strong focus on optimising commercial outcomes, with limited attention to regulatory compliance and ethical considerations. The review highlights the need for more nuanced computational research that incorporates contextual factors such as language, platform, and industry type, as well as improved model explainability and dataset reproducibility. The paper concludes by proposing a multidisciplinary research agenda that emphasises the need for further links to regulation and compliance technology, finer granularity in analysis, and the development of standardised datasets.

* journal submission, under review

Via

Access Paper or Ask Questions

A Representation Level Analysis of NMT Model Robustness to Grammatical Errors

May 27, 2025

Abderrahmane Issam, Yusuf Can Semerci, Jan Scholtes, Gerasimos Spanakis

Abstract:Understanding robustness is essential for building reliable NLP systems. Unfortunately, in the context of machine translation, previous work mainly focused on documenting robustness failures or improving robustness. In contrast, we study robustness from a model representation perspective by looking at internal model representations of ungrammatical inputs and how they evolve through model layers. For this purpose, we perform Grammatical Error Detection (GED) probing and representational similarity analysis. Our findings indicate that the encoder first detects the grammatical error, then corrects it by moving its representation toward the correct form. To understand what contributes to this process, we turn to the attention mechanism where we identify what we term Robustness Heads. We find that Robustness Heads attend to interpretable linguistic units when responding to grammatical errors, and that when we fine-tune models for robustness, they tend to rely more on Robustness Heads for updating the ungrammatical word representation.

* ACL 2025 Findings

Via

Access Paper or Ask Questions

MATCHED: Multimodal Authorship-Attribution To Combat Human Trafficking in Escort-Advertisement Data

Dec 18, 2024

Vageesh Saxena, Benjamin Bashpole, Gijs Van Dijck, Gerasimos Spanakis

Figure 1 for MATCHED: Multimodal Authorship-Attribution To Combat Human Trafficking in Escort-Advertisement Data

Figure 2 for MATCHED: Multimodal Authorship-Attribution To Combat Human Trafficking in Escort-Advertisement Data

Figure 3 for MATCHED: Multimodal Authorship-Attribution To Combat Human Trafficking in Escort-Advertisement Data

Figure 4 for MATCHED: Multimodal Authorship-Attribution To Combat Human Trafficking in Escort-Advertisement Data

Abstract:Human trafficking (HT) remains a critical issue, with traffickers increasingly leveraging online escort advertisements (ads) to advertise victims anonymously. Existing detection methods, including Authorship Attribution (AA), often center on text-based analyses and neglect the multimodal nature of online escort ads, which typically pair text with images. To address this gap, we introduce MATCHED, a multimodal dataset of 27,619 unique text descriptions and 55,115 unique images collected from the Backpage escort platform across seven U.S. cities in four geographical regions. Our study extensively benchmarks text-only, vision-only, and multimodal baselines for vendor identification and verification tasks, employing multitask (joint) training objectives that achieve superior classification and retrieval performance on in-distribution and out-of-distribution (OOD) datasets. Integrating multimodal features further enhances this performance, capturing complementary patterns across text and images. While text remains the dominant modality, visual data adds stylistic cues that enrich model performance. Moreover, text-image alignment strategies like CLIP and BLIP2 struggle due to low semantic overlap and vague connections between the modalities of escort ads, with end-to-end multimodal training proving more robust. Our findings emphasize the potential of multimodal AA (MAA) to combat HT, providing LEAs with robust tools to link ads and disrupt trafficking networks.

* 40 pages

Via

Access Paper or Ask Questions

Analyzing the Attention Heads for Pronoun Disambiguation in Context-aware Machine Translation Models

Dec 15, 2024

Paweł Mąka, Yusuf Can Semerci, Jan Scholtes, Gerasimos Spanakis

Abstract:In this paper, we investigate the role of attention heads in Context-aware Machine Translation models for pronoun disambiguation in the English-to-German and English-to-French language directions. We analyze their influence by both observing and modifying the attention scores corresponding to the plausible relations that could impact a pronoun prediction. Our findings reveal that while some heads do attend the relations of interest, not all of them influence the models' ability to disambiguate pronouns. We show that certain heads are underutilized by the models, suggesting that model performance could be improved if only the heads would attend one of the relations more strongly. Furthermore, we fine-tune the most promising heads and observe the increase in pronoun disambiguation accuracy of up to 5 percentage points which demonstrates that the improvements in performance can be solidified into the models' parameters.

* COLING 2025

Via

Access Paper or Ask Questions

LegalLens Shared Task 2024: Legal Violation Identification in Unstructured Text

Oct 15, 2024

Ben Hagag, Liav Harpaz, Gil Semo, Dor Bernsohn, Rohit Saha, Pashootan Vaezipoor, Kyryl Truskovskyi, Gerasimos Spanakis

Figure 1 for LegalLens Shared Task 2024: Legal Violation Identification in Unstructured Text

Figure 2 for LegalLens Shared Task 2024: Legal Violation Identification in Unstructured Text

Figure 3 for LegalLens Shared Task 2024: Legal Violation Identification in Unstructured Text

Figure 4 for LegalLens Shared Task 2024: Legal Violation Identification in Unstructured Text

Abstract:This paper presents the results of the LegalLens Shared Task, focusing on detecting legal violations within text in the wild across two sub-tasks: LegalLens-NER for identifying legal violation entities and LegalLens-NLI for associating these violations with relevant legal contexts and affected individuals. Using an enhanced LegalLens dataset covering labor, privacy, and consumer protection domains, 38 teams participated in the task. Our analysis reveals that while a mix of approaches was used, the top-performing teams in both tasks consistently relied on fine-tuning pre-trained language models, outperforming legal-specific models and few-shot methods. The top-performing team achieved a 7.11% improvement in NER over the baseline, while NLI saw a more marginal improvement of 5.7%. Despite these gains, the complexity of legal texts leaves room for further advancements.

Via

Access Paper or Ask Questions

Know When to Fuse: Investigating Non-English Hybrid Retrieval in the Legal Domain

Sep 02, 2024

Antoine Louis, Gijs van Dijck, Gerasimos Spanakis

Figure 1 for Know When to Fuse: Investigating Non-English Hybrid Retrieval in the Legal Domain

Figure 2 for Know When to Fuse: Investigating Non-English Hybrid Retrieval in the Legal Domain

Figure 3 for Know When to Fuse: Investigating Non-English Hybrid Retrieval in the Legal Domain

Figure 4 for Know When to Fuse: Investigating Non-English Hybrid Retrieval in the Legal Domain

Abstract:Hybrid search has emerged as an effective strategy to offset the limitations of different matching paradigms, especially in out-of-domain contexts where notable improvements in retrieval quality have been observed. However, existing research predominantly focuses on a limited set of retrieval methods, evaluated in pairs on domain-general datasets exclusively in English. In this work, we study the efficacy of hybrid search across a variety of prominent retrieval models within the unexplored field of law in the French language, assessing both zero-shot and in-domain scenarios. Our findings reveal that in a zero-shot context, fusing different domain-general models consistently enhances performance compared to using a standalone model, regardless of the fusion method. Surprisingly, when models are trained in-domain, we find that fusion generally diminishes performance relative to using the best single system, unless fusing scores with carefully tuned weights. These novel insights, among others, expand the applicability of prior findings across a new field and language, and contribute to a deeper understanding of hybrid search in non-English specialized domains.

* Under review

Via

Access Paper or Ask Questions

Fixed and Adaptive Simultaneous Machine Translation Strategies Using Adapters

Jul 18, 2024

Abderrahmane Issam, Yusuf Can Semerci, Jan Scholtes, Gerasimos Spanakis

Figure 1 for Fixed and Adaptive Simultaneous Machine Translation Strategies Using Adapters

Figure 2 for Fixed and Adaptive Simultaneous Machine Translation Strategies Using Adapters

Figure 3 for Fixed and Adaptive Simultaneous Machine Translation Strategies Using Adapters

Figure 4 for Fixed and Adaptive Simultaneous Machine Translation Strategies Using Adapters

Abstract:Simultaneous machine translation aims at solving the task of real-time translation by starting to translate before consuming the full input, which poses challenges in terms of balancing quality and latency of the translation. The wait-$k$ policy offers a solution by starting to translate after consuming $k$ words, where the choice of the number $k$ directly affects the latency and quality. In applications where we seek to keep the choice over latency and quality at inference, the wait-$k$ policy obliges us to train more than one model. In this paper, we address the challenge of building one model that can fulfil multiple latency levels and we achieve this by introducing lightweight adapter modules into the decoder. The adapters are trained to be specialized for different wait-$k$ values and compared to other techniques they offer more flexibility to allow for reaping the benefits of parameter sharing and minimizing interference. Additionally, we show that by combining with an adaptive strategy, we can further improve the results. Experiments on two language directions show that our method outperforms or competes with other strong baselines on most latency values.

* Accepted at IWSLT 2024

Via

Access Paper or Ask Questions