Wojciech Kusa

ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT

Jun 05, 2025

Mikołaj Pokrywka, Wojciech Kusa, Mieszko Rutkowski, Mikołaj Koszowski

Abstract: Neural Machine Translation (NMT) has improved translation by using Transformer-based models, but it still struggles with word ambiguity and context. This problem is especially important in domain-specific applications, which often have problems with unclear sentences or poor data quality. Our research explores how adding information to models can improve translations in the context of e-commerce data. To this end, we create ConECT -- a new Czech-to-Polish e-commerce product translation dataset coupled with images and product metadata consisting of 11,400 sentence pairs. We then investigate and compare different methods that are applicable to context-aware translation. We test a vision-language model (VLM), finding that visual context aids translation quality. Additionally, we explore the incorporation of contextual information into text-to-text models, such as the product's category path or image descriptions. The results of our study demonstrate that the incorporation of contextual information leads to an improvement in the quality of machine translation. We make the new dataset publicly available.

* Accepted at ACL 2025 (The 63rd Annual Meeting of the Association for Computational Linguistics)

ASPIRE: Assistive System for Performance Evaluation in IR

Dec 20, 2024

Georgios Peikos, Wojciech Kusa, Symeon Symeonidis

Abstract: Information Retrieval (IR) evaluation involves far more complexity than merely presenting performance measures in a table. Researchers often need to compare multiple models across various dimensions, such as the Precision-Recall trade-off and response time, to understand the reasons behind the varying performance of specific queries for different models. We introduce ASPIRE (Assistive System for Performance Evaluation in IR), a visual analytics tool designed to address these complexities by providing an extensive and user-friendly interface for in-depth analysis of IR experiments. ASPIRE supports four key aspects of IR experiment evaluation and analysis: single/multi-experiment comparisons, query-level analysis, query characteristics-performance interplay, and collection-based retrieval analysis. We showcase the functionality of ASPIRE using the TREC Clinical Trials collection. ASPIRE is an open-source toolkit available online: https://github.com/GiorgosPeikos/ASPIRE

* Accepted as a demo paper at the 47th European Conference on Information Retrieval (ECIR)

A Reproducibility and Generalizability Study of Large Language Models for Query Generation

Nov 22, 2024

Moritz Staudinger, Wojciech Kusa, Florina Piroi, Aldo Lipani, Allan Hanbury

Abstract: Systematic literature reviews (SLRs) are a cornerstone of academic research, yet they are often labour-intensive and time-consuming due to the detailed literature curation process. The advent of generative AI and large language models (LLMs) promises to revolutionize this process by assisting researchers in several tedious tasks, one of them being the generation of effective Boolean queries that will select the publications to consider including in a review. This paper presents an extensive study of Boolean query generation using LLMs for systematic reviews, reproducing and extending the work of Wang et al. and Alaniz et al. Our study investigates the replicability and reliability of results achieved using ChatGPT and compares its performance with open-source alternatives like Mistral and Zephyr to provide a more comprehensive analysis of LLMs for query generation. Therefore, we implemented a pipeline, which automatically creates a Boolean query for a given review topic by using a previously defined LLM, retrieves all documents for this query from the PubMed database and then evaluates the results. With this pipeline we first assess whether the results obtained using ChatGPT for query generation are reproducible and consistent. We then generalize our results by analyzing and evaluating open-source models and evaluating their efficacy in generating Boolean queries. Finally, we conduct a failure analysis to identify and discuss the limitations and shortcomings of using LLMs for Boolean query generation. This examination helps to understand the gaps and potential areas for improvement in the application of LLMs to information retrieval tasks. Our findings highlight the strengths, limitations, and potential of LLMs in the domain of information retrieval and literature review automation.
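The final "evaluate the results" step of such a pipeline can be illustrated with a minimal sketch. The abstract does not name the measures used, so set-based precision, recall and F1 are an assumption here, and the function name `evaluate_query` and the example identifiers are hypothetical.

```python
def evaluate_query(retrieved_ids, relevant_ids):
    """Score one Boolean query by comparing the documents it retrieved
    with the documents known to be included in the review.

    Assumption: set-based precision, recall and F1; the abstract does not
    state which measures the authors actually report.
    """
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = retrieved & relevant  # relevant documents the query found

    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical identifiers, for illustration only.
scores = evaluate_query(["1", "2", "3", "4"], ["2", "4", "5"])
print(scores)
```

Running the same topic through the function several times, once per generated query, gives a simple way to inspect how much the scores vary between LLM runs.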

AustroTox: A Dataset for Target-Based Austrian German Offensive Language Detection

Jun 12, 2024

Pia Pachinger, Janis Goldzycher, Anna Maria Planitzer, Wojciech Kusa, Allan Hanbury, Julia Neidhardt

Abstract: Model interpretability in toxicity detection greatly profits from token-level annotations. However, currently such annotations are only available in English. We introduce a dataset annotated for offensive language detection sourced from a news forum, notable for its incorporation of the Austrian German dialect, comprising 4,562 user comments. In addition to binary offensiveness classification, we identify spans within each comment constituting vulgar language or representing targets of offensive statements. We evaluate fine-tuned language models as well as large language models in a zero- and few-shot fashion. The results indicate that while fine-tuned models excel in detecting linguistic peculiarities such as vulgar dialect, large language models demonstrate superior performance in detecting offensiveness in AustroTox. We publish the data and code.

* Accepted to Findings of the Association for Computational Linguistics: ACL 2024

Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order

Mar 30, 2024

Taishi Nakamura, Mayank Mishra, Simone Tedeschi, Yekun Chai, Jason T Stillerman, Felix Friedrich, Prateek Yadav, Tanmay Laud, Vu Minh Chien, Terry Yue Zhuo (+35 more)

Abstract: Pretrained language models underpin several AI applications, but their high computational cost for training limits accessibility. Initiatives such as BLOOM and StarCoder aim to democratize access to pretrained models for collaborative community development. However, such existing models face challenges: limited multilingual capabilities, catastrophic forgetting caused by continual pretraining (whereas pretraining from scratch is computationally expensive), and compliance with AI safety and development laws. This paper presents Aurora-M, a 15B parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435 billion additional tokens, Aurora-M surpasses 2 trillion tokens in total training token count. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, thus aligning its development not only with conventional red-teaming considerations, but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. Aurora-M is rigorously evaluated across various tasks and languages, demonstrating robustness against catastrophic forgetting and outperforming alternatives in multilingual settings, particularly in safety evaluations. To promote responsible open-source LLM development, Aurora-M and its variants are released at https://huggingface.co/collections/aurora-m/aurora-m-models-65fdfdff62471e09812f5407.

* Preprint

CSMeD: Bridging the Dataset Gap in Automated Citation Screening for Systematic Literature Reviews

Nov 21, 2023

Wojciech Kusa, Oscar E. Mendoza, Matthias Samwald, Petr Knoth, Allan Hanbury

Abstract: Systematic literature reviews (SLRs) play an essential role in summarising, synthesising and validating scientific evidence. In recent years, there has been a growing interest in using machine learning techniques to automate the identification of relevant studies for SLRs. However, the lack of standardised evaluation datasets makes comparing the performance of such automated literature screening systems difficult. In this paper, we analyse the citation screening evaluation datasets, revealing that many of the available datasets are either too small, suffer from data leakage or have limited applicability to systems treating automated literature screening as a classification task, as opposed to, for example, a retrieval or question-answering task. To address these challenges, we introduce CSMeD, a meta-dataset consolidating nine publicly released collections, providing unified access to 325 SLRs from the fields of medicine and computer science. CSMeD serves as a comprehensive resource for training and evaluating the performance of automated citation screening models. Additionally, we introduce CSMeD-FT, a new dataset designed explicitly for evaluating the full text publication screening task. To demonstrate the utility of CSMeD, we conduct experiments and establish baselines on new datasets.

* Accepted at NeurIPS 2023 Datasets and Benchmarks Track

CRUISE-Screening: Living Literature Reviews Toolbox

Sep 04, 2023

Wojciech Kusa, Petr Knoth, Allan Hanbury

Abstract: Keeping up with research and finding related work is still a time-consuming task for academics. Researchers sift through thousands of studies to identify a few relevant ones. Automation techniques can help by increasing the efficiency and effectiveness of this task. To this end, we developed CRUISE-Screening, a web-based application for conducting living literature reviews - a type of literature review that is continuously updated to reflect the latest research in a particular field. CRUISE-Screening is connected to several search engines via an API, which allows for updating the search results periodically. Moreover, it can facilitate the process of screening for relevant publications by using text classification and question answering models. CRUISE-Screening can be used both by researchers conducting literature reviews and by those working on automating the citation screening process to validate their algorithms. The application is open-source: https://github.com/ProjectDoSSIER/cruise-screening, and a demo is available under this URL: https://citation-screening.ec.tuwien.ac.at. We discuss the limitations of our tool in Appendix A.

* Paper accepted at CIKM 2023. The arXiv version has an extra section about limitations in the Appendix that is not present in the ACM version

Effective Matching of Patients to Clinical Trials using Entity Extraction and Neural Re-ranking

Jul 01, 2023

Wojciech Kusa, Óscar E. Mendoza, Petr Knoth, Gabriella Pasi, Allan Hanbury

Abstract: Clinical trials (CTs) often fail due to inadequate patient recruitment. This paper tackles the challenges of CT retrieval by presenting an approach that addresses the patient-to-trials paradigm. Our approach involves two key components in a pipeline-based model: (i) a data enrichment technique for enhancing both queries and documents during the first retrieval stage, and (ii) a novel re-ranking schema that uses a Transformer network in a setup adapted to this task by leveraging the structure of the CT documents. We use named entity recognition and negation detection in both patient description and the eligibility section of CTs. We further classify patient descriptions and CT eligibility criteria into current, past, and family medical conditions. This extracted information is used to boost the importance of disease and drug mentions in both query and index for lexical retrieval. Furthermore, we propose a two-step training schema for the Transformer network used to re-rank the results from the lexical retrieval. The first step focuses on matching patient information with the descriptive sections of trials, while the second step aims to determine eligibility by matching patient information with the criteria section. Our findings indicate that the inclusion criteria section of the CT has a great influence on the relevance score in lexical models, and that the enrichment techniques for queries and documents improve the retrieval of relevant trials. The re-ranking strategy, based on our training schema, consistently enhances CT retrieval and shows improved performance by 15% in terms of precision at retrieving eligible trials. The results of our experiments suggest the benefit of making use of extracted entities. Moreover, our proposed re-ranking schema shows promising effectiveness compared to larger neural models, even with limited training data.

* Under review

Outcome-based Evaluation of Systematic Review Automation

Jun 30, 2023

Wojciech Kusa, Guido Zuccon, Petr Knoth, Allan Hanbury

Abstract: Current methods of evaluating search strategies and automated citation screening for systematic literature reviews typically rely on counting the number of relevant and not relevant publications. This established practice, however, does not accurately reflect the reality of conducting a systematic review, because not all included publications have the same influence on the final outcome of the systematic review. More specifically, if an important publication gets excluded or included, this might significantly change the overall review outcome, while not including or excluding less influential studies may only have a limited impact. However, in terms of evaluation measures, all inclusion and exclusion decisions are treated equally and, therefore, failing to retrieve publications with little to no impact on the review outcome leads to the same decrease in recall as failing to retrieve crucial publications. We propose a new evaluation framework that takes into account the impact of the reported study on the overall systematic review outcome. We demonstrate the framework by extracting review meta-analysis data and estimating outcome effects using predictions from ranking runs on systematic reviews of interventions from the CLEF TAR 2019 shared task. We further measure how close the obtained outcomes are to the outcomes of the original review if the arbitrary rankings were used. We evaluate 74 runs using the proposed framework and compare the results with those obtained using standard IR measures. We find that accounting for the difference in review outcomes leads to a different assessment of the quality of a system than if traditional evaluation measures were used. Our analysis provides new insights into the evaluation of retrieval results in the context of systematic review automation, emphasising the importance of assessing the usefulness of each document beyond binary relevance.
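The contrast between recall and review outcome can be made concrete with a small sketch. It is not the authors' framework: it assumes a fixed-effect, inverse-variance pooled estimate as the "review outcome", and the three studies with their effects and variances are invented for illustration.

```python
def recall(found, included):
    """Standard recall: every included study counts the same."""
    return len(set(found) & set(included)) / len(included)


def pooled_effect(studies, found):
    """Fixed-effect inverse-variance pooled estimate over the found studies.

    Assumption: this stands in for the review outcome; the abstract does not
    specify the estimator used in the paper.
    """
    num = den = 0.0
    for study_id in found:
        effect, variance = studies[study_id]
        weight = 1.0 / variance  # precise studies get more weight
        num += weight * effect
        den += weight
    return num / den


# Hypothetical data: study id -> (effect size, variance).
studies = {"A": (0.50, 0.01), "B": (0.10, 0.25), "C": (0.12, 0.25)}
included = list(studies)

full = pooled_effect(studies, included)
for missing in ("A", "B"):
    found = [s for s in included if s != missing]
    shift = abs(full - pooled_effect(studies, found))
    print(missing, round(recall(found, included), 3), round(shift, 3))
```

Under these assumptions, dropping either study costs the same recall, but dropping the precise study A moves the pooled estimate far more than dropping the imprecise study B.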

* Accepted at ICTIR 2023

Statute-enhanced lexical retrieval of court cases for COLIEE 2022

Apr 17, 2023

Tobias Fink, Gabor Recski, Wojciech Kusa, Allan Hanbury

Abstract: We discuss our experiments for COLIEE Task 1, a court case retrieval competition using cases from the Federal Court of Canada. During experiments on the training data we observe that passage level retrieval with rank fusion outperforms document level retrieval. By explicitly adding extracted statute information to the queries and documents we can further improve the results. We submit two passage level runs to the competition, which achieve high recall but low precision.
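Passage-level retrieval with rank fusion can be sketched as follows: each passage of the query case produces its own ranking of candidate cases, and the rankings are merged into one. The abstract does not say which fusion method was used, so reciprocal rank fusion (RRF) is swapped in here as one common choice; the constant `k=60` and the case identifiers are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into a single ranking.

    Each document scores 1 / (k + rank) for every list it appears in;
    ties are broken alphabetically so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# One hypothetical ranking per passage of the query case.
per_passage_rankings = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
]
print(reciprocal_rank_fusion(per_passage_rankings))  # ['b', 'a', 'c', 'd']
```

A case that several passages rank highly rises to the top of the fused list, even if no single passage ranks it first.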

* Sixteenth International Workshop on Juris-informatics (JURISIN). 2022
