Samir Abdaljalil

Evaluating Multilingual and Code-Switched Alignment in LLMs via Synthetic Natural Language Inference

Aug 20, 2025

Samir Abdaljalil, Erchin Serpedin, Khalid Qaraqe, Hasan Kurban

Abstract: Large language models (LLMs) are increasingly applied in multilingual contexts, yet their capacity for consistent, logically grounded alignment across languages remains underexplored. We present a controlled evaluation framework for multilingual natural language inference (NLI) that generates synthetic, logic-based premise-hypothesis pairs and translates them into a typologically diverse set of languages. This design enables precise control over semantic relations and allows testing in both monolingual and mixed-language (code-switched) conditions. Surprisingly, code-switching does not degrade, and can even improve, performance, suggesting that translation-induced lexical variation may serve as a regularization signal. We validate semantic preservation through embedding-based similarity analyses and cross-lingual alignment visualizations, confirming the fidelity of translated pairs. Our findings expose both the potential and the brittleness of current LLM cross-lingual reasoning, and identify code-switching as a promising lever for improving multilingual robustness. Code available at: https://github.com/KurbanIntelligenceLab/nli-stress-testing

* Under review

Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models

Jun 08, 2025

Samir Abdaljalil, Hasan Kurban, Khalid Qaraqe, Erchin Serpedin

Abstract: Large language models (LLMs) have shown strong performance across natural language reasoning tasks, yet their reasoning processes remain brittle and difficult to interpret. Prompting techniques like Chain-of-Thought (CoT) enhance reliability by eliciting intermediate reasoning steps or aggregating multiple outputs. However, they lack mechanisms for enforcing logical structure and assessing internal coherence. We introduce Theorem-of-Thought (ToTh), a novel framework that models reasoning as collaboration among three parallel agents, each simulating a distinct mode of inference: abductive, deductive, and inductive. Each agent produces a reasoning trace, which is structured into a formal reasoning graph. To evaluate consistency, we apply Bayesian belief propagation guided by natural language inference (NLI), assigning confidence scores to each step. The most coherent graph is selected to derive the final answer. Experiments on symbolic (WebOfLies) and numerical (MultiArith) reasoning benchmarks show that ToTh consistently outperforms CoT, Self-Consistency, and CoT-Decoding across multiple LLMs, while producing interpretable and logically grounded reasoning chains. Our findings suggest a promising direction for building more robust and cognitively inspired LLM reasoning. The implementation is available at https://github.com/KurbanIntelligenceLab/theorem-of-thought.

HalluVerse25: Fine-grained Multilingual Benchmark Dataset for LLM Hallucinations

Mar 10, 2025

Samir Abdaljalil, Hasan Kurban, Erchin Serpedin

Abstract: Large Language Models (LLMs) are increasingly used in various contexts, yet remain prone to generating non-factual content, commonly referred to as "hallucinations". The literature categorizes hallucinations into several types, including entity-level, relation-level, and sentence-level hallucinations. However, existing hallucination datasets often fail to capture fine-grained hallucinations in multilingual settings. In this work, we introduce HalluVerse25, a multilingual LLM hallucination dataset that categorizes fine-grained hallucinations in English, Arabic, and Turkish. Our dataset construction pipeline uses an LLM to inject hallucinations into factual biographical sentences, followed by a rigorous human annotation process to ensure data quality. We evaluate several LLMs on HalluVerse25, providing valuable insights into how proprietary models perform in detecting LLM-generated hallucinations across different contexts.

SINdex: Semantic INconsistency Index for Hallucination Detection in LLMs

Mar 07, 2025

Samir Abdaljalil, Hasan Kurban, Parichit Sharma, Erchin Serpedin, Rachad Atat

Abstract: Large language models (LLMs) are increasingly deployed across diverse domains, yet they are prone to generating factually incorrect outputs, commonly known as "hallucinations." Among existing mitigation strategies, uncertainty-based methods are particularly attractive due to their ease of implementation, independence from external data, and compatibility with standard LLMs. In this work, we introduce a novel and scalable uncertainty-based semantic clustering framework for automated hallucination detection. Our approach leverages sentence embeddings and hierarchical clustering alongside a newly proposed inconsistency measure, SINdex, to yield more homogeneous clusters and more accurate detection of hallucination phenomena across various LLMs. Evaluations on prominent open- and closed-book QA datasets demonstrate that our method achieves AUROC improvements of up to 9.3% over state-of-the-art techniques. Extensive ablation studies further validate the effectiveness of each component in our framework.

SAFE: A Sparse Autoencoder-Based Framework for Robust Query Enrichment and Hallucination Mitigation in LLMs

Mar 04, 2025

Samir Abdaljalil, Filippo Pallucchini, Andrea Seveso, Hasan Kurban, Fabio Mercorio, Erchin Serpedin

Abstract: Despite the state-of-the-art performance of Large Language Models (LLMs), these models often suffer from hallucinations, which can undermine their performance in critical applications. In this work, we propose SAFE, a novel method for detecting and mitigating hallucinations by leveraging Sparse Autoencoders (SAEs). While hallucination detection techniques and SAEs have been explored independently, their synergistic application in a comprehensive system, particularly for hallucination-aware query enrichment, has not been fully investigated. To validate the effectiveness of SAFE, we evaluate it on two models with available SAEs across three diverse cross-domain datasets designed to assess hallucination problems. Empirical results demonstrate that SAFE consistently improves query generation accuracy and mitigates hallucinations across all datasets, achieving accuracy improvements of up to 29.45%.

ArAIEval Shared Task: Persuasion Techniques and Disinformation Detection in Arabic Text

Nov 06, 2023

Maram Hasanain, Firoj Alam, Hamdy Mubarak, Samir Abdaljalil, Wajdi Zaghouani, Preslav Nakov, Giovanni Da San Martino, Abed Alhakim Freihat

Abstract: We present an overview of the ArAIEval shared task, organized as part of the first ArabicNLP 2023 conference co-located with EMNLP 2023. ArAIEval offers two tasks over Arabic text: (i) persuasion technique detection, focusing on identifying persuasion techniques in tweets and news articles, and (ii) disinformation detection in binary and multiclass setups over tweets. A total of 20 teams participated in the final evaluation phase, with 14 and 16 teams participating in Tasks 1 and 2, respectively. Across both tasks, we observed that fine-tuning transformer models such as AraBERT was at the core of the majority of the participating systems. We provide a description of the task setup, including a description of the dataset construction and the evaluation setup. We further give a brief overview of the participating systems. All datasets and evaluation scripts from the shared task are released to the research community (https://araieval.gitlab.io/). We hope this will enable further research on these important tasks in Arabic.

* Accepted at ArabicNLP-23 (EMNLP-23), propaganda, disinformation, misinformation, fake news

LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking

Aug 09, 2023

Fahim Dalvi, Maram Hasanain, Sabri Boughorbel, Basel Mousi, Samir Abdaljalil, Nizi Nazar, Ahmed Abdelali, Shammur Absar Chowdhury, Hamdy Mubarak, Ahmed Ali(+3 more)

Abstract: The recent development and success of Large Language Models (LLMs) necessitate an evaluation of their performance across diverse NLP tasks in different languages. Although several frameworks have been developed and made publicly available, customizing them for specific tasks and datasets is often complex for users. In this study, we introduce the LLMeBench framework. Initially developed to evaluate Arabic NLP tasks using OpenAI's GPT and BLOOM models, it can be seamlessly customized for any NLP task and model, regardless of language. The framework also features zero- and few-shot learning settings. A new custom dataset can be added in less than 10 minutes, and users can use their own model API keys to evaluate the task at hand. The framework has already been tested on 31 unique NLP tasks using 53 publicly available datasets within 90 experimental setups, involving approximately 296K data points. We plan to open-source the framework for the community (https://github.com/qcri/LLMeBench/). A video demonstrating the framework is available online (https://youtu.be/FkQn4UjYA0s).

* Foundation Models, Large Language Models, NLP, ChatGPT Evaluation, LLMs Benchmark

Detecting and Reasoning of Deleted Tweets before they are Posted

May 05, 2023

Hamdy Mubarak, Samir Abdaljalil, Azza Nassar, Firoj Alam

Abstract: Social media platforms empower us in several ways, from information dissemination to consumption. While these platforms are useful in promoting citizen journalism, public awareness, etc., they also have potential for misuse. Malicious users use them to disseminate hate speech, offensive content, rumors, etc. to advance social and political agendas or to harm individuals, entities, and organizations. Oftentimes, general users unconsciously share information without verifying it, or unintentionally post harmful messages. Some of this content is deleted, either by the platform for violating its terms and policies or by the users themselves for various reasons, e.g., regret. There is a wide range of studies on characterizing, understanding, and predicting deleted content. However, studies that aim to identify the fine-grained reasons (e.g., posts are offensive, hate speech, or no identifiable reason) behind deleted content are limited. In this study, we address this gap by identifying deleted tweets, particularly within the Arabic context, and labeling them with a corresponding fine-grained disinformation category. We then develop models that can predict the likelihood of tweets being deleted, as well as the potential reasons behind deletion. Such models can help in moderating social media posts before they are even posted.

* disinformation, misinformation, fake news
