Robert Vacareanu

Online Rubrics Elicitation from Pairwise Comparisons

Oct 08, 2025

MohammadHossein Rezaei, Robert Vacareanu, Zihao Wang, Clinton Wang, Yunzhong He, Afra Feyza Akyürek

Abstract:Rubrics provide a flexible way to train LLMs on open-ended long-form answers where verifiable rewards are not applicable and human preferences provide coarse signals. Prior work shows that reinforcement learning with rubric-based rewards leads to consistent gains in LLM post-training. Most existing approaches rely on rubrics that remain static over the course of training. Such static rubrics, however, are vulnerable to reward-hacking type behaviors and fail to capture emergent desiderata that arise during training. We introduce Online Rubrics Elicitation (OnlineRubrics), a method that dynamically curates evaluation criteria in an online manner through pairwise comparisons of responses from current and reference policies. This online process enables continuous identification and mitigation of errors as training proceeds. Empirically, this approach yields consistent improvements of up to 8% over training exclusively with static rubrics across AlpacaEval, GPQA, ArenaHard as well as the validation sets of expert questions and rubrics. We qualitatively analyze the elicited criteria and identify prominent themes such as transparency, practicality, organization, and reasoning.

The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems

Mar 05, 2025

Richard Ren, Arunim Agarwal, Mantas Mazeika, Cristina Menghini, Robert Vacareanu, Brad Kenstler, Mick Yang, Isabelle Barrass, Alice Gatti, Xuwang Yin (+6 more)

Abstract:As large language models (LLMs) become more capable and agentic, the requirement for trust in their outputs grows significantly, yet at the same time concerns have been mounting that models may learn to lie in pursuit of their goals. To address these concerns, a body of work has emerged around the notion of "honesty" in LLMs, along with interventions aimed at mitigating deceptive behaviors. However, evaluations of honesty are currently highly limited, with no benchmark combining large scale and applicability to all models. Moreover, many benchmarks claiming to measure honesty in fact simply measure accuracy--the correctness of a model's beliefs--in disguise. In this work, we introduce a large-scale human-collected dataset for measuring honesty directly, allowing us to disentangle accuracy from honesty for the first time. Across a diverse set of LLMs, we find that while larger models obtain higher accuracy on our benchmark, they do not become more honest. Surprisingly, while most frontier LLMs obtain high scores on truthfulness benchmarks, we find a substantial propensity in frontier LLMs to lie when pressured to do so, resulting in low honesty scores on our benchmark. We find that simple methods, such as representation engineering interventions, can improve honesty. These results underscore the growing need for robust evaluations and effective interventions to ensure LLMs remain trustworthy.

* Website: https://www.mask-benchmark.ai

MorphNLI: A Stepwise Approach to Natural Language Inference Using Text Morphing

Feb 13, 2025

Vlad Andrei Negru, Robert Vacareanu, Camelia Lemnaru, Mihai Surdeanu, Rodica Potolea

Abstract:We introduce MorphNLI, a modular step-by-step approach to natural language inference (NLI). When classifying the premise-hypothesis pairs into {entailment, contradiction, neutral}, we use a language model to generate the necessary edits to incrementally transform (i.e., morph) the premise into the hypothesis. Then, using an off-the-shelf NLI model we track how the entailment progresses with these atomic changes, aggregating these intermediate labels into a final output. We demonstrate the advantages of our proposed method particularly in realistic cross-domain settings, where our method always outperforms strong baselines with improvements up to 12.6% (relative). Further, our proposed approach is explainable as the atomic edits can be used to understand the overall NLI label.

* 16 pages, 11 figures, 8 tables. Accepted for NAACL 2025 Findings
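
The stepwise idea in the abstract above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the abstract does not state how the intermediate labels are aggregated, so the rule below (any contradiction wins, then any neutral, otherwise entailment) is an assumption. The names `morph_nli`, `aggregate_labels`, and `toy_nli` are hypothetical; `toy_nli` merely stands in for an off-the-shelf NLI model, and the list of intermediate sentences stands in for the edits a language model would generate.

```python
from typing import Callable, List

ENTAILMENT, NEUTRAL, CONTRADICTION = "entailment", "neutral", "contradiction"


def aggregate_labels(step_labels: List[str]) -> str:
    """Combine per-step labels into one label (assumed rule, not the paper's)."""
    if CONTRADICTION in step_labels:
        return CONTRADICTION
    if NEUTRAL in step_labels:
        return NEUTRAL
    # No steps (premise equals hypothesis) or only entailment steps.
    return ENTAILMENT


def morph_nli(steps: List[str], nli: Callable[[str, str], str]) -> str:
    """Label a premise/hypothesis pair from a chain of intermediate sentences.

    steps[0] is the premise, steps[-1] is the hypothesis, and each
    neighbouring pair is assumed to differ by one atomic edit.
    """
    step_labels = [nli(before, after) for before, after in zip(steps, steps[1:])]
    return aggregate_labels(step_labels)


def toy_nli(premise: str, hypothesis: str) -> str:
    """Hypothetical stand-in for a real NLI model: dropping words entails."""
    if set(hypothesis.split()) <= set(premise.split()):
        return ENTAILMENT
    return NEUTRAL


# Premise morphed into the hypothesis by deleting one word at a time.
chain = ["a black dog runs", "a dog runs", "a dog"]
print(morph_nli(chain, toy_nli))
```

Because each step is labelled separately, the list of step labels can also be inspected to see which atomic edit changed the overall decision, which is the explainability property the abstract points to.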

When and Where Did it Happen? An Encoder-Decoder Model to Identify Scenario Context

Oct 10, 2024

Enrique Noriega-Atala, Robert Vacareanu, Salena Torres Ashton, Adarsh Pyarelal, Clayton T. Morrison, Mihai Surdeanu

Abstract:We introduce a neural architecture fine-tuned for the task of scenario context generation: the relevant location and time of an event or entity mentioned in text. Contextualizing information extraction helps to scope the validity of automated findings when aggregating them as knowledge graphs. Our approach uses a high-quality curated dataset of time and location annotations in a corpus of epidemiology papers to train an encoder-decoder architecture. We also explored the use of data augmentation techniques during training. Our findings suggest that a relatively small fine-tuned encoder-decoder model performs better than out-of-the-box LLMs and semantic role labeling parsers at accurately predicting the relevant scenario information of a particular entity or event.

* 9 pages, 7 figures

General Purpose Verification for Chain of Thought Prompting

Apr 30, 2024

Robert Vacareanu, Anurag Pratik, Evangelia Spiliopoulou, Zheng Qi, Giovanni Paolini, Neha Anna John, Jie Ma, Yassine Benajiba, Miguel Ballesteros

Abstract:Many of the recent capabilities demonstrated by Large Language Models (LLMs) arise primarily from their ability to exploit contextual information. In this paper, we explore ways to improve reasoning capabilities of LLMs through (1) exploration of different chains of thought and (2) validation of the individual steps of the reasoning process. We propose three general principles that a model should adhere to while reasoning: (i) Relevance, (ii) Mathematical Accuracy, and (iii) Logical Consistency. We apply these constraints to the reasoning steps generated by the LLM to improve the accuracy of the final generation. The constraints are applied in the form of verifiers: the model itself is asked to verify if the generated steps satisfy each constraint. To further steer the generations towards high-quality solutions, we use the perplexity of the reasoning steps as an additional verifier. We evaluate our method on 4 distinct types of reasoning tasks, spanning a total of 9 different datasets. Experiments show that our method is always better than vanilla generation, and, in 6 out of the 9 datasets, it is better than best-of-N sampling, which samples N reasoning chains and picks the lowest-perplexity generation.

* 22 pages, preprint
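
The best-of-N baseline mentioned at the end of the abstract can be written down directly. The sketch below assumes each sampled reasoning chain comes with per-token log-probabilities (natural log) from the model; `perplexity`, `best_of_n`, and the sample data are illustrative names and made-up values, not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity is exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def best_of_n(chains: List[Tuple[str, Sequence[float]]]) -> str:
    """Return the text of the sampled chain with the lowest perplexity."""
    text, _ = min(chains, key=lambda chain: perplexity(chain[1]))
    return text


# Made-up samples: (reasoning chain, per-token log-probabilities).
samples = [
    ("chain A", [-0.9, -1.2, -0.7]),
    ("chain B", [-0.2, -0.3, -0.1]),
    ("chain C", [-2.0, -0.1, -1.5]),
]
print(best_of_n(samples))  # chain B has the highest mean log-probability
```

Using the mean rather than the sum of log-probabilities keeps the comparison from simply favouring shorter chains.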

From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples

Apr 11, 2024

Robert Vacareanu, Vlad-Andrei Negru, Vasile Suciu, Mihai Surdeanu

Abstract:We analyze how well pre-trained large language models (e.g., Llama2, GPT-4, Claude 3, etc.) can do linear and non-linear regression when given in-context examples, without any additional training or gradient updates. Our findings reveal that several large language models (e.g., GPT-4, Claude 3) are able to perform regression tasks with a performance rivaling (or even outperforming) that of traditional supervised methods such as Random Forest, Bagging, or Gradient Boosting. For example, on the challenging Friedman #2 regression dataset, Claude 3 outperforms many supervised methods such as AdaBoost, SVM, Random Forest, KNN, or Gradient Boosting. We then investigate how well the performance of large language models scales with the number of in-context exemplars. We borrow the notion of regret from online learning and empirically show that LLMs are capable of obtaining sub-linear regret.

* 50 pages, 48 figures, preprint
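
To make the regret claim concrete, the sketch below computes cumulative regret from per-step losses. It assumes the learner's loss and a comparator's loss are already available for each in-context step; the abstract does not say which loss or comparator is used, so `cumulative_regret` and `average_regret` are hypothetical helpers and the numbers are made up. Regret is sub-linear when the cumulative value grows more slowly than the number of steps, that is, when the average regret tends to zero.

```python
from itertools import accumulate
from typing import List, Sequence


def cumulative_regret(
    learner_losses: Sequence[float], comparator_losses: Sequence[float]
) -> List[float]:
    """Running sum of (learner loss - comparator loss) after each step."""
    if len(learner_losses) != len(comparator_losses):
        raise ValueError("loss sequences must have the same length")
    gaps = (l - c for l, c in zip(learner_losses, comparator_losses))
    return list(accumulate(gaps))


def average_regret(
    learner_losses: Sequence[float], comparator_losses: Sequence[float]
) -> List[float]:
    """Cumulative regret divided by the number of steps seen so far."""
    totals = cumulative_regret(learner_losses, comparator_losses)
    return [total / step for step, total in enumerate(totals, start=1)]


# Made-up losses: the learner improves as it sees more in-context examples.
learner = [1.0, 0.5, 0.25]
comparator = [0.25, 0.25, 0.25]
print(cumulative_regret(learner, comparator))
print(average_regret(learner, comparator))
```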

Towards Realistic Few-Shot Relation Extraction: A New Meta Dataset and Evaluation

Apr 05, 2024

Fahmida Alam, Md Asiful Islam, Robert Vacareanu, Mihai Surdeanu

Abstract:We introduce a meta dataset for few-shot relation extraction, which includes two datasets derived from existing supervised relation extraction datasets NYT29 (Takanobu et al., 2019; Nayak and Ng, 2020) and WIKIDATA (Sorokin and Gurevych, 2017) as well as a few-shot form of the TACRED dataset (Sabo et al., 2021). Importantly, all these few-shot datasets were generated under realistic assumptions such as: the test relations are different from any relations a model might have seen before, limited training data, and a preponderance of candidate relation mentions that do not correspond to any of the relations of interest. Using this large resource, we conduct a comprehensive evaluation of six recent few-shot relation extraction methods, and observe that no method comes out as a clear winner. Further, the overall performance on this task is low, indicating substantial need for future research. We release all versions of the data, i.e., both supervised and few-shot, for future research.

Best of Both Worlds: A Pliable and Generalizable Neuro-Symbolic Approach for Relation Classification

Mar 05, 2024

Robert Vacareanu, Fahmida Alam, Md Asiful Islam, Haris Riaz, Mihai Surdeanu

Abstract:This paper introduces a novel neuro-symbolic architecture for relation classification (RC) that combines rule-based methods with contemporary deep learning techniques. This approach capitalizes on the strengths of both paradigms: the adaptability of rule-based systems and the generalization power of neural networks. Our architecture consists of two components: a declarative rule-based model for transparent classification and a neural component to enhance rule generalizability through semantic text matching. Notably, our semantic matcher is trained in an unsupervised domain-agnostic way, solely with synthetic data. Further, these components are loosely coupled, allowing for rule modifications without retraining the semantic matcher. In our evaluation, we focused on two few-shot relation classification datasets: Few-Shot TACRED and a Few-Shot version of NYT29. We show that our proposed method outperforms previous state-of-the-art models in three out of four settings, despite not seeing any human-annotated training data. Further, we show that our approach remains modular and pliable, i.e., the corresponding rules can be locally modified to improve the overall model. Human interventions to the rules for the TACRED relation org:parents boost the performance on that relation by as much as 26% relative improvement, without negatively impacting the other relations, and without retraining the semantic matching component.

Synthetic Dataset for Evaluating Complex Compositional Knowledge for Natural Language Inference

Jul 12, 2023

Sushma Anand Akoju, Robert Vacareanu, Haris Riaz, Eduardo Blanco, Mihai Surdeanu

Abstract:We introduce a synthetic dataset called Sentences Involving Complex Compositional Knowledge (SICCK) and a novel analysis that investigates how well Natural Language Inference (NLI) models understand compositionality in logic. We produce 1,304 sentence pairs by modifying 15 examples from the SICK dataset (Marelli et al., 2014). To this end, we modify the original texts using a set of phrases: modifiers that correspond to universal quantifiers, existential quantifiers, negation, and other concept modifiers in Natural Logic (NL) (MacCartney, 2009). We use these phrases to modify the subject, verb, and object parts of the premise and hypothesis. Lastly, we annotate these modified texts with the corresponding entailment labels following NL rules. We conduct a preliminary verification of how well the change in the structural and semantic composition is captured by neural NLI models, in both zero-shot and fine-tuned scenarios. We found that the performance of NLI models under the zero-shot setting is poor, especially for modified sentences with negation and existential quantifiers. After fine-tuning on this dataset, we observe that models continue to perform poorly over negation, existential and universal modifiers.

* Accepted to Natural Language Reasoning and Structured Explanations (NLRSE) Workshop, ACL 2023. For the dataset, please refer to https://github.com/clulab/releases/tree/master/acl2023-nlrse-sicck and https://github.com/sushmaakoju/natural-logic

A Weak Supervision Approach for Few-Shot Aspect Based Sentiment

May 19, 2023

Robert Vacareanu, Siddharth Varia, Kishaloy Halder, Shuai Wang, Giovanni Paolini, Neha Anna John, Miguel Ballesteros, Smaranda Muresan

Abstract:We explore how weak supervision on abundant unlabeled data can be leveraged to improve few-shot performance in aspect-based sentiment analysis (ABSA) tasks. We propose a pipeline approach to construct a noisy ABSA dataset, and we use it to adapt a pre-trained sequence-to-sequence model to the ABSA tasks. We test the resulting model on three widely used ABSA datasets, before and after fine-tuning. Our proposed method preserves the full fine-tuning performance while showing significant improvements (15.84% absolute F1) in the few-shot learning scenario for the harder tasks. In zero-shot (i.e., without fine-tuning), our method outperforms the previous state of the art on the aspect extraction sentiment classification (AESC) task and is, additionally, capable of performing the harder aspect sentiment triplet extraction (ASTE) task.
