Vikram Goyal

Tox-BART: Leveraging Toxicity Attributes for Explanation Generation of Implicit Hate Speech

Jun 06, 2024

Neemesh Yadav, Sarah Masud, Vikram Goyal, Md Shad Akhtar, Tanmoy Chakraborty

Abstract:Employing language models to generate explanations for an incoming implicit hate post is an active area of research. The explanation is intended to make explicit the underlying stereotype and aid content moderators. The training often combines top-k relevant knowledge graph (KG) tuples to provide world knowledge and improve performance on standard metrics. Interestingly, our study presents conflicting evidence for the role of the quality of KG tuples in generating implicit explanations. Consequently, simpler models incorporating external toxicity signals outperform KG-infused models. Compared to the KG-based setup, we observe a comparable performance for SBIC (LatentHatred) datasets with a performance variation of +0.44 (+0.49), +1.83 (-1.56), and -4.59 (+0.77) in BLEU, ROUGE-L, and BERTScore. Further human evaluation and error analysis reveal that our proposed setup produces more precise explanations than zero-shot GPT-3.5, highlighting the intricate nature of the task.

* 17 Pages, 5 Figures, 13 Tables, ACL Findings 2024

SymTax: Symbiotic Relationship and Taxonomy Fusion for Effective Citation Recommendation

May 26, 2024

Karan Goyal, Mayank Goel, Vikram Goyal, Mukesh Mohania

Abstract:Citing pertinent literature is pivotal to writing and reviewing a scientific document. Existing techniques mainly focus on the local context or the global context for recommending citations but fail to consider the actual human citation behaviour. We propose SymTax, a three-stage recommendation architecture that considers both the local and the global context, and additionally the taxonomical representations of query-candidate tuples and the symbiosis prevailing amongst them. SymTax learns to embed the infused taxonomies in the hyperbolic space and uses hyperbolic separation as a latent feature to compute query-candidate similarity. We build a novel and large dataset ArSyTa containing 8.27 million citation contexts and describe the creation process in detail. We conduct extensive experiments and ablation studies to demonstrate the effectiveness and design choice of each module in our framework. Also, combinatorial analysis from our experiments sheds light on the choice of language models (LMs) and fusion embedding, and the inclusion of section heading as a signal. Our proposed module that captures the symbiotic relationship on its own leads to performance gains of 26.66% and 39.25% in Recall@5 w.r.t. SOTA on ACL-200 and RefSeer datasets, respectively. The complete framework yields a gain of 22.56% in Recall@5 w.r.t. SOTA on our proposed dataset. The code and dataset are available at https://github.com/goyalkaraniit/SymTax

* Accepted in ACL 2024
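
The SymTax abstract above mentions hyperbolic separation as a feature for query-candidate similarity. As a rough illustration only, the sketch below computes distance in the Poincaré ball, one common model of hyperbolic space. The abstract does not say which model SymTax uses, so the choice of the Poincaré ball, the function name `poincare_distance`, and the sample vectors are all assumptions made for this example and are not taken from the SymTax code.

```python
import math


def poincare_distance(u, v):
    """Distance between two points inside the unit Poincare ball.

    Illustrative helper (hypothetical name); assumes both points have
    Euclidean norm strictly below 1.
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    argument = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(argument)


if __name__ == "__main__":
    query = [0.10, 0.20]        # made-up embedding of a query
    near_origin = [0.15, 0.25]  # made-up candidate close to the query
    near_edge = [0.10, 0.95]    # made-up candidate near the ball boundary
    print(poincare_distance(query, near_origin))
    print(poincare_distance(query, near_edge))
```

Distances grow quickly as points approach the boundary of the ball, which is why this geometry is often used for tree-like structures such as taxonomies.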

Probing Critical Learning Dynamics of PLMs for Hate Speech Detection

Feb 03, 2024

Sarah Masud, Mohammad Aflah Khan, Vikram Goyal, Md Shad Akhtar, Tanmoy Chakraborty

Abstract:Despite their widespread adoption, there is a lack of research into how various critical aspects of pretrained language models (PLMs) affect their performance in hate speech detection. Through five research questions, our findings and recommendations lay the groundwork for empirically investigating different aspects of PLMs' use in hate speech detection. We take a deep dive into comparing different pretrained models, evaluating their seed robustness, finetuning settings, and the impact of pretraining data collection time. Our analysis reveals early peaks for downstream tasks during pretraining, the limited benefit of employing a more recent pretraining corpus, and the significance of specific layers during finetuning. We further call into question the use of domain-specific models and highlight the need for dynamic datasets for benchmarking hate speech detection.

* 20 pages, 9 figures, 14 tables. Accepted at EACL'24

Exploiting Representation Bias for Data Distillation in Abstractive Text Summarization

Dec 20, 2023

Yash Kumar Atri, Vikram Goyal, Tanmoy Chakraborty

Abstract:Abstractive text summarization is surging with the number of training samples to cater to the needs of the deep learning models. These models tend to exploit the training data representations to attain superior performance by improving the quantitative element of the resultant summary. However, increasing the size of the training set may not always be the ideal solution to maximize the performance, and therefore the quality of training samples and the learning protocol of deep learning models need to be revisited. In this paper, we aim to discretize the vector space of the abstractive text summarization models to understand the characteristics learned between the input embedding space and the models' encoder space. We show that deep models fail to capture the diversity of the input space. Further, the distribution of data points on the encoder space indicates that an unchecked increase in the training samples does not add value; rather, a tear-down of data samples is highly needed to make the models focus on variability and faithfulness. We employ clustering techniques to learn the diversity of a model's sample space and how data points are mapped from the embedding space to the encoder space and vice versa. Further, we devise a metric to filter out redundant data points to make the model more robust and less data hungry. We benchmark our proposed method using quantitative metrics, such as ROUGE, and qualitative metrics, such as BERTScore, FEQA and Pyramid score. We also quantify the reasons that inhibit the models from learning the diversity from the varied input samples.

Army of Thieves: Enhancing Black-Box Model Extraction via Ensemble based sample selection

Nov 08, 2023

Akshit Jindal, Vikram Goyal, Saket Anand, Chetan Arora

Abstract:Machine Learning (ML) models become vulnerable to Model Stealing Attacks (MSA) when they are deployed as a service. In such attacks, the deployed model is queried repeatedly to build a labelled dataset. This dataset allows the attacker to train a thief model that mimics the original model. To maximize query efficiency, the attacker has to select the most informative subset of data points from the pool of available data. Existing attack strategies utilize approaches like Active Learning and Semi-Supervised learning to minimize costs. However, in the black-box setting, these approaches may select sub-optimal samples as they train only one thief model. Depending on the thief model's capacity and the data it was pretrained on, the model might even select noisy samples that harm the learning process. In this work, we explore the usage of an ensemble of deep learning models as our thief model. We call our attack Army of Thieves (AOT) as we train multiple models with varying complexities to leverage the crowd's wisdom. Based on the ensemble's collective decision, uncertain samples are selected for querying, while the most confident samples are directly included in the training data. Our approach is the first one to utilize an ensemble of thief models to perform model extraction. We outperform the base approaches of existing state-of-the-art methods by at least 3% and achieve a 21% higher adversarial sample transferability than previous work for models trained on the CIFAR-10 dataset.

* 10 pages, 5 figures, paper accepted to WACV 2024
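
The Army of Thieves abstract above describes splitting the unlabelled pool by the ensemble's collective decision: uncertain samples are queried, while the most confident ones are added to the training data directly. A minimal Python sketch of that idea follows. The abstract does not state how uncertainty or confidence is measured, so the entropy of the averaged prediction, the `confidence_threshold` value, and the function name `split_pool` are assumptions for illustration, not the authors' method.

```python
import math


def split_pool(ensemble_probs, query_budget, confidence_threshold=0.9):
    """Split an unlabelled pool into samples to query and samples to pseudo-label.

    ensemble_probs[i][m] is the class-probability list that ensemble member m
    predicts for sample i. Hypothetical helper: the entropy score and the
    threshold are assumptions, not taken from the paper.
    """
    scored = []
    for idx, per_model in enumerate(ensemble_probs):
        n_models = len(per_model)
        n_classes = len(per_model[0])
        # Collective decision: average the members' predicted distributions.
        mean = [sum(m[c] for m in per_model) / n_models for c in range(n_classes)]
        # Uncertainty: Shannon entropy of the averaged distribution.
        entropy = -sum(p * math.log(p) for p in mean if p > 0.0)
        scored.append((idx, entropy, mean))

    # The most uncertain samples are sent to the deployed (victim) model.
    by_uncertainty = sorted(scored, key=lambda item: item[1], reverse=True)
    to_query = [idx for idx, _, _ in by_uncertainty[:query_budget]]

    # Confident samples that were not queried receive the ensemble's own label.
    queried = set(to_query)
    pseudo_labels = {}
    for idx, _, mean in scored:
        if idx in queried:
            continue
        top_class = max(range(len(mean)), key=mean.__getitem__)
        if mean[top_class] >= confidence_threshold:
            pseudo_labels[idx] = top_class
    return to_query, pseudo_labels


if __name__ == "__main__":
    pool = [
        [[0.5, 0.5], [0.5, 0.5]],      # members are maximally unsure
        [[0.99, 0.01], [0.97, 0.03]],  # members agree confidently
        [[0.6, 0.4], [0.7, 0.3]],      # neither very unsure nor confident
    ]
    print(split_pool(pool, query_budget=1))
```

Samples that are neither queried nor confident simply stay in the pool for a later round in this sketch.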

MWPRanker: An Expression Similarity Based Math Word Problem Retriever

Jul 03, 2023

Mayank Goel, Venktesh V, Vikram Goyal

Abstract:Math Word Problems (MWPs) in online assessments help test the ability of the learner to make critical inferences by interpreting the linguistic information in them. To test the mathematical reasoning capabilities of the learners, sometimes the problem is rephrased or the thematic setting of the original MWP is changed. Since manual identification of MWPs with similar problem models is cumbersome, we propose a tool in this work for MWP retrieval. We propose a hybrid approach to retrieve similar MWPs with the same problem model. In our work, the problem model refers to the sequence of operations to be performed to arrive at the solution. We demonstrate that our tool is useful for the mentioned tasks and better than semantic similarity-based approaches, which fail to capture the arithmetic and logical sequence of the MWPs. A demo of the tool can be found at https://www.youtube.com/watch?v=gSQWP3chFIs

* Accepted to ECML-PKDD 2023

Fusing Multimodal Signals on Hyper-complex Space for Extreme Abstractive Text Summarization (TL;DR) of Scientific Contents

Jun 24, 2023

Yash Kumar Atri, Vikram Goyal, Tanmoy Chakraborty

Abstract:The realm of scientific text summarization has experienced remarkable progress due to the availability of annotated brief summaries and ample data. However, the utilization of multiple input modalities, such as videos and audio, has yet to be thoroughly explored. At present, scientific multimodal-input-based text summarization systems tend to employ longer target summaries like abstracts, leading to an underwhelming performance in the task of text summarization. In this paper, we deal with a novel task of extreme abstractive text summarization (aka TL;DR generation) by leveraging multiple input modalities. To this end, we introduce mTLDR, a first-of-its-kind dataset for the aforementioned task, comprising videos, audio, and text, along with both author-composed summaries and expert-annotated summaries. The mTLDR dataset comprises a total of 4,182 instances collected from various academic conference proceedings, such as ICLR, ACL, and CVPR. Subsequently, we present mTLDRgen, an encoder-decoder-based model that employs a novel dual-fused hyper-complex Transformer combined with a Wasserstein Riemannian Encoder Transformer, to dexterously capture the intricacies between different modalities in a hyper-complex latent geometric space. The hyper-complex Transformer captures the intrinsic properties between the modalities, while the Wasserstein Riemannian Encoder Transformer captures the latent structure of the modalities in the latent space geometry, thereby enabling the model to produce diverse sentences. mTLDRgen outperforms 20 baselines on mTLDR as well as another non-scientific dataset (How2) across three ROUGE-based evaluation measures. Furthermore, based on the qualitative metrics, BERTScore and FEQA, and human evaluations, we demonstrate that the summaries generated by mTLDRgen are fluent and congruent to the original source material.

* Accepted to ADS-SIGKDD2023

Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment

Jun 15, 2023

Atharva Kulkarni, Sarah Masud, Vikram Goyal, Tanmoy Chakraborty

Abstract:Social media is awash with hateful content, much of which is often veiled with linguistic and topical diversity. The benchmark datasets used for hate speech detection do not account for such divagation as they are predominantly compiled using hate lexicons. However, capturing hate signals becomes challenging in neutrally-seeded malicious content. Thus, designing models and datasets that mimic the real-world variability of hate warrants further investigation. To this end, we present GOTHate, a large-scale code-mixed crowdsourced dataset of around 51k posts for hate speech detection from Twitter. GOTHate is neutrally seeded, encompassing different languages and topics. We conduct detailed comparisons of GOTHate with the existing hate speech datasets, highlighting its novelty. We benchmark it with 10 recent baselines. Our extensive empirical and benchmarking experiments suggest that GOTHate is hard to classify in a text-only setup. Thus, we investigate how adding endogenous signals enhances the hate speech detection task. We augment GOTHate with the user's timeline information and ego network, bringing the overall data source closer to the real-world setup for understanding hateful content. Our proposed solution HEN-mBERT is a modular, multilingual, mixture-of-experts model that enriches the linguistic subspace with latent endogenous signals from history, topology, and exemplars. HEN-mBERT transcends the best baseline by 2.5% and 5% in overall macro-F1 and hate class F1, respectively. Inspired by our experiments, in partnership with Wipro AI, we are developing a semi-automated pipeline to detect hateful content as a part of their mission to tackle online harm.

* 15 pages, 4 figures, 11 tables. Accepted at SIGKDD'23

Coherence and Diversity through Noise: Self-Supervised Paraphrase Generation via Structure-Aware Denoising

Feb 06, 2023

Rishabh Gupta, Venktesh V., Mukesh Mohania, Vikram Goyal

Abstract:In this paper, we propose SCANING, an unsupervised framework for paraphrasing via controlled noise injection. We focus on the novel task of paraphrasing algebraic word problems having practical applications in online pedagogy as a means to reduce plagiarism as well as ensure understanding on the part of the student instead of rote memorization. This task is more complex than paraphrasing general-domain corpora due to the difficulty in preserving critical information for solution consistency of the paraphrased word problem, managing the increased length of the text and ensuring diversity in the generated paraphrase. Existing approaches fail to demonstrate adequate performance on at least one, if not all, of these facets, necessitating a more comprehensive solution. To this end, we model the noising search space as a composition of contextual and syntactic aspects and sample noising functions consisting of either one or both aspects. This allows for learning a denoising function that operates over both aspects and produces semantically equivalent and syntactically diverse outputs through grounded noise injection. The denoising function serves as a foundation for learning a paraphrasing function which operates solely in the input-paraphrase space without carrying any direct dependency on noise. We demonstrate that SCANING considerably improves performance in terms of both semantic preservation and producing diverse paraphrases through extensive automated and manual evaluation across 4 datasets.

* 12 pages (main); 22 pages in total

TagRec++: Hierarchical Label Aware Attention Network for Question Categorization

Aug 10, 2022

Venktesh Viswanathan, Mukesh Mohania, Vikram Goyal

Abstract:Online learning systems have multiple data repositories in the form of transcripts, books and questions. To enable ease of access, such systems organize the content according to a well-defined taxonomy of hierarchical nature (subject-chapter-topic). The task of categorizing inputs to the hierarchical labels is usually cast as a flat multi-class classification problem. Such approaches ignore the semantic relatedness between the terms in the input and the tokens in the hierarchical labels. Alternate approaches also suffer from class imbalance when they only consider leaf-level nodes as labels. To tackle the issues, we formulate the task as a dense retrieval problem to retrieve the appropriate hierarchical labels for each content. In this paper, we deal with categorizing questions. We model the hierarchical labels as a composition of their tokens and use an efficient cross-attention mechanism to fuse the information with the term representations of the content. We also propose an adaptive in-batch hard negative sampling approach which samples better negatives as the training progresses. We demonstrate that the proposed approach TagRec++ outperforms existing state-of-the-art approaches on question datasets as measured by Recall@k. In addition, we demonstrate zero-shot capabilities of TagRec++ and its ability to adapt to label changes.

* 12 pages, double column, Under review at IEEE Transactions on Knowledge and Data Engineering. arXiv admin note: text overlap with arXiv:2107.10649
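
The TagRec++ abstract above reports results in terms of Recall@k. For readers unfamiliar with the metric, here is a small Python sketch of one common definition: the fraction of relevant labels that appear among the top k retrieved items, averaged over queries. Papers differ in the exact variant they report, so this definition, the function names `recall_at_k` and `mean_recall_at_k`, and the toy labels are assumptions for illustration rather than the paper's evaluation code.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k of a ranking.

    Hypothetical helper; assumes `relevant` is a non-empty collection.
    """
    relevant_set = set(relevant)
    if not relevant_set:
        raise ValueError("relevant must contain at least one item")
    top_k = set(ranked[:k])
    return len(top_k & relevant_set) / len(relevant_set)


def mean_recall_at_k(runs, k):
    """Average recall_at_k over (ranked, relevant) pairs, one pair per query."""
    return sum(recall_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


if __name__ == "__main__":
    # Made-up hierarchical labels in subject-chapter-topic form.
    ranked = [
        "science-physics-optics",
        "science-physics-motion",
        "science-chemistry-acids",
    ]
    relevant = ["science-physics-motion"]
    print(recall_at_k(ranked, relevant, k=1))
    print(recall_at_k(ranked, relevant, k=2))
```

With a single gold label per question, this reduces to checking whether the correct label appears anywhere in the top k.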
