Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Prathamesh Kalamkar

The Tokenization Bottleneck: How Vocabulary Extension Improves Chemistry Representation Learning in Pretrained Language Models

Nov 18, 2025

Prathamesh Kalamkar, Ned Letcher, Meissane Chami, Sahger Lad, Shayan Mohanty, Prasanna Pendse

Figure 1 for The Tokenization Bottleneck: How Vocabulary Extension Improves Chemistry Representation Learning in Pretrained Language Models

Figure 2 for The Tokenization Bottleneck: How Vocabulary Extension Improves Chemistry Representation Learning in Pretrained Language Models

Figure 3 for The Tokenization Bottleneck: How Vocabulary Extension Improves Chemistry Representation Learning in Pretrained Language Models

Figure 4 for The Tokenization Bottleneck: How Vocabulary Extension Improves Chemistry Representation Learning in Pretrained Language Models

Abstract:The application of large language models (LLMs) to chemistry is frequently hampered by a "tokenization bottleneck", where tokenizers tuned on general-domain text tend to fragment chemical representations such as SMILES into semantically uninformative sub-tokens. This paper introduces a principled methodology to resolve this bottleneck by unifying the representation of natural language and molecular structures within a single model. Our approach involves targeted vocabulary extension-augmenting a pretrained LLM's vocabulary with chemically salient tokens, followed by continued pretraining on chemistry-domain text to integrate this new knowledge. We provide an empirical demonstration of the effectiveness of this strategy, showing that our methodology leads to superior performance on a range of downstream chemical tasks.

Via

Access Paper or Ask Questions

Aalap: AI Assistant for Legal & Paralegal Functions in India

Jan 30, 2024

Aman Tiwari, Prathamesh Kalamkar, Atreyo Banerjee, Saurabh Karn, Varun Hemachandran, Smita Gupta

Abstract:Using proprietary Large Language Models on legal tasks poses challenges due to data privacy issues, domain data heterogeneity, domain knowledge sophistication, and domain objectives uniqueness. We created Aalalp, a fine-tuned Mistral 7B model on instructions data related to specific Indian legal tasks. The performance of Aalap is better than gpt-3.5-turbo in 31\% of our test data and obtains an equivalent score in 34\% of the test data as evaluated by GPT4. Training Aalap mainly focuses on teaching legal reasoning rather than legal recall. Aalap is definitely helpful for the day-to-day activities of lawyers, judges, or anyone working in legal systems.

Via

Access Paper or Ask Questions

SemEval 2023 Task 6: LegalEval - Understanding Legal Texts

May 01, 2023

Ashutosh Modi, Prathamesh Kalamkar, Saurabh Karn, Aman Tiwari, Abhinav Joshi, Sai Kiran Tanikella, Shouvik Kumar Guha, Sachin Malhan, Vivek Raghavan

Figure 1 for SemEval 2023 Task 6: LegalEval - Understanding Legal Texts

Figure 2 for SemEval 2023 Task 6: LegalEval - Understanding Legal Texts

Figure 3 for SemEval 2023 Task 6: LegalEval - Understanding Legal Texts

Figure 4 for SemEval 2023 Task 6: LegalEval - Understanding Legal Texts

Abstract:In populous countries, pending legal cases have been growing exponentially. There is a need for developing NLP-based techniques for processing and automatically understanding legal documents. To promote research in the area of Legal NLP we organized the shared task LegalEval - Understanding Legal Texts at SemEval 2023. LegalEval task has three sub-tasks: Task-A (Rhetorical Roles Labeling) is about automatically structuring legal documents into semantically coherent units, Task-B (Legal Named Entity Recognition) deals with identifying relevant entities in a legal document and Task-C (Court Judgement Prediction with Explanation) explores the possibility of automatically predicting the outcome of a legal case along with providing an explanation for the prediction. In total 26 teams (approx. 100 participants spread across the world) submitted systems paper. In each of the sub-tasks, the proposed systems outperformed the baselines; however, there is a lot of scope for improvement. This paper describes the tasks, and analyzes techniques proposed by various teams.

* 13 Pages (9 Pages + References), Accepted at SemEval 2023 at ACL 2023

Via

Access Paper or Ask Questions

Named Entity Recognition in Indian court judgments

Nov 07, 2022

Prathamesh Kalamkar, Astha Agarwal, Aman Tiwari, Smita Gupta, Saurabh Karn, Vivek Raghavan

Abstract:Identification of named entities from legal texts is an essential building block for developing other legal Artificial Intelligence applications. Named Entities in legal texts are slightly different and more fine-grained than commonly used named entities like Person, Organization, Location etc. In this paper, we introduce a new corpus of 46545 annotated legal named entities mapped to 14 legal entity types. The Baseline model for extracting legal named entities from judgment text is also developed.

* to be published in NLLP 2022 Workshop at EMNLP

Via

Access Paper or Ask Questions

Corpus for Automatic Structuring of Legal Documents

Jan 31, 2022

Prathamesh Kalamkar, Aman Tiwari, Astha Agarwal, Saurabh Karn, Smita Gupta, Vivek Raghavan, Ashutosh Modi

Figure 1 for Corpus for Automatic Structuring of Legal Documents

Figure 2 for Corpus for Automatic Structuring of Legal Documents

Figure 3 for Corpus for Automatic Structuring of Legal Documents

Figure 4 for Corpus for Automatic Structuring of Legal Documents

Abstract:In populous countries, pending legal cases have been growing exponentially. There is a need for developing techniques for processing and organizing legal documents. In this paper, we introduce a new corpus for structuring legal documents. In particular, we introduce a corpus of legal judgment documents in English that are segmented into topical and coherent parts. Each of these parts is annotated with a label coming from a list of pre-defined Rhetorical Roles. We develop baseline models for automatically predicting rhetorical roles in a legal document based on the annotated corpus. Further, we show the application of rhetorical roles to improve performance on the tasks of summarization and legal judgment prediction. We release the corpus and baseline model code along with the paper.

* 10 Pages (8 page main paper + 2 page references)

Via

Access Paper or Ask Questions

Indian Legal NLP Benchmarks : A Survey

Jul 13, 2021

Prathamesh Kalamkar, Janani Venugopalan Ph. D., Vivek Raghavan Ph. D

Figure 1 for Indian Legal NLP Benchmarks : A Survey

Figure 2 for Indian Legal NLP Benchmarks : A Survey

Figure 3 for Indian Legal NLP Benchmarks : A Survey

Figure 4 for Indian Legal NLP Benchmarks : A Survey

Abstract:Availability of challenging benchmarks is the key to advancement of AI in a specific field.Since Legal Text is significantly different than normal English text, there is a need to create separate Natural Language Processing benchmarks for Indian Legal Text which are challenging and focus on tasks specific to Legal Systems. This will spur innovation in applications of Natural language Processing for Indian Legal Text and will benefit AI community and Legal fraternity. We review the existing work in this area and propose ideas to create new benchmarks for Indian Legal Natural Language Processing.

Via

Access Paper or Ask Questions