Abstract:Hybrid LLM architectures that combine Attention and State Space Models (SSMs) achieve state-of-the-art accuracy and runtime performance. Recent work has demonstrated that applying compression and distillation to Attention-only models yields smaller, more accurate models at a fraction of the training cost. In this work, we explore the effectiveness of compressing Hybrid architectures. We introduce a novel group-aware pruning strategy that preserves the structural integrity of SSM blocks and their sequence modeling capabilities. Furthermore, we demonstrate the necessity of such SSM pruning to achieve improved accuracy and inference speed compared to traditional approaches. Our compression recipe combines SSM, FFN, embedding dimension, and layer pruning, followed by knowledge distillation-based retraining, similar to the MINITRON technique. Using this approach, we compress the Nemotron-H 8B Hybrid model down to 4B parameters with up to 40x fewer training tokens. The resulting model surpasses the accuracy of similarly-sized models while achieving 2x faster inference, significantly advancing the Pareto frontier.
Abstract:As inference-time scaling becomes critical for enhanced reasoning capabilities, it is increasingly becoming important to build models that are efficient to infer. We introduce Nemotron-H, a family of 8B and 56B/47B hybrid Mamba-Transformer models designed to reduce inference cost for a given accuracy level. To achieve this goal, we replace the majority of self-attention layers in the common Transformer model architecture with Mamba layers that perform constant computation and require constant memory per generated token. We show that Nemotron-H models offer either better or on-par accuracy compared to other similarly-sized state-of-the-art open-sourced Transformer models (e.g., Qwen-2.5-7B/72B and Llama-3.1-8B/70B), while being up to 3$\times$ faster at inference. To further increase inference speed and reduce the memory required at inference time, we created Nemotron-H-47B-Base from the 56B model using a new compression via pruning and distillation technique called MiniPuzzle. Nemotron-H-47B-Base achieves similar accuracy to the 56B model, but is 20% faster to infer. In addition, we introduce an FP8-based training recipe and show that it can achieve on par results with BF16-based training. This recipe is used to train the 56B model. All Nemotron-H models will be released, with support in Hugging Face, NeMo, and Megatron-LM.
Abstract:While topic modeling in English has become a prevalent and well-explored area, venturing into topic modeling for Indic languages remains relatively rare. The limited availability of resources, diverse linguistic structures, and unique challenges posed by Indic languages contribute to the scarcity of research and applications in this domain. Despite the growing interest in natural language processing and machine learning, there exists a noticeable gap in the comprehensive exploration of topic modeling methodologies tailored specifically for languages such as Hindi, Marathi, Tamil, and others. In this paper, we examine several topic modeling approaches applied to the Marathi language. Specifically, we compare various BERT and non-BERT approaches, including multilingual and monolingual BERT models, using topic coherence and topic diversity as evaluation metrics. Our analysis provides insights into the performance of these approaches for Marathi language topic modeling. The key finding of the paper is that BERTopic, when combined with BERT models trained on Indic languages, outperforms LDA in terms of topic modeling performance.
Abstract:This study explores the effectiveness of layer pruning for developing more efficient BERT models tailored to specific downstream tasks in low-resource languages. Our primary objective is to evaluate whether pruned BERT models can maintain high performance while reducing model size and complexity. We experiment with several BERT variants, including MahaBERT-v2 and Google-Muril, applying different pruning strategies and comparing their performance to smaller, scratch-trained models like MahaBERT-Small and MahaBERT-Smaller. We fine-tune these models on Marathi datasets, specifically Short Headlines Classification (SHC), Long Paragraph Classification (LPC) and Long Document Classification (LDC), to assess their classification accuracy. Our findings demonstrate that pruned models, despite having fewer layers, achieve comparable performance to their fully-layered counterparts while consistently outperforming scratch-trained models of similar size. Notably, pruning layers from the middle of the model proves to be the most effective strategy, offering performance competitive with pruning from the top and bottom. However, there is no clear winner, as different pruning strategies perform better in different model and dataset combinations. Additionally, monolingual BERT models outperform multilingual ones in these experiments. This approach, which reduces computational demands, provides a faster and more efficient alternative to training smaller models from scratch, making advanced NLP models more accessible for low-resource languages without compromising classification accuracy.
Abstract:Large Language Models (LLMs) have demonstrated remarkable multilingual capabilities, yet challenges persist in adapting these models for low-resource languages. In this study, we investigate the effects of Low-Rank Adaptation (LoRA) Parameter-Efficient Fine-Tuning (PEFT) on multilingual Gemma models for Marathi, a language with limited resources. Using a translated Alpaca dataset with 52,000 instruction-response pairs, our findings reveal that while evaluation metrics often show a performance decline post-fine-tuning, manual assessments frequently suggest that the fine-tuned models outperform their original counterparts. The observations indicate improvements in target language generation capabilities but a reduction in reasoning abilities following language adaptation. These results underscore the need for improved evaluation methodologies and the creation of high-quality native datasets to accurately assess language-specific model performance in low-resource settings.
Abstract:Code-mixing is the practice of using two or more languages in a single sentence, which often occurs in multilingual communities such as India where people commonly speak multiple languages. Classic NLP tools, trained on monolingual data, face challenges when dealing with code-mixed data. Extracting meaningful information from sentences containing multiple languages becomes difficult, particularly in tasks like hate speech detection, due to linguistic variation, cultural nuances, and data sparsity. To address this, we aim to analyze the significance of code-mixed embeddings and evaluate the performance of BERT and HingBERT models (trained on a Hindi-English corpus) in hate speech detection. Our study demonstrates that HingBERT models, benefiting from training on the extensive Hindi-English dataset L3Cube-HingCorpus, outperform BERT models when tested on hate speech text datasets. We also found that code-mixed Hing-FastText performs better than standard English FastText and vanilla BERT models.
Abstract:Low-resource languages face significant challenges due to the lack of sufficient linguistic data, resources, and tools for tasks such as supervised learning, annotation, and classification. This shortage hinders the development of accurate models and datasets, making it difficult to perform critical NLP tasks like sentiment analysis or hate speech detection. To bridge this gap, Large Language Models (LLMs) present an opportunity for potential annotators, capable of generating datasets and resources for these underrepresented languages. In this paper, we focus on Marathi, a low-resource language, and evaluate the performance of both closed-source and open-source LLMs as annotators. We assess models such as GPT-4o and Gemini 1.0 Pro, Gemma 2 (2B and 9B), and Llama 3.1 (8B) on classification tasks including sentiment analysis, news classification, and hate speech detection. Our findings reveal that while LLMs excel in annotation tasks for high-resource languages like English, they still fall short when applied to Marathi. Even advanced closed models like Gemini and GPT underperform in comparison to BERT-based baselines, highlighting the limitations of LLMs as annotators for low-resource languages.
Abstract:Natural Language Processing (NLP) for low-resource languages presents significant challenges, particularly due to the scarcity of high-quality annotated data and linguistic resources. The choice of embeddings plays a critical role in enhancing the performance of NLP tasks, such as news classification, sentiment analysis, and hate speech detection, especially for low-resource languages like Marathi. In this study, we investigate the impact of various embedding techniques- Contextual BERT-based, Non-Contextual BERT-based, and FastText-based on NLP classification tasks specific to the Marathi language. Our research includes a thorough evaluation of both compressed and uncompressed embeddings, providing a comprehensive overview of how these embeddings perform across different scenarios. Specifically, we compare two BERT model embeddings, Muril and MahaBERT, as well as two FastText model embeddings, IndicFT and MahaFT. Our evaluation includes applying embeddings to a Multiple Logistic Regression (MLR) classifier for task performance assessment, as well as TSNE visualizations to observe the spatial distribution of these embeddings. The results demonstrate that contextual embeddings outperform non-contextual embeddings. Furthermore, BERT-based non-contextual embeddings extracted from the first BERT embedding layer yield better results than FastText-based embeddings, suggesting a potential alternative to FastText embeddings.
Abstract:The demand for sophisticated natural language processing (NLP) methods, particularly Named Entity Recognition (NER), has increased due to the exponential growth of Marathi-language digital content. In particular, NER is essential for recognizing distant entities and for arranging and understanding unstructured Marathi text data. With an emphasis on managing long-range entities, this paper offers a comprehensive analysis of current NER techniques designed for Marathi documents. It dives into current practices and investigates the BERT transformer model's potential for long-range Marathi NER. Along with analyzing the effectiveness of earlier methods, the report draws comparisons between NER in English literature and suggests adaptation strategies for Marathi literature. The paper discusses the difficulties caused by Marathi's particular linguistic traits and contextual subtleties while acknowledging NER's critical role in NLP. To conclude, this project is a major step forward in improving Marathi NER techniques, with potential wider applications across a range of NLP tasks and domains.
Abstract:We present the MahaSUM dataset, a large-scale collection of diverse news articles in Marathi, designed to facilitate the training and evaluation of models for abstractive summarization tasks in Indic languages. The dataset, containing 25k samples, was created by scraping articles from a wide range of online news sources and manually verifying the abstract summaries. Additionally, we train an IndicBART model, a variant of the BART model tailored for Indic languages, using the MahaSUM dataset. We evaluate the performance of our trained models on the task of abstractive summarization and demonstrate their effectiveness in producing high-quality summaries in Marathi. Our work contributes to the advancement of natural language processing research in Indic languages and provides a valuable resource for future research in this area using state-of-the-art models. The dataset and models are shared publicly at https://github.com/l3cube-pune/MarathiNLP