Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kunal Chawla

Sid

The Llama 4 Herd: Architecture, Training, Evaluation, and Deployment Notes

Jan 15, 2026

Aaron Adcock, Aayushi Srivastava, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pande, Abhinav Pandey, Abhinav Sharma, Abhishek Kadian, Abhishek Kumawat, Adam Kelsey(+1295 more)

Abstract:This document consolidates publicly reported technical details about Metas Llama 4 model family. It summarizes (i) released variants (Scout and Maverick) and the broader herd context including the previewed Behemoth teacher model, (ii) architectural characteristics beyond a high-level MoE description covering routed/shared-expert structure, early-fusion multimodality, and long-context design elements reported for Scout (iRoPE and length generalization strategies), (iii) training disclosures spanning pre-training, mid-training for long-context extension, and post-training methodology (lightweight SFT, online RL, and lightweight DPO) as described in release materials, (iv) developer-reported benchmark results for both base and instruction-tuned checkpoints, and (v) practical deployment constraints observed across major serving environments, including provider-specific context limits and quantization packaging. The manuscript also summarizes licensing obligations relevant to redistribution and derivative naming, and reviews publicly described safeguards and evaluation practices. The goal is to provide a compact technical reference for researchers and practitioners who need precise, source-backed facts about Llama 4.

* 15 pages

Via

Access Paper or Ask Questions

The Llama 3 Herd of Models

Jul 31, 2024

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan(+521 more)

Abstract:Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.

Via

Access Paper or Ask Questions

WHEN FLUE MEETS FLANG: Benchmarks and Large Pre-trained Language Model for Financial Domain

Oct 31, 2022

Raj Sanjay Shah, Kunal Chawla, Dheeraj Eidnani, Agam Shah, Wendi Du, Sudheer Chava, Natraj Raman, Charese Smiley, Jiaao Chen, Diyi Yang

Figure 1 for WHEN FLUE MEETS FLANG: Benchmarks and Large Pre-trained Language Model for Financial Domain

Figure 2 for WHEN FLUE MEETS FLANG: Benchmarks and Large Pre-trained Language Model for Financial Domain

Figure 3 for WHEN FLUE MEETS FLANG: Benchmarks and Large Pre-trained Language Model for Financial Domain

Figure 4 for WHEN FLUE MEETS FLANG: Benchmarks and Large Pre-trained Language Model for Financial Domain

Abstract:Pre-trained language models have shown impressive performance on a variety of tasks and domains. Previous research on financial language models usually employs a generic training scheme to train standard model architectures, without completely leveraging the richness of the financial data. We propose a novel domain specific Financial LANGuage model (FLANG) which uses financial keywords and phrases for better masking, together with span boundary objective and in-filing objective. Additionally, the evaluation benchmarks in the field have been limited. To this end, we contribute the Financial Language Understanding Evaluation (FLUE), an open-source comprehensive suite of benchmarks for the financial domain. These include new benchmarks across 5 NLP tasks in financial domain as well as common benchmarks used in the previous research. Experiments on these benchmarks suggest that our model outperforms those in prior literature on a variety of NLP tasks. Our models, code and benchmark data are publicly available on Github and Huggingface.

Via

Access Paper or Ask Questions

Semi-supervised Formality Style Transfer using Language Model Discriminator and Mutual Information Maximization

Oct 10, 2020

Kunal Chawla, Diyi Yang

Figure 1 for Semi-supervised Formality Style Transfer using Language Model Discriminator and Mutual Information Maximization

Figure 2 for Semi-supervised Formality Style Transfer using Language Model Discriminator and Mutual Information Maximization

Figure 3 for Semi-supervised Formality Style Transfer using Language Model Discriminator and Mutual Information Maximization

Figure 4 for Semi-supervised Formality Style Transfer using Language Model Discriminator and Mutual Information Maximization

Abstract:Formality style transfer is the task of converting informal sentences to grammatically-correct formal sentences, which can be used to improve performance of many downstream NLP tasks. In this work, we propose a semi-supervised formality style transfer model that utilizes a language model-based discriminator to maximize the likelihood of the output sentence being formal, which allows us to use maximization of token-level conditional probabilities for training. We further propose to maximize mutual information between source and target styles as our training objective instead of maximizing the regular likelihood that often leads to repetitive and trivial generated responses. Experiments showed that our model outperformed previous state-of-the-art baselines significantly in terms of both automated metrics and human judgement. We further generalized our model to unsupervised text style transfer task, and achieved significant improvements on two benchmark sentiment style transfer datasets.

* EMNLP 2020 Findings

Via

Access Paper or Ask Questions

Attention-based Ensemble for Deep Metric Learning

Aug 31, 2018

Wonsik Kim, Bhavya Goyal, Kunal Chawla, Jungmin Lee, Keunjoo Kwon

Figure 1 for Attention-based Ensemble for Deep Metric Learning

Figure 2 for Attention-based Ensemble for Deep Metric Learning

Figure 3 for Attention-based Ensemble for Deep Metric Learning

Figure 4 for Attention-based Ensemble for Deep Metric Learning

Abstract:Deep metric learning aims to learn an embedding function, modeled as deep neural network. This embedding function usually puts semantically similar images close while dissimilar images far from each other in the learned embedding space. Recently, ensemble has been applied to deep metric learning to yield state-of-the-art results. As one important aspect of ensemble, the learners should be diverse in their feature embeddings. To this end, we propose an attention-based ensemble, which uses multiple attention masks, so that each learner can attend to different parts of the object. We also propose a divergence loss, which encourages diversity among the learners. The proposed method is applied to the standard benchmarks of deep metric learning and experimental results show that it outperforms the state-of-the-art methods by a significant margin on image retrieval tasks.

* ECCV 2018 camera-ready

Via

Access Paper or Ask Questions

Classification of Flames in Computer Mediated Communications

Feb 17, 2012

Nitin, Ankush Bansal, Siddhartha Mahadev Sharma, Kapil Kumar, Anuj Aggarwal, Sheenu Goyal, Kanika Choudhary, Kunal Chawla, Kunal Jain, Manav Bhasin

Figure 1 for Classification of Flames in Computer Mediated Communications

Figure 2 for Classification of Flames in Computer Mediated Communications

Figure 3 for Classification of Flames in Computer Mediated Communications

Figure 4 for Classification of Flames in Computer Mediated Communications

Abstract:Computer Mediated Communication (CMC) has brought about a revolution in the way the world communicates with each other. With the increasing number of people, interacting through the internet and the rise of new platforms and technologies has brought together the people from different social, cultural and geographical backgrounds to present their thoughts, ideas and opinions on topics of their interest. CMC has, in some cases, gave users more freedom to express themselves as compared to Face-to-face communication. This has also led to rise in the use of hostile and aggressive language and terminologies uninhibitedly. Since such use of language is detrimental to the discussion process and affects the audience and individuals negatively, efforts are being taken to control them. The research sees the need to understand the concept of flaming and hence attempts to classify them in order to give a better understanding of it. The classification is done on the basis of type of flame content being presented and the Style in which they are presented.

* International Journal of Computer Applications (0975-8887), Volume 14 - No.6, February 2011
* 6 pages, 4 figures

Via

Access Paper or Ask Questions