Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Oluwatobi Olabiyi

Llama-Nemotron: Efficient Reasoning Models

May 02, 2025

Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani(+121 more)

Abstract:We introduce the Llama-Nemotron series of models, an open family of heterogeneous reasoning models that deliver exceptional reasoning capabilities, inference efficiency, and an open license for enterprise use. The family comes in three sizes -- Nano (8B), Super (49B), and Ultra (253B) -- and performs competitively with state-of-the-art reasoning models such as DeepSeek-R1 while offering superior inference throughput and memory efficiency. In this report, we discuss the training procedure for these models, which entails using neural architecture search from Llama 3 models for accelerated inference, knowledge distillation, and continued pretraining, followed by a reasoning-focused post-training stage consisting of two main parts: supervised fine-tuning and large scale reinforcement learning. Llama-Nemotron models are the first open-source models to support a dynamic reasoning toggle, allowing users to switch between standard chat and reasoning modes during inference. To further support open research and facilitate model development, we provide the following resources: 1. We release the Llama-Nemotron reasoning models -- LN-Nano, LN-Super, and LN-Ultra -- under the commercially permissive NVIDIA Open Model License Agreement. 2. We release the complete post-training dataset: Llama-Nemotron-Post-Training-Dataset. 3. We also release our training codebases: NeMo, NeMo-Aligner, and Megatron-LM.

Via

Access Paper or Ask Questions

Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning

Apr 15, 2025

Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Marcin Chochowski, Yashaswi Karnati, Raviraj Joshi, Ameya Sunil Mahabaleshwarkar, Zijia Chen, Yoshi Suhara, Oluwatobi Olabiyi(+8 more)

Abstract:Hybrid LLM architectures that combine Attention and State Space Models (SSMs) achieve state-of-the-art accuracy and runtime performance. Recent work has demonstrated that applying compression and distillation to Attention-only models yields smaller, more accurate models at a fraction of the training cost. In this work, we explore the effectiveness of compressing Hybrid architectures. We introduce a novel group-aware pruning strategy that preserves the structural integrity of SSM blocks and their sequence modeling capabilities. Furthermore, we demonstrate the necessity of such SSM pruning to achieve improved accuracy and inference speed compared to traditional approaches. Our compression recipe combines SSM, FFN, embedding dimension, and layer pruning, followed by knowledge distillation-based retraining, similar to the MINITRON technique. Using this approach, we compress the Nemotron-H 8B Hybrid model down to 4B parameters with up to 40x fewer training tokens. The resulting model surpasses the accuracy of similarly-sized models while achieving 2x faster inference, significantly advancing the Pareto frontier.

Via

Access Paper or Ask Questions

DLGNet: A Transformer-based Model for Dialogue Response Generation

Sep 04, 2019

Oluwatobi Olabiyi, Erik T. Mueller

Figure 1 for DLGNet: A Transformer-based Model for Dialogue Response Generation

Figure 2 for DLGNet: A Transformer-based Model for Dialogue Response Generation

Figure 3 for DLGNet: A Transformer-based Model for Dialogue Response Generation

Figure 4 for DLGNet: A Transformer-based Model for Dialogue Response Generation

Abstract:Neural dialogue models, despite their successes, still suffer from lack of relevance, diversity, and in many cases coherence in their generated responses. These issues can attributed to reasons including (1) short-range model architectures that capture limited temporal dependencies, (2) limitations of the maximum likelihood training objective, (3) the concave entropy profile of dialogue datasets resulting in short and generic responses, and (4) the out-of-vocabulary problem leading to generation of a large number of <UNK> tokens. On the other hand, transformer-based models such as GPT-2 have demonstrated an excellent ability to capture long-range structures in language modeling tasks. In this paper, we present DLGNet, a transformer-based model for dialogue modeling. We specifically examine the use of DLGNet for multi-turn dialogue response generation. In our experiments, we evaluate DLGNet on the open-domain Movie Triples dataset and the closed-domain Ubuntu Dialogue dataset. DLGNet models, although trained with only the maximum likelihood objective, achieve significant improvements over state-of-the-art multi-turn dialogue models. They also produce best performance to date on the two datasets based on several metrics, including BLEU, ROUGE, and distinct n-gram. Our analysis shows that the performance improvement is mostly due to the combination of (1) the long-range transformer architecture with (2) the injection of random informative paddings. Other contributing factors include the joint modeling of dialogue context and response, and the 100% tokenization coverage from the byte pair encoding (BPE).

Via

Access Paper or Ask Questions

Adversarial Bootstrapping for Dialogue Model Training

Sep 04, 2019

Oluwatobi Olabiyi, Erik T. Mueller, Christopher Larson, Tarek Lahlou

Figure 1 for Adversarial Bootstrapping for Dialogue Model Training

Figure 2 for Adversarial Bootstrapping for Dialogue Model Training

Figure 3 for Adversarial Bootstrapping for Dialogue Model Training

Figure 4 for Adversarial Bootstrapping for Dialogue Model Training

Abstract:Open domain neural dialogue models, despite their successes, are known to produce responses that lack relevance, diversity, and in many cases coherence. These shortcomings stem from the limited ability of common training objectives to directly express these properties as well as their interplay with training datasets and model architectures. Toward addressing these problems, this paper proposes bootstrapping a dialogue response generator with an adversarially trained discriminator. The method involves training a neural generator in both autoregressive and traditional teacher-forcing modes, with the maximum likelihood loss of the auto-regressive outputs weighted by the score from a metric-based discriminator model. The discriminator input is a mixture of ground truth labels, the teacher-forcing outputs of the generator, and distractors sampled from the dataset, thereby allowing for richer feedback on the autoregressive outputs of the generator. To improve the calibration of the discriminator output, we also bootstrap the discriminator with the matching of the intermediate features of the ground truth and the generator's autoregressive output. We explore different sampling and adversarial policy optimization strategies during training in order to understand how to encourage response diversity without sacrificing relevance. Our experiments shows that adversarial bootstrapping is effective at addressing exposure bias, leading to improvement in response relevance and coherence. The improvement is demonstrated with the state-of-the-art results on the Movie and Ubuntu dialogue datasets with respect to human evaluations and BLUE, ROGUE, and distinct n-gram scores.

Via

Access Paper or Ask Questions

An Adversarial Learning Framework For A Persona-Based Multi-Turn Dialogue Model

Apr 29, 2019

Oluwatobi Olabiyi, Anish Khazane, Alan Salimov, Erik T. Mueller

Figure 1 for An Adversarial Learning Framework For A Persona-Based Multi-Turn Dialogue Model

Figure 2 for An Adversarial Learning Framework For A Persona-Based Multi-Turn Dialogue Model

Figure 3 for An Adversarial Learning Framework For A Persona-Based Multi-Turn Dialogue Model

Figure 4 for An Adversarial Learning Framework For A Persona-Based Multi-Turn Dialogue Model

Abstract:In this paper, we extend the persona-based sequence-to-sequence (Seq2Seq) neural network conversation model to a multi-turn dialogue scenario by modifying the state-of-the-art hredGAN architecture to simultaneously capture utterance attributes such as speaker identity, dialogue topic, speaker sentiments and so on. The proposed system, phredGAN has a persona-based HRED generator (PHRED) and a conditional discriminator. We also explore two approaches to accomplish the conditional discriminator: (1) phredGAN_a, a system that passes the attribute representation as an additional input into a traditional adversarial discriminator, and (2) phredGAN_d, a dual discriminator system which in addition to the adversarial discriminator, collaboratively predicts the attribute(s) that generated the input utterance. To demonstrate the superior performance of phredGAN over the persona Seq2Seq model, we experiment with two conversational datasets, the Ubuntu Dialogue Corpus (UDC) and TV series transcripts from the Big Bang Theory and Friends. Performance comparison is made with respect to a variety of quantitative measures as well as crowd-sourced human evaluation. We also explore the trade-offs from using either variant of phredGAN on datasets with many but weak attribute modalities (such as with Big Bang Theory and Friends) and ones with few but strong attribute modalities (customer-agent interactions in Ubuntu dataset).

* NAACL NeuralGen Workshop 2019. arXiv admin note: substantial text overlap with arXiv:1905.01998

Via

Access Paper or Ask Questions

Multi-turn Dialogue Response Generation in an Adversarial Learning Framework

Sep 19, 2018

Oluwatobi Olabiyi, Alan Salimov, Anish Khazane, Erik T. Mueller

Figure 1 for Multi-turn Dialogue Response Generation in an Adversarial Learning Framework

Figure 2 for Multi-turn Dialogue Response Generation in an Adversarial Learning Framework

Figure 3 for Multi-turn Dialogue Response Generation in an Adversarial Learning Framework

Figure 4 for Multi-turn Dialogue Response Generation in an Adversarial Learning Framework

Abstract:We propose an adversarial learning approach to the generation of multi-turn dialogue responses. Our proposed framework, hredGAN, is based on conditional generative adversarial networks (GANs). The GAN's generator is a modified hierarchical recurrent encoder-decoder network (HRED) and the discriminator is a word-level bidirectional RNN that shares context and word embedding with the generator. During inference, noise samples conditioned on the dialogue history are used to perturb the generator's latent space to generate several possible responses. The final response is the one ranked best by the discriminator. The hredGAN shows major advantages over existing methods: (1) it generalizes better than networks trained using only the log-likelihood criterion, and (2) it generates longer, more informative and more diverse responses with high utterance and topic relevance even with limited training data. This superiority is demonstrated on the Movie triples and Ubuntu dialogue datasets in terms of perplexity, BLEU, ROUGE and Distinct n-gram scores.

Via

Access Paper or Ask Questions

Driver Action Prediction Using Deep (Bidirectional) Recurrent Neural Network

Jun 07, 2017

Oluwatobi Olabiyi, Eric Martinson, Vijay Chintalapudi, Rui Guo

Figure 1 for Driver Action Prediction Using Deep (Bidirectional) Recurrent Neural Network

Figure 2 for Driver Action Prediction Using Deep (Bidirectional) Recurrent Neural Network

Figure 3 for Driver Action Prediction Using Deep (Bidirectional) Recurrent Neural Network

Figure 4 for Driver Action Prediction Using Deep (Bidirectional) Recurrent Neural Network

Abstract:Advanced driver assistance systems (ADAS) can be significantly improved with effective driver action prediction (DAP). Predicting driver actions early and accurately can help mitigate the effects of potentially unsafe driving behaviors and avoid possible accidents. In this paper, we formulate driver action prediction as a timeseries anomaly prediction problem. While the anomaly (driver actions of interest) detection might be trivial in this context, finding patterns that consistently precede an anomaly requires searching for or extracting features across multi-modal sensory inputs. We present such a driver action prediction system, including a real-time data acquisition, processing and learning framework for predicting future or impending driver action. The proposed system incorporates camera-based knowledge of the driving environment and the driver themselves, in addition to traditional vehicle dynamics. It then uses a deep bidirectional recurrent neural network (DBRNN) to learn the correlation between sensory inputs and impending driver behavior achieving accurate and high horizon action prediction. The proposed system performs better than other existing systems on driver action prediction tasks and can accurately predict key driver actions including acceleration, braking, lane change and turning at durations of 5sec before the action is executed by the driver.

* ITSC'17

Via

Access Paper or Ask Questions