Anil Thomas

Essential-Web v1.0: 24T tokens of organized web data

Jun 17, 2025

Essential AI: Andrew Hojel, Michael Pust, Tim Romanski, Yash Vanjani, Ritvik Kapila, Mohit Parmar, Adarsh Chaluvaraju, Alok Tripathy (+15 more)

Abstract: Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets results in costly and inaccessible data pipelines. We present Essential-Web v1.0, a 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned 0.5B-parameter model that achieves an annotator agreement within 3% of Qwen2.5-32B-Instruct. With nothing more than SQL-style filters, we obtain competitive web-curated datasets in math (-8.0% relative to SOTA), web code (+14.3%), STEM (+24.5%) and medical (+8.6%). Essential-Web v1.0 is available on HuggingFace: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
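
To make the idea of curating by "SQL-style filters" concrete, here is a minimal sketch of filtering documents on taxonomy labels. The field names (`topic`, `doc_format`, `complexity`, `quality`), the label values, and the numeric scales are all assumptions for illustration; they are not the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass
class Document:
    text: str
    topic: str        # hypothetical taxonomy label
    doc_format: str   # hypothetical taxonomy label
    complexity: int   # hypothetical ordinal scale, higher = more complex
    quality: int      # hypothetical ordinal scale, higher = better


def select(docs, topic, min_quality):
    """Rough equivalent of: SELECT * FROM docs WHERE topic = ? AND quality >= ?"""
    return [d for d in docs if d.topic == topic and d.quality >= min_quality]


# Toy records standing in for annotated web documents.
docs = [
    Document("Proof of the triangle inequality ...", "math", "tutorial", 3, 4),
    Document("Buy cheap watches now", "shopping", "advert", 1, 1),
    Document("2 + 2 = 5 ???", "math", "forum", 1, 1),
]

math_docs = select(docs, topic="math", min_quality=3)
```

The point of the sketch is that once every document carries labels, a domain subset is a predicate over those labels rather than a separate classification pipeline.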

Practical Efficiency of Muon for Pretraining

May 04, 2025

Essential AI: Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas (+15 more)

Abstract: We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes, far beyond the so-called critical batch size, while remaining computationally efficient, thus enabling more economical training. We study the combination of Muon and the maximal update parameterization (muP) for efficient hyperparameter transfer and present a simple telescoping algorithm that accounts for all sources of error in muP while introducing only a modest overhead in resources. We validate our findings through extensive experiments with model sizes up to four billion parameters and ablations on the data distribution and architecture.

Rethinking Reflection in Pre-Training

Apr 05, 2025

Essential AI: Darsh J Shah, Peter Rushton, Somanshu Singla, Mohit Parmar, Kurt Smith, Yash Vanjani, Ashish Vaswani, Adarsh Chaluvaraju (+19 more)

Abstract: A language model's ability to reflect on its own reasoning provides a key advantage for solving complex problems. While most recent research has focused on how this ability develops during reinforcement learning, we show that it actually begins to emerge much earlier, during the model's pre-training. To study this, we introduce deliberate errors into chains-of-thought and test whether the model can still arrive at the correct answer by recognizing and correcting these mistakes. By tracking performance across different stages of pre-training, we observe that this self-correcting ability appears early and improves steadily over time. For instance, an OLMo2-7B model pre-trained on 4 trillion tokens displays self-correction on our six self-reflection tasks.
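
As an illustration of what "introducing deliberate errors into chains-of-thought" could look like, the sketch below corrupts one number in a reasoning string. This is only one possible perturbation, chosen for simplicity; the function name and the corruption rule are assumptions, and the abstract does not say how the paper constructs its errors.

```python
import random
import re


def inject_arithmetic_error(chain_of_thought: str, rng: random.Random) -> str:
    """Replace one integer in the text with a different, larger integer."""
    numbers = list(re.finditer(r"\d+", chain_of_thought))
    if not numbers:
        return chain_of_thought  # nothing to corrupt
    target = rng.choice(numbers)
    # Adding a positive offset guarantees the value actually changes.
    wrong = int(target.group()) + rng.choice([1, 2, 3])
    return (
        chain_of_thought[: target.start()]
        + str(wrong)
        + chain_of_thought[target.end():]
    )


cot = "3 apples plus 4 apples is 7 apples."
corrupted = inject_arithmetic_error(cot, random.Random(0))
```

A reflection test would then prompt the model with the corrupted chain and check whether its final answer still matches the original, correct one.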

Semi-supervised voice conversion with amortized variational inference

Sep 30, 2019

Cory Stephenson, Gokce Keskin, Anil Thomas, Oguz H. Elibol

Abstract: In this work we introduce a semi-supervised approach to the voice conversion problem, in which speech from a source speaker is converted into speech of a target speaker. The proposed method makes use of both parallel and non-parallel utterances from the source and target simultaneously during training. This approach can be used to extend existing parallel data voice conversion systems such that they can be trained with semi-supervision. We show that incorporating semi-supervision improves the voice conversion performance compared to fully supervised training when the number of parallel utterances is limited, as in many practical applications. Additionally, we find that increasing the number of non-parallel utterances used in training continues to improve performance when the amount of parallel training data is held constant.

* Proc. Interspeech 2019 (2019): 729-733
* Accepted for publication at Interspeech 2019

Semi-supervised and Population Based Training for Voice Commands Recognition

May 10, 2019

Oguz H. Elibol, Gokce Keskin, Anil Thomas

Abstract: We present a rapid design methodology that combines automated hyper-parameter tuning with semi-supervised training to build highly accurate and robust models for voice commands classification. The proposed approach allows quick evaluation of network architectures to fit performance and power constraints of available hardware, while ensuring good hyper-parameter choices for each network in real-world scenarios. Leveraging the vast amount of unlabeled data with a student/teacher based semi-supervised method, classification accuracy is improved from 84% to 94% on the validation set. For model optimization, we explore the hyper-parameter space through population based training and obtain an optimized model in the same time frame as it takes to train a single model.
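
The following is a toy sketch of the general population based training loop (train, rank, exploit, explore), not the authors' setup. The quadratic loss, the population size, the bottom-quartile replacement rule, and the 0.8/1.2 perturbation factors are all illustrative assumptions.

```python
import random


def train_step(theta, lr):
    # One gradient-descent step on the toy loss f(theta) = theta**2.
    return theta - lr * 2.0 * theta


def pbt(pop_size=8, rounds=10, seed=0):
    rng = random.Random(seed)
    # Each member has its own learning rate (log-uniform) and its own weights.
    population = [
        {"lr": 10 ** rng.uniform(-4, -1), "theta": 1.0} for _ in range(pop_size)
    ]
    for _ in range(rounds):
        # Train every member for one step with its own hyper-parameter.
        for member in population:
            member["theta"] = train_step(member["theta"], member["lr"])
        # Rank by loss, best first.
        population.sort(key=lambda m: m["theta"] ** 2)
        cutoff = max(1, pop_size // 4)
        for loser in population[-cutoff:]:
            winner = rng.choice(population[:cutoff])
            loser["theta"] = winner["theta"]                     # exploit: copy weights
            loser["lr"] = winner["lr"] * rng.choice([0.8, 1.2])  # explore: perturb lr
    return population


final = pbt()
best_loss = min(m["theta"] ** 2 for m in final)
```

Because all members train in parallel and weak members are replaced during training rather than after it, the wall-clock cost is close to that of training one model, which is the property the abstract highlights.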

* ICASSP 2019

Adversarially Trained Autoencoders for Parallel-Data-Free Voice Conversion

May 09, 2019

Orhan Ocal, Oguz H. Elibol, Gokce Keskin, Cory Stephenson, Anil Thomas, Kannan Ramchandran

Abstract: We present a method for converting voices between a set of speakers. Our method is based on training multiple autoencoder paths, where there is a single speaker-independent encoder and multiple speaker-dependent decoders. The autoencoders are trained with the addition of an adversarial loss, which is provided by an auxiliary classifier in order to guide the output of the encoder to be speaker independent. The training of the model is unsupervised in the sense that it does not require collecting the same utterances from the speakers, nor does it require time alignment over phonemes. Due to the use of a single encoder, our method can generalize to converting the voice of out-of-training speakers to speakers in the training dataset. We present subjective tests corroborating the performance of our method.
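
One common way to write down an objective of this shape is given below. It is a generic adversarial-autoencoder formulation, not necessarily the paper's exact loss. The symbols are assumptions: $E$ is the shared encoder, $D_k$ the decoder for speaker $k$, $C$ the auxiliary speaker classifier, $\mathcal{X}_k$ the utterances of speaker $k$, and $\lambda > 0$ a trade-off weight.

```latex
\begin{aligned}
\mathcal{L}_{\text{rec}} &= \sum_{k} \mathbb{E}_{x \sim \mathcal{X}_k} \left\| x - D_k(E(x)) \right\|^2, \\
\mathcal{L}_{\text{cls}} &= -\sum_{k} \mathbb{E}_{x \sim \mathcal{X}_k} \log C\left(k \mid E(x)\right), \\
\min_{C} \; & \mathcal{L}_{\text{cls}}, \qquad
\min_{E,\,\{D_k\}} \; \mathcal{L}_{\text{rec}} - \lambda\, \mathcal{L}_{\text{cls}}.
\end{aligned}
```

The classifier tries to identify the speaker from the encoding, while the encoder is rewarded when the classifier fails, which pushes the encoding toward being speaker independent. Conversion to speaker $j$ then amounts to computing $D_j(E(x))$.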
