Sergey Troshin

HSE University, Russia

ARM: Efficient Guided Decoding with Autoregressive Reward Models

Jul 05, 2024

Sergey Troshin, Vlad Niculae, Antske Fokkens

Abstract: Language models trained on large amounts of data require careful tuning to be safely deployed in the real world. We revisit the guided decoding paradigm, where the goal is to augment the logits of the base language model using the scores from a task-specific reward model. We propose a simple but efficient parameterization of the autoregressive reward model enabling fast and effective guided decoding. On detoxification and sentiment control tasks, we show that our efficient parameterization performs on par with RAD, a strong but less efficient guided decoding approach.
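
The idea of augmenting base-model logits with reward scores can be illustrated with a minimal sketch. It assumes a simple additive combination, logits plus `beta` times the reward score for each candidate token, followed by a softmax. The function name, the `beta` coefficient, and the toy numbers are illustrative assumptions, not details taken from the paper.

```python
import math


def guided_next_token_probs(base_logits, reward_scores, beta=1.0):
    """Add scaled reward scores to base logits and normalize with a softmax.

    Assumption: one reward score per vocabulary entry, combined additively.
    """
    if len(base_logits) != len(reward_scores):
        raise ValueError("need one reward score per vocabulary entry")
    combined = [l + beta * r for l, r in zip(base_logits, reward_scores)]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(combined)
    exps = [math.exp(c - m) for c in combined]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary of three tokens; the reward model penalizes token 0.
base_logits = [2.0, 1.0, 0.5]
reward_scores = [-3.0, 0.0, 0.0]

unguided = guided_next_token_probs(base_logits, reward_scores, beta=0.0)
guided = guided_next_token_probs(base_logits, reward_scores, beta=1.0)
```

With `beta=0.0` the base model's preferred token (index 0) stays most likely; with `beta=1.0` the penalty moves the most likely token to index 1.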

CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code

Aug 01, 2023

Nadezhda Chirkova, Sergey Troshin

Abstract: Recent works have widely adopted large language model pretraining for source code, suggested source code-specific pretraining objectives and investigated the applicability of various Transformer-based language model architectures for source code. This work investigates another important aspect of such models, namely the effect of different subtokenization options, and aims at identifying the most effective and length-efficient subtokenizations, taking into account code specifics. We propose a subtokenization that reduces average length by 17% without downstream performance drop, and show that a carefully chosen subtokenization may improve quality by 0.5-2%, possibly with some length increase.

* Published at ICLR 2023

SantaCoder: don't reach for the stars!

Jan 09, 2023

Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey (+31 more)

Abstract: The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code. This tech report describes the progress of the collaboration until December 2022, outlining the current state of the Personally Identifiable Information (PII) redaction pipeline, the experiments conducted to de-risk the model architecture, and the experiments investigating better preprocessing methods for the training data. We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack and evaluate them on the MultiPL-E text-to-code benchmark. We find that more aggressive filtering of near-duplicates can further boost performance and, surprisingly, that selecting files from repositories with 5+ GitHub stars deteriorates performance significantly. Our best model outperforms previous open-source multilingual code generation models (InCoder-6.7B and CodeGen-Multi-2.7B) in both left-to-right generation and infilling on the Java, JavaScript, and Python portions of MultiPL-E, despite being a substantially smaller model. All models are released under an OpenRAIL license at https://hf.co/bigcode.

Probing Pretrained Models of Source Code

Feb 16, 2022

Sergey Troshin, Nadezhda Chirkova

Abstract: Deep learning models are widely used for solving challenging code processing tasks, such as code generation or code summarization. Traditionally, a specific model architecture was carefully built to solve a particular code processing task. Recently, however, general pretrained models such as CodeBERT or CodeT5 have been shown to outperform task-specific models in many applications. While pretrained models are known to learn complex patterns from data, they may fail to understand some properties of source code. To test diverse aspects of code understanding, we introduce a set of diagnostic probing tasks. We show that pretrained models of code indeed contain information about code syntactic structure and correctness, the notions of identifiers, data flow and namespaces, and natural language naming. We also investigate how probing results are affected by using code-specific pretraining objectives, varying the model size, or finetuning.

Machine Learning Methods for Spectral Efficiency Prediction in Massive MIMO Systems

Dec 29, 2021

Evgeny Bobrov, Sergey Troshin, Nadezhda Chirkova, Ekaterina Lobacheva, Sviatoslav Panchenko, Dmitry Vetrov, Dmitry Kropotov

Abstract: Channel decoding, channel detection, channel assessment, and resource management for wireless multiple-input multiple-output (MIMO) systems are all examples of problems where machine learning (ML) can be successfully applied. In this paper, we study several ML approaches to solve the problem of estimating the spectral efficiency (SE) value for a certain precoding scheme, preferably in the shortest possible time. The best results in terms of mean average percentage error (MAPE) are obtained with gradient boosting over sorted features, while linear models demonstrate worse prediction quality. Neural networks perform similarly to gradient boosting, but they are more resource- and time-consuming because of hyperparameter tuning and frequent retraining. We investigate the practical applicability of the proposed algorithms in a wide range of scenarios generated by the Quadriga simulator. In almost all scenarios, the MAPE achieved using gradient boosting and neural networks is less than 10%.

* To appear in Optimization Methods & Software, 22 pages, 10 figures, 2 tables
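
For readers unfamiliar with the MAPE metric named in the abstract, here is a minimal sketch of how it is commonly computed (the usual definition is the mean of absolute relative errors, expressed in percent). The function name and the toy spectral-efficiency values are illustrative assumptions and are not data from the paper.

```python
def mape(y_true, y_pred):
    """Mean of |(true - predicted) / true| over all samples, in percent.

    Assumes every true value is nonzero; a zero raises ZeroDivisionError.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    relative_errors = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(relative_errors) / len(relative_errors)


# Hypothetical true and predicted spectral-efficiency values.
true_se = [10.0, 20.0, 40.0]
predicted_se = [9.0, 22.0, 40.0]
score = mape(true_se, predicted_se)  # relative errors: 10%, 10%, 0%
```

Here the three relative errors average to roughly 6.67%, which would fall under the 10% threshold mentioned in the abstract.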

A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code

Oct 23, 2020

Nadezhda Chirkova, Sergey Troshin

Abstract: There is an emerging interest in the application of deep learning models to source code processing tasks. One of the major problems in applying deep learning to software engineering is that source code often contains a lot of rare identifiers, resulting in huge vocabularies. We propose a simple yet effective method based on identifier anonymization to handle out-of-vocabulary (OOV) identifiers. Our method can be treated as a preprocessing step and therefore allows an easy implementation. We show that the proposed OOV anonymization method significantly improves the performance of the Transformer in two code processing tasks: code completion and bug fixing.
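
Identifier anonymization as a preprocessing step can be sketched roughly as follows. This is one plausible reading of the abstract, not the authors' implementation: each identifier that is missing from the vocabulary is replaced by a placeholder, and repeated occurrences within a snippet share the same placeholder. The function name, the `<VAR{n}>` placeholder format, and the toy inputs are all assumptions.

```python
def anonymize_oov(tokens, identifiers, vocabulary):
    """Replace out-of-vocabulary identifiers with per-snippet placeholders.

    tokens:      token sequence of one code snippet
    identifiers: set of tokens known to be identifiers in this snippet
    vocabulary:  set of tokens the model has an embedding for
    Returns the rewritten token list and the identifier-to-placeholder map.
    """
    mapping = {}
    anonymized = []
    for tok in tokens:
        if tok in identifiers and tok not in vocabulary:
            if tok not in mapping:
                # Number placeholders in order of first appearance.
                mapping[tok] = f"<VAR{len(mapping) + 1}>"
            anonymized.append(mapping[tok])
        else:
            anonymized.append(tok)
    return anonymized, mapping


tokens = ["total", "=", "my_rare_counter", "+", "my_rare_counter", "*", "other_rare"]
identifiers = {"total", "my_rare_counter", "other_rare"}
vocabulary = {"total", "=", "+", "*"}

anonymized, mapping = anonymize_oov(tokens, identifiers, vocabulary)
```

In this toy case both occurrences of `my_rare_counter` become `<VAR1>` and `other_rare` becomes `<VAR2>`, while the in-vocabulary identifier `total` is left unchanged.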

Empirical Study of Transformers for Source Code

Oct 15, 2020

Nadezhda Chirkova, Sergey Troshin

Abstract: Initially developed for natural language processing (NLP), Transformers are now widely used for source code processing, due to the format similarity between source code and text. In contrast to natural language, source code is strictly structured, i.e., it follows the syntax of the programming language. Several recent works develop Transformer modifications for capturing syntactic information in source code. The drawback of these works is that they do not compare to each other and all consider different tasks. In this work, we conduct a thorough empirical study of the capabilities of Transformers to utilize syntactic information in different tasks. We consider three tasks (code completion, function naming and bug fixing) and re-implement different syntax-capturing modifications in a unified framework. We show that Transformers are able to make meaningful predictions based purely on syntactic information and underline the best practices of taking the syntactic information into account for improving the performance of the model.
