Lars Hertel

From Features to Transformers: Redefining Ranking for Scalable Impact

Feb 05, 2025

Fedor Borisyuk, Lars Hertel, Ganesh Parameswaran, Gaurav Srivastava, Sudarshan Srinivasa Ramanujam, Borja Ocejo, Peng Du, Andrei Akterskii, Neil Daftary, Shao Tang (+6 more)

Abstract: We present LiGR, a large-scale ranking framework developed at LinkedIn that brings state-of-the-art transformer-based modeling architectures into production. We introduce a modified transformer architecture that incorporates learned normalization and simultaneous set-wise attention to user history and ranked items. This architecture enables several breakthrough achievements, including: (1) the deprecation of most manually designed feature engineering, outperforming the prior state-of-the-art system using only a few features (compared to hundreds in the baseline), (2) validation of the scaling law for ranking systems, showing improved performance with larger models, more training data, and longer context sequences, and (3) simultaneous joint scoring of items in a set-wise manner, leading to automated improvements in diversity. To enable efficient serving of large ranking models, we describe techniques to scale inference effectively using single-pass processing of user history and set-wise attention. We also summarize key insights from various ablation studies and A/B tests, highlighting the most impactful technical approaches.

Efficient user history modeling with amortized inference for deep learning recommendation models

Dec 09, 2024

Lars Hertel, Neil Daftary, Fedor Borisyuk, Aman Gupta, Rahul Mazumder

Abstract: We study user history modeling via Transformer encoders in deep learning recommendation models (DLRM). Such architectures can significantly improve recommendation quality, but usually incur a high latency cost, necessitating infrastructure upgrades or very small Transformer models. An important part of user history modeling is early fusion of the candidate item, and various methods have been studied. We revisit early fusion and compare concatenation of the candidate to each history item against appending it to the end of the list as a separate item. Using the latter method allows us to reformulate the recently proposed amortized history inference algorithm M-FALCON (Zhai et al., 2024) for the case of DLRM models. We show via experimental results that appending with cross-attention performs on par with concatenation and that amortization significantly reduces inference costs. We conclude with results from deploying this model on the LinkedIn Feed and Ads surfaces, where amortization reduces latency by 30% compared to non-amortized inference.

* 5 pages, 3 figures, WWW 2025
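
The appending idea in the abstract above can be illustrated with a toy sketch. It is not the authors' implementation or M-FALCON itself: it assumes a single attention layer with identity projections, full (non-causal) attention among history items, and hypothetical helper names (`attention`, `append_mask`, `score_together`, `score_separately`). The point it shows is that when every candidate is appended after the history and masked so that it sees only the history and itself, one pass over the history yields the same candidate outputs as scoring each candidate in its own pass.

```python
import math


def attention(x, mask):
    """Toy single-head dot-product attention with identity projections."""
    d = len(x[0])
    out = []
    for i, q in enumerate(x):
        # Masked-out positions get -inf so they receive zero weight.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            if mask[i][j] else float("-inf")
            for j, k in enumerate(x)
        ]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[c] for w, v in zip(weights, x)) / total
            for c in range(d)
        ])
    return out


def append_mask(n_history, n_candidates):
    """History rows see history only; a candidate row sees history plus itself."""
    size = n_history + n_candidates
    return [[j < n_history or j == i for j in range(size)] for i in range(size)]


def score_together(history, candidates):
    """One pass: all candidates are appended after the shared history."""
    x = history + candidates
    out = attention(x, append_mask(len(history), len(candidates)))
    return out[len(history):]


def score_separately(history, candidates):
    """Reference: one pass per candidate, re-processing the history each time."""
    return [score_together(history, [c])[0] for c in candidates]
```

Because candidates never attend to one another and history rows never attend to candidates, the history computation is shared across candidates, which is where the saving would come from in this sketch.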

LiRank: Industrial Large Scale Ranking Models at LinkedIn

Feb 10, 2024

Fedor Borisyuk, Mingzhou Zhou, Qingquan Song, Siyu Zhu, Birjodh Tiwana, Ganesh Parameswaran, Siddharth Dangi, Lars Hertel, Qiang Xiao, Xiaochen Hou (+24 more)

Abstract: We present LiRank, a large-scale ranking framework at LinkedIn that brings to production state-of-the-art modeling architectures and optimization methods. We unveil several modeling improvements, including Residual DCN, which adds attention and residual connections to the famous DCNv2 architecture. We share insights into combining and tuning SOTA architectures to create a unified model, including Dense Gating, Transformers and Residual DCN. We also propose novel techniques for calibration and describe how we productionalized deep learning based explore/exploit methods. To enable effective, production-grade serving of large ranking models, we detail how to train and compress models using quantization and vocabulary compression. We provide details about the deployment setup for large-scale use cases of Feed ranking, Jobs Recommendations, and Ads click-through rate (CTR) prediction. We summarize our learnings from various A/B tests by elucidating the most effective technical approaches. These ideas have contributed to relative metrics improvements across the board at LinkedIn: +0.5% member sessions in the Feed, +1.76% qualified job applications for Jobs search and recommendations, and +4.3% for Ads CTR. We hope this work can provide practical insights and solutions for practitioners interested in leveraging large-scale deep ranking systems.

Quantity vs. Quality: On Hyperparameter Optimization for Deep Reinforcement Learning

Jul 30, 2020

Lars Hertel, Pierre Baldi, Daniel L. Gillen

Abstract: Reinforcement learning algorithms can show strong variation in performance between training runs with different random seeds. In this paper we explore how this affects hyperparameter optimization when the goal is to find hyperparameter settings that perform well across random seeds. In particular, we benchmark whether it is better to explore a large quantity of hyperparameter settings via pruning of bad performers, or to aim for quality of collected results by using repetitions. For this we consider the Successive Halving, Random Search, and Bayesian Optimization algorithms, the latter two with and without repetitions. We apply these to tuning the PPO2 algorithm on the Cartpole balancing task and the Inverted Pendulum Swing-up task. We demonstrate that pruning may negatively affect the optimization and that repeated sampling does not help in finding hyperparameter settings that perform better across random seeds. From our experiments we conclude that Bayesian optimization with a noise-robust acquisition function is the best choice for hyperparameter optimization in reinforcement learning tasks.
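
The two strategies contrasted in the abstract above, pruning and repetition, can be sketched on a toy problem. This is a minimal illustration, not the paper's experimental setup: the objective, its optimum at 0.3, the noise model, and the names `make_noisy_objective`, `successive_halving`, and `mean_score` are all assumptions introduced here, and the toy noise deliberately does not shrink with budget.

```python
import random


def make_noisy_objective(noise, seed=0):
    """Hypothetical objective: best at lr = 0.3, plus seed-like Gaussian noise."""
    rng = random.Random(seed)

    def evaluate(lr, budget):
        # The budget is ignored by this toy; a real run would train longer.
        return -(lr - 0.3) ** 2 + rng.gauss(0.0, noise)

    return evaluate


def successive_halving(configs, evaluate, budget=1, eta=2):
    """Quantity: score every survivor once, keep the best 1/eta, raise the budget."""
    survivors = list(configs)
    while len(survivors) > 1:
        scored = [(evaluate(c, budget), c) for c in survivors]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        keep = max(1, len(survivors) // eta)
        survivors = [c for _, c in scored[:keep]]
        budget *= eta
    return survivors[0]


def mean_score(config, evaluate, budget, repeats):
    """Quality: average several noisy evaluations of a single setting."""
    return sum(evaluate(config, budget) for _ in range(repeats)) / repeats
```

With `noise=0.0` the pruning loop always returns the setting closest to 0.3. With larger noise a single unlucky evaluation can eliminate a good setting in an early round, which is the kind of risk the abstract attributes to pruning.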

Sherpa: Robust Hyperparameter Optimization for Machine Learning

May 08, 2020

Lars Hertel, Julian Collado, Peter Sadowski, Jordan Ott, Pierre Baldi

Abstract: Sherpa is a hyperparameter optimization library for machine learning models. It is specifically designed for problems with computationally expensive, iterative function evaluations, such as the hyperparameter tuning of deep neural networks. With Sherpa, scientists can quickly optimize hyperparameters using a variety of powerful and interchangeable algorithms. Sherpa can be run on either a single machine or in parallel on a cluster. Finally, an interactive dashboard enables users to view the progress of models as they are trained, cancel trials, and explore which hyperparameter combinations are working best. Sherpa empowers machine learning practitioners by automating the more tedious aspects of model tuning. Its source code and documentation are available at https://github.com/sherpa-ai/sherpa.

Deep Convolutional Neural Networks as Generic Feature Extractors

Oct 06, 2017

Lars Hertel, Erhardt Barth, Thomas Käster, Thomas Martinetz

Abstract: Recognizing objects in natural images is an intricate problem involving multiple conflicting objectives. Deep convolutional neural networks, trained on large datasets, achieve convincing results and are currently the state-of-the-art approach for this task. However, the long time needed to train such deep networks is a major drawback. We tackled this problem by reusing a previously trained network. For this purpose, we first trained a deep convolutional network on the ILSVRC2012 dataset. We then kept the learned convolution kernels fixed and retrained only the classification part on different datasets. Using this approach, we achieved an accuracy of 67.68% on CIFAR-100, compared to the previous state-of-the-art result of 65.43%. Furthermore, our findings indicate that convolutional networks are able to learn generic feature extractors that can be used for different tasks.

* 4 pages, accepted version for publication in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), July 2015, Killarney, Ireland

CNN-LTE: a Class of 1-X Pooling Convolutional Neural Networks on Label Tree Embeddings for Audio Scene Recognition

Aug 15, 2016

Huy Phan, Lars Hertel, Marco Maass, Philipp Koch, Alfred Mertins

Abstract: We describe in this report our audio scene recognition system submitted to the DCASE 2016 challenge. Firstly, given the label set of the scenes, a label tree is automatically constructed. This category taxonomy is then used in the feature extraction step in which an audio scene instance is represented by a label tree embedding image. Different convolutional neural networks, which are tailored for the task at hand, are finally learned on top of the image features for scene recognition. Our system reaches an overall recognition accuracy of 81.2% and 83.3% and outperforms the DCASE 2016 baseline with absolute improvements of 8.7% and 6.1% on the development and test data, respectively.

* Task1 technical report for the DCASE2016 challenge. arXiv admin note: text overlap with arXiv:1606.07908

CaR-FOREST: Joint Classification-Regression Decision Forests for Overlapping Audio Event Detection

Aug 15, 2016

Huy Phan, Lars Hertel, Marco Maass, Philipp Koch, Alfred Mertins

Abstract: This report describes our submissions to Task2 and Task3 of the DCASE 2016 challenge. The systems aim at detecting overlapping audio events in continuous streams, where the detectors are based on random decision forests. The proposed forests are trained jointly for classification and regression. Initially, the training is classification-oriented to encourage the trees to select discriminative features from overlapping mixtures to separate positive audio segments from the negative ones. The regression phase is then carried out to let the positive audio segments vote for the event onsets and offsets, and therefore model the temporal structure of audio events. One random decision forest is specifically trained for each event category of interest. Experimental results on the development data show that our systems significantly outperform the baseline on the Task2 evaluation while they are inferior to the baseline in the Task3 evaluation.

* Task2 and Task3 technical report for the DCASE2016 challenge

Label Tree Embeddings for Acoustic Scene Classification

Jul 26, 2016

Huy Phan, Lars Hertel, Marco Maass, Philipp Koch, Alfred Mertins

Abstract: We present in this paper an efficient approach for acoustic scene classification by exploring the structure of class labels. Given a set of class labels, a category taxonomy is automatically learned by collectively optimizing a clustering of the labels into multiple meta-classes in a tree structure. An acoustic scene instance is then embedded into a low-dimensional feature representation which consists of the likelihoods that it belongs to the meta-classes. We demonstrate state-of-the-art results on two different datasets for the acoustic scene classification task, including the DCASE 2013 and LITIS Rouen datasets.

* to appear in the Proceedings of ACM Multimedia 2016 (ACMMM 2016)

Classifying Variable-Length Audio Files with All-Convolutional Networks and Masked Global Pooling

Jul 11, 2016

Lars Hertel, Huy Phan, Alfred Mertins

Abstract: We trained a deep all-convolutional neural network with masked global pooling to perform single-label classification for acoustic scene classification and multi-label classification for domestic audio tagging in the DCASE-2016 contest. Our network achieved an average accuracy of 84.5% on the four-fold cross-validation for acoustic scene recognition, compared to the provided baseline of 72.5%, and an average equal error rate of 0.17 for domestic audio tagging, compared to the baseline of 0.21. The network therefore improves on the baselines by a relative amount of 17% and 19%, respectively. The network consists only of convolutional layers to extract features from the short-time Fourier transform and one global pooling layer to combine those features. Notably, it has neither fully-connected layers, apart from the fully-connected output layer, nor dropout layers.

* Technical report for the DCASE-2016 challenge (task 1 and task 4)
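
Masked global pooling, as named in the abstract above, can be sketched as follows. This is a minimal illustration under assumptions made here, not the authors' code: it uses mean pooling (the abstract does not say which pooling function was used), plain Python lists in place of tensors, and a hypothetical function name `masked_global_mean_pool`. Each clip is padded to a common number of time frames, and a 0/1 mask marks which frames are real so that padding does not affect the pooled features.

```python
def masked_global_mean_pool(features, masks):
    """Average each channel over the valid (mask == 1) time frames only.

    features: batch of clips, each a list of frames, each frame a list of channels.
    masks:    batch of 0/1 lists, one entry per frame (1 = real, 0 = padding).
    """
    pooled = []
    for clip, mask in zip(features, masks):
        n_valid = sum(mask)
        n_channels = len(clip[0])
        pooled.append([
            sum(frame[c] * m for frame, m in zip(clip, mask)) / n_valid
            for c in range(n_channels)
        ])
    return pooled


# Two clips padded to three frames; the first clip has only two real frames.
batch = [
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],
    [[2.0, 2.0], [4.0, 6.0], [6.0, 10.0]],
]
masks = [[1, 1, 0], [1, 1, 1]]
print(masked_global_mean_pool(batch, masks))  # [[2.0, 3.0], [4.0, 6.0]]
```

Dividing by the number of valid frames rather than the padded length is what keeps clips of different durations comparable after pooling.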
