Josif Grabocka

Warmstarting for Scaling Language Models

Nov 11, 2024

Neeratyoy Mallik, Maciej Janowski, Johannes Hog, Herilalaina Rakotoarison, Aaron Klein, Josif Grabocka, Frank Hutter

Abstract: Scaling model sizes to scale performance has worked remarkably well for the current large language models paradigm. The research and empirical findings of various scaling studies led to novel scaling results and laws that guide subsequent research. High training costs for contemporary scales of data and models result in a lack of thorough understanding of how to tune and arrive at such training setups. One direction to ameliorate the cost of pretraining large models is to warmstart the large-scale training from smaller models that are cheaper to tune. In this work, we attempt to understand if the behavior of optimal hyperparameters can be retained under warmstarting for scaling. We explore simple operations that allow the application of theoretically motivated methods of zero-shot transfer of optimal hyperparameters using $\mu$Transfer. We investigate the aspects that contribute to the speedup in convergence and the preservation of stable training dynamics under warmstarting with $\mu$Transfer. We find that shrinking smaller model weights, zero-padding, and perturbing the resulting larger model with scaled initialization from $\mu$P enable effective warmstarting of $\mu$Transfer.
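
The three operations named in this abstract (shrink, zero-pad, perturb) can be illustrated on a single weight matrix. The following is a minimal sketch, not the authors' implementation: the function name `warmstart_expand`, the shrink factor, the base standard deviation, and the 1/sqrt(fan_in) noise scaling (chosen in the spirit of $\mu$P hidden-layer initialization) are all assumptions made for illustration.

```python
import math
import random


def warmstart_expand(small, out_dim, in_dim, shrink=0.5, base_std=0.02, seed=0):
    """Grow a small weight matrix into an (out_dim x in_dim) matrix.

    All hyperparameters are illustrative assumptions:
      1. shrink:   scale the small model's weights by `shrink`;
      2. zero-pad: place them in the top-left corner of a zero matrix;
      3. perturb:  add Gaussian noise with std base_std / sqrt(in_dim),
                   i.e. a width-dependent 1/sqrt(fan_in) scaling.
    """
    rng = random.Random(seed)
    noise_std = base_std / math.sqrt(in_dim)
    small_rows, small_cols = len(small), len(small[0])
    large = []
    for i in range(out_dim):
        row = []
        for j in range(in_dim):
            # Inherited entries come from the shrunken small model; the rest are zero.
            inside = i < small_rows and j < small_cols
            inherited = shrink * small[i][j] if inside else 0.0
            row.append(inherited + rng.gauss(0.0, noise_std))
        large.append(row)
    return large
```

Setting `base_std=0.0` disables the perturbation, which makes it easy to check that the small model's (shrunken) weights sit in the top-left block and that everything else is zero.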

Ensembling Finetuned Language Models for Text Classification

Oct 25, 2024

Sebastian Pineda Arango, Maciej Janowski, Lennart Purucker, Arber Zela, Frank Hutter, Josif Grabocka

Abstract: Finetuning is a common practice widespread across different communities to adapt pretrained models to particular tasks. Text classification is one of these tasks for which many pretrained models are available. On the other hand, ensembles of neural networks are typically used to boost performance and provide reliable uncertainty estimates. However, ensembling pretrained models for text classification is not a well-studied avenue. In this paper, we present a metadataset with predictions from five large finetuned models on six datasets, and report results of different ensembling strategies from these predictions. Our results shed light on how ensembling can improve the performance of finetuned text classifiers and incentivize future adoption of ensembles in such tasks.

* Workshop on Fine-Tuning in Modern Machine Learning @ NeurIPS 2024. arXiv admin note: text overlap with arXiv:2410.04520

Lightweight Correlation-Aware Table Compression

Oct 24, 2024

Mihail Stoian, Alexander van Renen, Jan Kobiolka, Ping-Lin Kuo, Josif Grabocka, Andreas Kipf

Abstract: The growing adoption of data lakes for managing relational data necessitates efficient, open storage formats that provide high scan performance and competitive compression ratios. While existing formats achieve fast scans through lightweight encoding techniques, they have reached a plateau in terms of minimizing storage footprint. Recently, correlation-aware compression schemes have been shown to reduce file sizes further. Yet, current approaches either incur significant scan overheads or require manual specification of correlations, limiting their practicability. We present $\texttt{Virtual}$, a framework that integrates seamlessly with existing open formats to automatically leverage data correlations, achieving substantial compression gains while having minimal scan performance overhead. Experiments on data-gov datasets show that $\texttt{Virtual}$ reduces file sizes by up to 40% compared to Apache Parquet.

* Third Table Representation Learning Workshop (TRL @ NeurIPS 2024)
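
To make the idea of correlation-aware compression concrete, here is a small sketch of one possible scheme: a column is stored as a constant offset from a correlated reference column, plus a table of exceptions. This is not the $\texttt{Virtual}$ framework's algorithm, which the abstract does not specify; the function names and the constant-offset model are assumptions chosen only to show why a correlated column can be stored more cheaply.

```python
from collections import Counter


def encode_correlated(reference, target):
    """Encode `target` as `reference + offset`, with exceptions.

    Returns (offset, exceptions), where `exceptions` maps a row index to the
    true value for every row that does not satisfy target == reference + offset.
    """
    diffs = [t - r for r, t in zip(reference, target)]
    # Use the most frequent difference as the offset shared by most rows.
    offset = Counter(diffs).most_common(1)[0][0]
    exceptions = {i: t for i, (d, t) in enumerate(zip(diffs, target)) if d != offset}
    return offset, exceptions


def decode_correlated(reference, offset, exceptions):
    """Rebuild the target column from the reference column."""
    return [exceptions.get(i, r + offset) for i, r in enumerate(reference)]
```

For a pair of columns such as an order day and a shipping day that usually differ by a fixed number of days, only the offset and the handful of exceptional rows need to be stored; the scan-time cost is one addition and one lookup per row.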

Dynamic Post-Hoc Neural Ensemblers

Oct 06, 2024

Sebastian Pineda Arango, Maciej Janowski, Lennart Purucker, Arber Zela, Frank Hutter, Josif Grabocka

Abstract: Ensemble methods are known for enhancing the accuracy and robustness of machine learning models by combining multiple base learners. However, standard approaches like greedy or random ensembles often fall short, as they assume a constant weight across samples for the ensemble members. This can limit expressiveness and hinder performance when aggregating the ensemble predictions. In this study, we explore employing neural networks as ensemble methods, emphasizing the significance of dynamic ensembling to leverage diverse model predictions adaptively. Motivated by the risk of learning low-diversity ensembles, we propose regularizing the model by randomly dropping base model predictions during training. We demonstrate that this approach lower-bounds the diversity within the ensemble, reducing overfitting and improving generalization capabilities. Our experiments showcase that the dynamic neural ensemblers yield competitive results compared to strong baselines in computer vision, natural language processing, and tabular data.

* Preprint under review, 10 pages
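
The two ingredients described in this abstract, per-sample ensemble weights and random dropping of base model predictions, can be sketched as follows. This is an illustration under stated assumptions, not the paper's model: the per-sample `scores` stand in for the output of a neural ensembler, and the function names, the masked softmax, and the keep-at-least-one rule are hypothetical choices.

```python
import math
import random


def dynamic_ensemble(base_preds, scores, keep_mask):
    """Combine base-model predictions for ONE sample with per-sample weights.

    base_preds: one prediction per base model
    scores:     unnormalised per-sample scores (assumed to come from a network)
    keep_mask:  True for models that are kept, False for dropped models
    """
    kept = [s for s, k in zip(scores, keep_mask) if k]
    if not kept:
        raise ValueError("at least one base model must be kept")
    top = max(kept)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) if k else 0.0 for s, k in zip(scores, keep_mask)]
    total = sum(exps)
    weights = [e / total for e in exps]  # dropped models receive weight 0
    prediction = sum(w * p for w, p in zip(weights, base_preds))
    return prediction, weights


def sample_keep_mask(n_models, drop_rate, rng):
    """Randomly drop base models during training, always keeping at least one."""
    mask = [rng.random() >= drop_rate for _ in range(n_models)]
    if not any(mask):
        mask[rng.randrange(n_models)] = True
    return mask
```

Because a dropped model gets zero weight, the ensembler cannot rely on any single base model during training, which is the intuition behind using the dropping step as a regularizer.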

Multi-objective Differentiable Neural Architecture Search

Feb 28, 2024

Rhea Sanjay Sukthanker, Arber Zela, Benedikt Staffler, Samuel Dooley, Josif Grabocka, Frank Hutter

Abstract: Pareto front profiling in multi-objective optimization (MOO), i.e. finding a diverse set of Pareto optimal solutions, is challenging, especially with expensive objectives like neural network training. Typically, in MOO neural architecture search (NAS), we aim to balance performance and hardware metrics across devices. Prior NAS approaches simplify this task by incorporating hardware constraints into the objective function, but profiling the Pareto front necessitates a search for each constraint. In this work, we propose a novel NAS algorithm that encodes user preferences for the trade-off between performance and hardware metrics, and yields representative and diverse architectures across multiple devices in just one search run. To this end, we parameterize the joint architectural distribution across devices and multiple objectives via a hypernetwork that can be conditioned on hardware features and preference vectors, enabling zero-shot transferability to new devices. Extensive experiments with up to 19 hardware devices and 3 objectives showcase the effectiveness and scalability of our method. Finally, we show that, without additional costs, our method outperforms existing MOO NAS methods across qualitatively different search spaces and datasets, including MobileNetV3 on ImageNet-1k and a Transformer space on machine translation.

* 31 pages, 22 figures

Hierarchical Transformers are Efficient Meta-Reinforcement Learners

Feb 09, 2024

Gresa Shala, André Biedenkapp, Josif Grabocka

Abstract: We introduce Hierarchical Transformers for Meta-Reinforcement Learning (HTrMRL), a powerful online meta-reinforcement learning approach. HTrMRL aims to address the challenge of enabling reinforcement learning agents to perform effectively in previously unseen tasks. We demonstrate how past episodes serve as a rich source of information, which our model effectively distills and applies to new contexts. Our learned algorithm is capable of outperforming the previous state-of-the-art and provides more efficient meta-training while significantly improving generalization capabilities. Experimental results, obtained across various simulated tasks of the Meta-World Benchmark, indicate a significant improvement in learning efficiency and adaptability compared to the state-of-the-art on a variety of tasks. Our approach not only enhances the agent's ability to generalize from limited data but also paves the way for more robust and versatile AI systems.

Tabular Data: Is Attention All You Need?

Feb 06, 2024

Guri Zabërgja, Arlind Kadra, Josif Grabocka

Abstract: Deep Learning has revolutionized the field of AI and led to remarkable achievements in applications involving image and text data. Unfortunately, there is inconclusive evidence on the merits of neural networks for structured tabular data. In this paper, we introduce a large-scale empirical study comparing neural networks against gradient-boosted decision trees on tabular data, as well as transformer-based architectures against traditional multi-layer perceptrons (MLP) with residual connections. In contrast to prior work, our empirical findings indicate that neural networks are competitive against decision trees. Furthermore, we find that transformer-based architectures do not outperform simpler variants of traditional MLP architectures on tabular datasets. As a result, this paper helps the research and practitioner communities make informed choices on deploying neural networks on future tabular data applications.

Quick-Tune: Quickly Learning Which Pretrained Model to Finetune and How

Jun 11, 2023

Sebastian Pineda Arango, Fabio Ferreira, Arlind Kadra, Frank Hutter, Josif Grabocka

Abstract: With the ever-increasing number of pretrained models, machine learning practitioners are continuously faced with the question of which pretrained model to use and how to finetune it for a new dataset. In this paper, we propose a methodology that jointly searches for the optimal pretrained model and the hyperparameters for finetuning it. Our method transfers knowledge about the performance of many pretrained models with multiple hyperparameter configurations on a series of datasets. To this aim, we evaluated over 20k hyperparameter configurations for finetuning 24 pretrained image classification models on 87 datasets to generate a large-scale meta-dataset. We meta-learn a multi-fidelity performance predictor on the learning curves of this meta-dataset and use it for fast hyperparameter optimization on new datasets. We empirically demonstrate that our resulting approach can quickly select an accurate pretrained model for a new dataset together with its optimal hyperparameters.

Deep Pipeline Embeddings for AutoML

May 24, 2023

Sebastian Pineda Arango, Josif Grabocka

Abstract: Automated Machine Learning (AutoML) is a promising direction for democratizing AI by automatically deploying Machine Learning systems with minimal human expertise. The core technical challenge behind AutoML is optimizing the pipelines of Machine Learning systems (e.g. the choice of preprocessing, augmentations, models, optimizers, etc.). Existing Pipeline Optimization techniques fail to explore deep interactions between pipeline stages/components. As a remedy, this paper proposes a novel neural architecture that captures the deep interaction between the components of a Machine Learning pipeline. We propose embedding pipelines into a latent representation through a novel per-component encoder mechanism. To search for optimal pipelines, such pipeline embeddings are used within deep-kernel Gaussian Process surrogates inside a Bayesian Optimization setup. Furthermore, we meta-learn the parameters of the pipeline embedding network using existing evaluations of pipelines on diverse collections of related datasets (a.k.a. meta-datasets). Through extensive experiments on three large-scale meta-datasets, we demonstrate that pipeline embeddings yield state-of-the-art results in Pipeline Optimization.

* 9 pages

Breaking the Paradox of Explainable Deep Learning

May 22, 2023

Arlind Kadra, Sebastian Pineda Arango, Josif Grabocka

Abstract: Deep Learning has achieved tremendous results by pushing the frontier of automation in diverse domains. Unfortunately, current neural network architectures are not explainable by design. In this paper, we propose a novel method that trains deep hypernetworks to generate explainable linear models. Our models retain the accuracy of black-box deep networks while offering free-lunch explainability by design. Specifically, our explainable approach requires the same runtime and memory resources as black-box deep models, ensuring practical feasibility. Through extensive experiments, we demonstrate that our explainable deep networks are as accurate as state-of-the-art classifiers on tabular data. In addition, we showcase the interpretability of our method on a recent benchmark by empirically comparing prediction explainers. The experimental results reveal that our models are not only as accurate as their black-box deep-learning counterparts but also as interpretable as state-of-the-art explanation techniques.
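
The phrase "hypernetworks that generate explainable linear models" can be made concrete with a toy example. The sketch below is an assumption-laden illustration, not the paper's architecture: the hypernetwork is reduced to a single tanh layer with hypothetical parameters `hyper_w` and `hyper_b`, and it emits the weights and bias of a linear model for each input.

```python
import math


def hyper_linear_predict(x, hyper_w, hyper_b):
    """Predict with a per-sample linear model generated by a tiny hypernetwork.

    x:       input features, length d
    hyper_w: (d + 1) x d matrix of the hypernetwork (hypothetical, one layer)
    hyper_b: length d + 1 bias vector of the hypernetwork

    The hypernetwork outputs d weights and 1 bias; the prediction is
    w(x) . x + b(x), and w_i * x_i is the attribution of feature i.
    """
    d = len(x)
    generated = [
        math.tanh(sum(wij * xj for wij, xj in zip(row, x)) + bi)
        for row, bi in zip(hyper_w, hyper_b)
    ]
    weights, bias = generated[:d], generated[d]
    attributions = [w * xi for w, xi in zip(weights, x)]
    prediction = sum(attributions) + bias
    return prediction, weights, bias, attributions
```

The prediction decomposes exactly into the per-feature attributions plus the bias, so the explanation comes from the same forward pass as the prediction and needs no separate post-hoc explainer.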
