Nick Erickson

TabRepo: A Large Scale Repository of Tabular Model Evaluations and its AutoML Applications

Nov 06, 2023

David Salinas, Nick Erickson

Abstract: We introduce TabRepo, a new dataset of tabular model evaluations and predictions. TabRepo contains the predictions and metrics of 1206 models evaluated on 200 regression and classification datasets. We illustrate the benefit of our dataset in multiple ways. First, we show that it enables analyses such as comparing Hyperparameter Optimization against current AutoML systems, while also considering ensembling at no cost by using precomputed model predictions. Second, we show that our dataset can be readily leveraged to perform transfer learning. In particular, we show that applying standard transfer-learning techniques outperforms current state-of-the-art tabular systems in accuracy, runtime, and latency.
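The idea of ensembling "at no cost" from precomputed predictions can be illustrated with a minimal sketch. It does not use TabRepo's actual API, and the abstract does not say which ensembling method is used; the greedy forward selection with replacement shown here is an assumed stand-in, and all names (`val_preds`, `greedy_ensemble`, the toy model names) are hypothetical. The point is only that once validation predictions are stored, candidate ensembles can be scored without retraining any model.

```python
from typing import Dict, List


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def average(preds: List[List[float]]) -> List[float]:
    """Element-wise mean of several prediction vectors."""
    return [sum(col) / len(col) for col in zip(*preds)]


def greedy_ensemble(
    val_preds: Dict[str, List[float]], y_val: List[float], n_rounds: int
) -> List[str]:
    """Greedy forward selection with replacement over stored predictions.

    Each round adds the model whose inclusion gives the lowest validation
    error of the averaged ensemble. No model is ever retrained: only the
    precomputed prediction vectors are read.
    """
    chosen: List[str] = []
    for _ in range(n_rounds):
        best = min(
            sorted(val_preds),  # sorted so ties resolve deterministically
            key=lambda name: mse(
                average([val_preds[m] for m in chosen + [name]]), y_val
            ),
        )
        chosen.append(best)
    return chosen


# Toy data (hypothetical): two models with opposite bias and one weak model.
y_val = [1.0, 2.0, 3.0, 4.0]
val_preds = {
    "model_a": [1.5, 2.5, 3.5, 4.5],
    "model_b": [0.5, 1.5, 2.5, 3.5],
    "model_c": [2.0, 2.0, 2.0, 2.0],
}
chosen = greedy_ensemble(val_preds, y_val, n_rounds=2)
ensemble_error = mse(average([val_preds[m] for m in chosen]), y_val)
```

In this toy case the two oppositely biased models cancel each other out, so the selected pair has lower validation error than either model alone.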

AutoGluon-TimeSeries: AutoML for Probabilistic Time Series Forecasting

Aug 10, 2023

Oleksandr Shchur, Caner Turkmen, Nick Erickson, Huibin Shen, Alexander Shirkov, Tony Hu, Yuyang Wang

Abstract: We introduce AutoGluon-TimeSeries, an open-source AutoML library for probabilistic time series forecasting. Focused on ease of use and robustness, AutoGluon-TimeSeries enables users to generate accurate point and quantile forecasts with just 3 lines of Python code. Built on the design philosophy of AutoGluon, AutoGluon-TimeSeries leverages ensembles of diverse forecasting models to deliver high accuracy within a short training time. AutoGluon-TimeSeries combines conventional statistical models, machine-learning-based forecasting approaches, and ensembling techniques. In our evaluation on 29 benchmark datasets, AutoGluon-TimeSeries demonstrates strong empirical performance, outperforming a range of forecasting methods in terms of both point and quantile forecast accuracy, and often even improving upon the best-in-hindsight combination of prior methods.

* Published at AutoML Conference 2023

XTab: Cross-table Pretraining for Tabular Transformers

May 10, 2023

Bingzhao Zhu, Xingjian Shi, Nick Erickson, Mu Li, George Karypis, Mahsa Shoaran

Abstract: The success of self-supervised learning in computer vision and natural language processing has motivated pretraining methods on tabular data. However, most existing tabular self-supervised learning models fail to leverage information across multiple data tables and cannot generalize to new tables. In this work, we introduce XTab, a framework for cross-table pretraining of tabular transformers on datasets from various domains. We address the challenge of inconsistent column types and quantities among tables by utilizing independent featurizers and using federated learning to pretrain the shared component. Tested on 84 tabular prediction tasks from the OpenML-AutoML Benchmark (AMLB), we show that (1) XTab consistently boosts the generalizability, learning speed, and performance of multiple tabular transformers, and (2) by pretraining FT-Transformer via XTab, we achieve performance superior to that of other state-of-the-art tabular deep learning models on various tasks such as regression, binary classification, and multiclass classification.

RLSbench: Domain Adaptation Under Relaxed Label Shift

Feb 06, 2023

Saurabh Garg, Nick Erickson, James Sharpnack, Alex Smola, Sivaraman Balakrishnan, Zachary C. Lipton

Abstract: Despite the emergence of principled methods for domain adaptation under label shift, the sensitivity of these methods to minor shifts in the class conditional distributions remains precariously underexplored. Meanwhile, popular deep domain adaptation heuristics tend to falter when faced with shifts in label proportions. While several papers attempt to adapt these heuristics to accommodate shifts in label proportions, inconsistencies in evaluation criteria, datasets, and baselines make it hard to assess the state of the art. In this paper, we introduce RLSbench, a large-scale relaxed label shift benchmark, consisting of >500 distribution shift pairs that draw on 14 datasets across vision, tabular, and language modalities and compose them with varying label proportions. First, we evaluate 13 popular domain adaptation methods, demonstrating more widespread failures under label proportion shifts than were previously known. Next, we develop an effective two-step meta-algorithm that is compatible with most deep domain adaptation heuristics: (i) pseudo-balance the data at each epoch; and (ii) adjust the final classifier with (an estimate of) target label distribution. The meta-algorithm improves existing domain adaptation heuristics often by 2-10% accuracy points under extreme label proportion shifts and has little (i.e., <0.5%) effect when label proportions do not shift. We hope that these findings and the availability of RLSbench will encourage researchers to rigorously evaluate proposed methods in relaxed label shift settings. Code is publicly available at https://github.com/acmi-lab/RLSbench.
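The two steps of the meta-algorithm can be sketched in a few lines. This is one possible reading of the abstract, not the code from the RLSbench repository: step (i) is assumed here to mean downsampling to equal class counts using pseudo-labels, and step (ii) is assumed to be the standard prior-ratio correction of predicted class probabilities. The function names and toy inputs are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def pseudo_balance(pseudo_labels: Sequence[int], rng: random.Random) -> List[int]:
    """Step (i), assumed form: pick example indices so that every
    pseudo-label class appears equally often (downsample to the rarest)."""
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(pseudo_labels):
        by_class[label].append(idx)
    n_per_class = min(len(ids) for ids in by_class.values())
    selected: List[int] = []
    for label in sorted(by_class):
        selected.extend(rng.sample(by_class[label], n_per_class))
    return selected


def adjust_for_label_shift(
    probs: Sequence[float],
    source_prior: Sequence[float],
    target_prior: Sequence[float],
) -> List[float]:
    """Step (ii), assumed form: reweight each class probability by the
    ratio of (estimated) target prior to source prior, then renormalize."""
    weighted = [p * t / s for p, s, t in zip(probs, source_prior, target_prior)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Toy usage (hypothetical numbers).
indices = pseudo_balance([0, 0, 0, 0, 1, 1], random.Random(0))
adjusted = adjust_for_label_shift([0.6, 0.4], [0.5, 0.5], [0.1, 0.9])
unchanged = adjust_for_label_shift([0.6, 0.4], [0.5, 0.5], [0.5, 0.5])
```

When the target prior equals the source prior, the correction leaves the probabilities unchanged, which is consistent with the abstract's remark that the meta-algorithm has little effect when label proportions do not shift.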

Benchmarking Multimodal AutoML for Tabular Data with Text Fields

Nov 04, 2021

Xingjian Shi, Jonas Mueller, Nick Erickson, Mu Li, Alexander J. Smola

Abstract: We consider the use of automated supervised learning systems for data tables that contain not only numeric/categorical columns but also one or more text fields. Here we assemble 18 multimodal data tables that each contain some text fields and stem from a real business application. Our publicly available benchmark enables researchers to comprehensively evaluate their own methods for supervised learning with numeric, categorical, and text features. To ensure that any single modeling strategy which performs well over all 18 datasets will serve as a practical foundation for multimodal text/tabular AutoML, the diverse datasets in our benchmark vary greatly in: sample size, problem types (a mix of classification and regression tasks), number of features (with the number of text columns ranging from 1 to 28 between datasets), as well as how the predictive signal is decomposed between text vs. numeric/categorical features (and predictive interactions thereof). Over this benchmark, we evaluate various straightforward pipelines to model such data, including standard two-stage approaches where NLP is used to featurize the text such that AutoML for tabular data can then be applied. Compared with human data science teams, the fully automated methodology that performed best on our benchmark (stack ensembling a multimodal Transformer with various tree models) also manages to rank 1st place when fit to the raw text/tabular data in two MachineHack prediction competitions and 2nd place (out of 2380 teams) in Kaggle's Mercari Price Suggestion Challenge.

* Proceedings of the Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks 2021

Fast, Accurate, and Simple Models for Tabular Data via Augmented Distillation

Jun 25, 2020

Rasool Fakoor, Jonas Mueller, Nick Erickson, Pratik Chaudhari, Alexander J. Smola

Abstract: Automated machine learning (AutoML) can produce complex model ensembles by stacking, bagging, and boosting many individual models like trees, deep networks, and nearest neighbor estimators. While highly accurate, the resulting predictors are large, slow, and opaque as compared to their constituents. To improve the deployment of AutoML on tabular data, we propose FAST-DAD to distill arbitrarily complex ensemble predictors into individual models like boosted trees, random forests, and deep networks. At the heart of our approach is a data augmentation strategy based on Gibbs sampling from a self-attention pseudolikelihood estimator. Across 30 datasets spanning regression and binary/multiclass classification tasks, FAST-DAD distillation produces significantly better individual models than one obtains through standard training on the original data. Our individual distilled models are over 10x faster and more accurate than ensemble predictors produced by AutoML tools like H2O/AutoSklearn.

AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data

Mar 13, 2020

Nick Erickson, Jonas Mueller, Alexander Shirkov, Hang Zhang, Pedro Larroy, Mu Li, Alexander Smola

Abstract: We introduce AutoGluon-Tabular, an open-source AutoML framework that requires only a single line of Python to train highly accurate machine learning models on an unprocessed tabular dataset such as a CSV file. Unlike existing AutoML frameworks that primarily focus on model/hyperparameter selection, AutoGluon-Tabular succeeds by ensembling multiple models and stacking them in multiple layers. Experiments reveal that our multi-layer combination of many models offers better use of allocated training time than seeking out the single best model. A second contribution is an extensive evaluation of public and commercial AutoML platforms including TPOT, H2O, AutoWEKA, auto-sklearn, AutoGluon, and Google AutoML Tables. Tests on a suite of 50 classification and regression tasks from Kaggle and the OpenML AutoML Benchmark reveal that AutoGluon is faster, more robust, and much more accurate. We find that AutoGluon often even outperforms the best-in-hindsight combination of all of its competitors. In two popular Kaggle competitions, AutoGluon beat 99% of the participating data scientists after merely 4h of training on the raw data.

Dex: Incremental Learning for Complex Environments in Deep Reinforcement Learning

Jun 19, 2017

Nick Erickson, Qi Zhao

Abstract: This paper introduces Dex, a reinforcement learning environment toolkit specialized for training and evaluation of continual learning methods as well as general reinforcement learning problems. We also present the novel continual learning method of incremental learning, where a challenging environment is solved using optimal weight initialization learned from first solving a similar, easier environment. We show that incremental learning can produce results vastly superior to those of standard methods by providing a strong baseline method across ten Dex environments. Finally, we develop a saliency method for qualitative analysis of reinforcement learning, which shows the impact incremental learning has on network attention.

* NIPS 2017 submission, 10 pages, 26 figures
