Marek Polewczyk

PORTAL: Scalable Tabular Foundation Models via Content-Specific Tokenization

Oct 17, 2024

Marco Spinaci, Marek Polewczyk, Johannes Hoffart, Markus C. Kohler, Sam Thelin, Tassilo Klein

Abstract: Self-supervised learning on tabular data seeks to apply advances from natural language and image domains to the diverse domain of tables. However, current techniques often struggle with integrating multi-domain data and require data cleaning or specific structural requirements, limiting the scalability of pre-training datasets. We introduce PORTAL (Pretraining One-Row-at-a-Time for All tabLes), a framework that handles various data modalities without the need for cleaning or preprocessing. This simple yet powerful approach can be effectively pre-trained on online-collected datasets and fine-tuned to match state-of-the-art methods on complex classification and regression tasks. This work offers a practical advancement in self-supervised learning for large-scale tabular data.

* Accepted at the Table Representation Learning Workshop at NeurIPS 2024

ClusterTabNet: Supervised clustering method for table detection and table structure recognition

Feb 12, 2024

Marek Polewczyk, Marco Spinaci

Abstract: We present a novel deep-learning-based method to cluster words in documents, which we apply to detect and recognize tables given the OCR output. We interpret table structure bottom-up as a graph of relations between pairs of words (belonging to the same row, column, header, as well as to the same table) and use a transformer encoder model to predict its adjacency matrix. We demonstrate the performance of our method on the PubTables-1M dataset as well as PubTabNet and FinTabNet datasets. Compared to the current state-of-the-art detection methods such as DETR and Faster R-CNN, our method achieves similar or better accuracy, while requiring a significantly smaller model.

* 15 pages, 4 figures, submitted. The code will be released at https://github.com/SAP-samples
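
To illustrate the idea of reading table structure off a predicted adjacency matrix, here is a minimal sketch of how word clusters (for example, rows) might be recovered from pairwise scores. The abstract does not say how clusters are extracted, so the thresholding step, the averaging used to symmetrize scores, the connected-components grouping, and the name `clusters_from_adjacency` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List


def clusters_from_adjacency(adj: List[List[float]], threshold: float = 0.5) -> List[List[int]]:
    """Group word indices into clusters from pairwise relation scores.

    adj[i][j] is a hypothetical score in [0, 1] that words i and j belong
    to the same group (e.g. the same table row). Assumed post-processing:
    average the two directions, threshold, then take connected components.
    """
    n = len(adj)
    parent = list(range(n))  # union-find forest, one node per word

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    for i in range(n):
        for j in range(i + 1, n):
            # Symmetrize by averaging both directions before thresholding.
            if (adj[i][j] + adj[j][i]) / 2 >= threshold:
                union(i, j)

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Four words: scores suggest words 0 and 1 share a row, as do words 2 and 3.
row_scores = [
    [1.0, 0.9, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.1, 1.0, 0.7],
    [0.0, 0.2, 0.9, 1.0],
]
rows = clusters_from_adjacency(row_scores)
print(rows)
```

Under these assumptions, one such matrix per relation type (row, column, header, table) would give one clustering each, and the table grid would follow from intersecting the row and column clusters.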
