Foundation models (FMs), pretrained on extensive datasets using self-supervised techniques, learn generalized patterns that transfer to new tasks. This reduces the need for large labeled datasets for each task, saving time and resources by leveraging the broad knowledge acquired during pretraining. Research on FMs has focused primarily on unstructured data, such as text and images, or semi-structured data, such as time series. Structured data, notably tabular data, has received far less attention: despite its prevalence, it remains under-studied, owing to a scarcity of clean datasets and limited research on how well FMs transfer across tabular tasks.
To address this gap, we introduce TabularFM, a framework that incorporates state-of-the-art methods for developing FMs specifically for tabular data, including variants of neural architectures such as GANs, VAEs, and Transformers.
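To make the VAE variant concrete, the sketch below shows a minimal tabular VAE in PyTorch, assuming rows already preprocessed into fixed-width numeric feature vectors; the class name, layer sizes, and loss are hypothetical illustrations, not TabularFM's actual implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TabularVAE(nn.Module):
        # Minimal VAE over fixed-width numeric rows (hypothetical sketch).
        def __init__(self, n_features, latent_dim=16, hidden=64):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
            self.to_mu = nn.Linear(hidden, latent_dim)      # mean of q(z|x)
            self.to_logvar = nn.Linear(hidden, latent_dim)  # log-variance of q(z|x)
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_features)
            )

        def forward(self, x):
            h = self.encoder(x)
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
            return self.decoder(z), mu, logvar

    def vae_loss(x, x_hat, mu, logvar):
        # Reconstruction error plus KL divergence to a standard normal prior.
        recon = F.mse_loss(x_hat, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl

    # One training step on a synthetic batch of 32 rows with 10 columns.
    model = TabularVAE(n_features=10)
    x = torch.randn(32, 10)
    x_hat, mu, logvar = model(x)
    vae_loss(x, x_hat, mu, logvar).backward()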
We curated millions of tabular datasets and released cleaned versions to facilitate the development of tabular FMs. We pretrained FMs on this curated data, benchmarked various learning methods on these datasets, and released the pretrained models together with leaderboards for future comparative studies.
Our fully open-source system provides a comprehensive analysis of the transferability of tabular FMs. By releasing these datasets, pretrained models, and leaderboards, we aim to strengthen the validity and usability of tabular FMs in the near future.