TabPFGen is a Python library for generating synthetic tabular data using energy-based modeling and stochastic gradient Langevin dynamics (SGLD). It is built on the TabPFN transformer architecture, supports both classification and regression tasks, and includes visualization tools for inspecting the generated data.
While many tools exist for generating synthetic images or text, creating realistic tabular data that preserves the statistical properties of the original dataset remains challenging. TabPFGen addresses this need by combining TabPFN with energy-based modeling.
TabPFGen is especially useful when:
Install with pip:

pip install tabpfgen

Classification example:

from tabpfgen import TabPFGen
from tabpfgen.visuals import visualize_classification_results
from sklearn.datasets import load_breast_cancer

# Load data
X, y = load_breast_cancer(return_X_y=True)

# Initialize generator
generator = TabPFGen(n_sgld_steps=500)

# Generate synthetic data
X_synth, y_synth = generator.generate_classification(
    X, y,
    n_samples=100,
    balance_classes=True
)

# Visualize results
visualize_classification_results(
    X, y, X_synth, y_synth,
    feature_names=load_breast_cancer().feature_names
)
Regression example:

from tabpfgen import TabPFGen
from tabpfgen.visuals import visualize_regression_results
from sklearn.datasets import load_diabetes

# Load regression dataset
X, y = load_diabetes(return_X_y=True)

# Initialize generator
generator = TabPFGen(n_sgld_steps=500)

# Generate synthetic regression data
X_synth, y_synth = generator.generate_regression(
    X, y,
    n_samples=100,
    use_quantiles=True
)

# Visualize results
visualize_regression_results(
    X, y, X_synth, y_synth,
    feature_names=load_diabetes().feature_names
)
For detailed documentation and tutorials, visit our tutorial pages.
TabPFGen uses a two-step approach:

1. Feature Generation: SGLD, guided by an energy function, creates synthetic features that statistically resemble the original data.
2. Target Prediction: TabPFN acts as an “expert consultant” that predicts appropriate target values (classification labels or regression values) for the synthetic features.

The entire process is guided by an energy function that compares synthetic data points to the original data, ensuring statistical similarity.
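
As a rough illustration of the feature-generation step, the sketch below runs Langevin updates against a toy energy function. This is not TabPFGen's implementation: the energy (a negative log kernel density over the real rows), the names `energy_gradient` and `langevin_sample`, and the `bandwidth` and `step_size` defaults are all assumptions chosen for the example. It also uses every real row in each gradient, so strictly it is plain (unadjusted) Langevin dynamics; SGLD would estimate the gradient from a random mini-batch instead.

```python
import math
import random


def energy_gradient(x, data, bandwidth):
    """Gradient of the toy energy E(x) = -log sum_i exp(-||x - x_i||^2 / (2 h^2))."""
    sq_dists = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in data]
    # Softmax-style weights over the real rows, shifted by the minimum
    # squared distance so the exponentials cannot underflow to all zeros.
    d_min = min(sq_dists)
    weights = [math.exp(-(d - d_min) / (2 * bandwidth ** 2)) for d in sq_dists]
    total = sum(weights)
    grad = [0.0] * len(x)
    for w, row in zip(weights, data):
        for j, (a, b) in enumerate(zip(x, row)):
            grad[j] += (w / total) * (a - b) / bandwidth ** 2
    return grad


def langevin_sample(data, n_samples, n_steps=200, step_size=0.01,
                    bandwidth=0.5, seed=0):
    """Draw synthetic rows by following noisy gradient steps on the toy energy."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(2 * step_size)
    samples = []
    for _ in range(n_samples):
        # Start each chain from a randomly chosen real row plus small noise.
        x = [v + rng.gauss(0.0, 0.1) for v in rng.choice(data)]
        for _ in range(n_steps):
            grad = energy_gradient(x, data, bandwidth)
            # Langevin update: step downhill in energy, then add Gaussian noise.
            x = [v - step_size * g + noise_scale * rng.gauss(0.0, 1.0)
                 for v, g in zip(x, grad)]
        samples.append(x)
    return samples


# Hypothetical two-cluster "real" table with two features per row.
real_rows = [[0.0, 0.0], [0.2, -0.1], [5.0, 5.0], [5.1, 4.8]]
synthetic_rows = langevin_sample(real_rows, n_samples=3)
print(synthetic_rows)
```

The gradient term pulls each sample toward regions where real rows are dense, while the injected noise keeps the chain from collapsing onto a single training row. In the approach described above, the resulting features would then be passed to TabPFN to obtain targets.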
TabPFGen's visualization tools (visualize_classification_results and visualize_regression_results in tabpfgen.visuals, used in the examples above) help verify the quality of your synthetic data.
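
A numeric check can complement the plots. The sketch below computes a two-sample Kolmogorov-Smirnov statistic per feature in plain Python. It is not part of TabPFGen, and the helper names `ks_statistic` and `per_feature_ks` are hypothetical. Values near 0 mean the real and synthetic marginal distributions nearly coincide; values near 1 mean they barely overlap.

```python
from bisect import bisect_right


def ks_statistic(real, synth):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    real_sorted, synth_sorted = sorted(real), sorted(synth)
    gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for v in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, v) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, v) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def per_feature_ks(X_real, X_synth):
    """KS statistic for each column of two row-major tables."""
    columns_real = zip(*X_real)
    columns_synth = zip(*X_synth)
    return [ks_statistic(r, s) for r, s in zip(columns_real, columns_synth)]


# Hypothetical toy tables: the first feature matches, the second is shifted.
X_real_demo = [[0.0, 0.0], [1.0, 1.0]]
X_synth_demo = [[0.0, 5.0], [1.0, 6.0]]
print(per_feature_ks(X_real_demo, X_synth_demo))
```

Because the helpers only iterate over rows, they should also accept the NumPy arrays returned by the scikit-learn loaders. Note that this compares each feature in isolation, so it says nothing about whether correlations between features are preserved.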
This project is licensed under the MIT License - see the LICENSE file for details.