
Synthesize tabular data

Outdated

Note that this example won't work with the latest version of ydata-synthetic.

Please check ydata-sdk to see how to generate synthetic data.

Using CTGAN to generate tabular synthetic data:

Real-world domains are often described by tabular data, i.e., data that can be structured and organized in a table-like format, where features/variables are represented in columns and observations correspond to rows.

Additionally, real-world data usually comprises both numeric and categorical features. Numeric features encode quantitative values, whereas categorical features represent qualitative measurements.

CTGAN was specifically designed to deal with the challenges posed by tabular datasets, handling mixed (numeric and categorical) data:

Here’s an example of how to synthesize tabular data with CTGAN using the Adult Census Income dataset:

from pmlb import fetch_data

from ydata_synthetic.synthesizers.regular import RegularSynthesizer
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters

# Load data and define the data processor parameters
data = fetch_data('adult')
num_cols = ['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_cols = ['workclass','education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex',
            'native-country', 'target']

# Defining the training parameters
batch_size = 500
epochs = 501
learning_rate = 2e-4
beta_1 = 0.5
beta_2 = 0.9

ctgan_args = ModelParameters(batch_size=batch_size,
                             lr=learning_rate,
                             betas=(beta_1, beta_2))

train_args = TrainParameters(epochs=epochs)
synth = RegularSynthesizer(modelname='ctgan', model_parameters=ctgan_args)
synth.fit(data=data, train_arguments=train_args, num_cols=num_cols, cat_cols=cat_cols)

synth.save('adult_ctgan_model.pkl')

#########################################################
#    Loading and sampling from a trained synthesizer    #
#########################################################
synth = RegularSynthesizer.load('adult_ctgan_model.pkl')
synth_data = synth.sample(1000)
print(synth_data)
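After sampling, a quick sanity check is to compare marginal statistics of the real and synthetic data. The snippet below is a minimal pandas sketch (the column choices are illustrative; any of the columns defined above would work):

import pandas as pd

# Compare summary statistics of a numeric column side by side
print(pd.DataFrame({'real': data['age'].describe(),
                    'synthetic': synth_data['age'].describe()}))

# Compare normalized category frequencies of a categorical column
print(pd.DataFrame({'real': data['sex'].value_counts(normalize=True),
                    'synthetic': synth_data['sex'].value_counts(normalize=True)}))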

Best practices & results optimization

Generate the best synthetic data quality

If you are having a hard time ensuring that CTGAN returns the synthetic data quality you need for your use case, give YData Fabric Synthetic Data a try. Fabric's synthetic data generation is considered the best in terms of quality. Read more about it in this benchmark.

CTGAN, like any other machine learning model, requires optimization both in data preparation and in hyperparameter tuning. Here follows a list of best practices and tips to improve your synthetic data quality:

  • Understand Your Data: Thoroughly understand the characteristics and distribution of your original dataset before using CTGAN. Identify important features, correlations, and patterns in the data. Leverage ydata-profiling to automate the process of understanding your data (a profiling sketch follows this list).

  • Data Preprocessing: Clean and preprocess your data to handle missing values, outliers, and other anomalies before training CTGAN. Standardize or normalize numerical features to ensure consistent scales (a preprocessing sketch follows this list).

  • Feature Engineering: Create additional meaningful features that could improve the quality of the synthetic data.

  • Optimize Model Parameters: Experiment with CTGAN hyperparameters such as epochs, batch_size, and gen_dim to find the values that work best for your specific dataset. Fine-tune the learning rate for better convergence (a small grid-search sketch follows this list).

  • Conditional Generation: Leverage the conditional generation capabilities of CTGAN by specifying conditions for certain features if applicable. Adjust the conditioning mechanism to enhance the relevance of generated samples.

  • Handle Imbalanced Data: If your original dataset is imbalanced, ensure that CTGAN captures the distribution of minority classes effectively. Adjust sampling strategies if needed (an oversampling sketch follows this list).

  • Use Larger Datasets: Train CTGAN on larger datasets when possible to capture a more comprehensive representation of the underlying data distribution.
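A minimal profiling sketch with ydata-profiling, assuming the package is installed; it generates an HTML report for the same Adult dataset used above:

from pmlb import fetch_data
from ydata_profiling import ProfileReport

# Build an automated exploratory report of the original data
data = fetch_data('adult')
profile = ProfileReport(data, title='Adult Census Income - Profiling Report')
profile.to_file('adult_profile.html')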
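For preprocessing, here is a hedged sketch using pandas and scikit-learn; the median imputation and StandardScaler choices are illustrative assumptions, not requirements of CTGAN:

from sklearn.preprocessing import StandardScaler

num_cols = ['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']

# Illustrative choices: impute missing numeric values with the median
# and drop exact duplicate rows before training
data[num_cols] = data[num_cols].fillna(data[num_cols].median())
data = data.drop_duplicates()

# Standardize numeric features to zero mean and unit variance
scaler = StandardScaler()
data[num_cols] = scaler.fit_transform(data[num_cols])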
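For hyperparameter tuning, here is a small grid-search sketch over the same ModelParameters fields used in the example above; the grid values are arbitrary assumptions, and the quality metric is left to you:

from ydata_synthetic.synthesizers.regular import RegularSynthesizer
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters

# Hypothetical grid over batch size and learning rate; keep the run
# whose synthetic sample scores best on your own quality metric
for batch_size in (250, 500):
    for lr in (1e-4, 2e-4):
        model_args = ModelParameters(batch_size=batch_size, lr=lr, betas=(0.5, 0.9))
        synth = RegularSynthesizer(modelname='ctgan', model_parameters=model_args)
        synth.fit(data=data, train_arguments=TrainParameters(epochs=501),
                  num_cols=num_cols, cat_cols=cat_cols)
        sample = synth.sample(1000)
        # ... evaluate `sample` against `data` and record the configuration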
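For imbalanced data, one naive rebalancing option is to oversample minority classes with pandas before fitting, as sketched below for the target column of the Adult dataset (dedicated tools such as imbalanced-learn are an alternative):

# Naive oversampling: upsample every class in 'target' to the majority size
majority_size = data['target'].value_counts().max()
data_balanced = (data.groupby('target', group_keys=False)
                     .apply(lambda g: g.sample(majority_size, replace=True, random_state=42))
                     .reset_index(drop=True))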