Great Expectations is a Python-based open-source library for validating, documenting, and profiling your data. It helps you maintain data quality and improve communication about data between teams. With Great Expectations, you can assert what you expect from the data you load and transform, and catch data issues quickly – Expectations are essentially unit tests for your data.
About Great Expectations
Expectations are assertions about your data. In Great Expectations, these assertions are expressed in a declarative language in the form of simple, human-readable Python methods. For example, to assert that the values in a column passenger_count in your dataset should be integers between 1 and 6, you can say:
Great Expectations then uses this statement to validate whether the values in the passenger_count column of a given table are indeed integers between 1 and 6, and returns a success or failure result. The library currently provides several dozen highly expressive built-in Expectations, and also allows you to write custom Expectations.
Great Expectations renders Expectations to clean, human-readable documentation called Data Docs. These HTML docs contain both your Expectation Suites and the data validation results from each validation run – think of them as a continuously updated data quality report.
Validating your Synthetic Data with Great Expectations
1. Install the required libraries:
We recommend creating a virtual environment and installing ydata-synthetic and great-expectations by running the following command in your terminal.
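A typical install command would look like this (package names as published on PyPI):

```shell
pip install ydata-synthetic great-expectations
```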
2. Generate your Synthetic Data:
In this example, we'll use CTGAN to synthesize samples from the Adult Census Income dataset:
from pmlb import fetch_data

from ydata_synthetic.synthesizers.regular import RegularSynthesizer
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters

# Load data and define the data processor parameters
data = fetch_data('adult')
num_cols = ['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_cols = ['workclass', 'education', 'education-num', 'marital-status',
            'occupation', 'relationship', 'race', 'sex', 'native-country', 'target']

# Defining the training parameters
batch_size = 500
epochs = 500 + 1
learning_rate = 2e-4
beta_1 = 0.5
beta_2 = 0.9

ctgan_args = ModelParameters(batch_size=batch_size,
                             lr=learning_rate,
                             betas=(beta_1, beta_2))
train_args = TrainParameters(epochs=epochs)

synth = RegularSynthesizer(modelname='ctgan', model_parameters=ctgan_args)
synth.fit(data=data, train_arguments=train_args,
          num_cols=num_cols, cat_cols=cat_cols)

# Sample from the trained synthesizer and save the synthetic data
synth_data = synth.sample(1000)
synth_data.to_csv('data/adult_synthetic.csv', index=False)
3. Create a Data Context and Connect to Data:
Import the great_expectations module, create a Data Context, and connect to your synthetic data:
4. Create Expectations:
5. Validate Data:
To validate your data, define a checkpoint and examine the data to determine if it matches the defined Expectations:
And use the following code to view an HTML representation of the Validation results: