Overview


YData-Synthetic is an open-source package developed in 2020 with the primary goal of educating users about generative models for synthetic data generation. Designed as a collection of models, it was intended for exploratory studies and educational purposes. However, it was not optimized for the quality, performance, and scalability needs typically required by organizations.

We are now ydata-sdk!

Even though the journey was fun and we have learned a lot from the community, it is now time to upgrade ydata-synthetic.

Heading towards the future of synthetic data generation, we recommend that users transition to ydata-sdk, which provides a superior experience with enhanced performance, precision, and ease of use, making it the preferred tool for synthetic data generation and a perfect introduction to Generative AI.

Supported Data Types

Tabular data does not have a temporal dependence and can be organized in a table-like format, where features are represented in columns and observations correspond to rows.

Additionally, tabular data usually comprises both numeric and categorical features. Numeric features encode quantitative values, whereas categorical features represent qualitative measurements. Categorical features can be further divided into ordinal, binary or boolean, and nominal features.
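As a rough illustration of the numeric versus categorical distinction, the sketch below splits the columns of a small table by the Python type of their values. The helper name `split_feature_types` and the sample rows are made up for this example and are not part of ydata-synthetic or ydata-sdk; real datasets usually need more careful type inference (for instance, integer-coded categories).

```python
from numbers import Number


def split_feature_types(rows):
    """Split column names into numeric and categorical, based on the values seen.

    Hypothetical helper for illustration only; assumes every row has the same keys.
    """
    numeric, categorical = [], []
    for column in rows[0]:
        values = [row[column] for row in rows]
        # bool is a subclass of int in Python, so exclude it explicitly:
        # a boolean flag is a binary categorical feature, not a quantity.
        is_numeric = all(
            isinstance(v, Number) and not isinstance(v, bool) for v in values
        )
        (numeric if is_numeric else categorical).append(column)
    return numeric, categorical


# Made-up sample table: one dict per observation (row), one key per feature (column).
rows = [
    {"age": 34, "income": 52000.0, "education": "bachelor", "is_member": True, "city": "Porto"},
    {"age": 51, "income": 71000.0, "education": "master", "is_member": False, "city": "Lisbon"},
    {"age": 27, "income": 38000.0, "education": "high school", "is_member": True, "city": "Braga"},
]

numeric_cols, categorical_cols = split_feature_types(rows)
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

In this toy table, `education` would be an ordinal feature, `is_member` a binary one, and `city` a nominal one; the type check alone cannot tell these apart, so that finer split still has to come from domain knowledge.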

Learn more about synthesizing tabular data in this article, or check the quickstart guide to get started with the synthesization of tabular datasets.

Time-series data exhibit a sequential, temporal dependency between records, and may present a wide range of patterns and trends, including seasonality (patterns that repeat at calendar periods such as days, weeks, or months; holiday sales, for instance) or periodicity (patterns that repeat over time).
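To make the idea of a repeating pattern concrete, here is a minimal sketch that builds a made-up daily series with a weekly cycle and measures how strongly it correlates with a shifted copy of itself. The `autocorrelation` helper and the series are assumptions written for this example; they are not taken from ydata-synthetic.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag (illustrative helper)."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance / variance


# Eight weeks of made-up daily values that repeat every 7 days.
series = [10 + 5 * math.sin(2 * math.pi * day / 7) for day in range(56)]

weekly = autocorrelation(series, 7)     # shift by one full cycle
half_week = autocorrelation(series, 3)  # shift by roughly half a cycle

print(f"lag 7: {weekly:.3f}")
print(f"lag 3: {half_week:.3f}")
```

A strongly positive value at lag 7 combined with a negative value at lag 3 is the signature of a weekly pattern. A synthetic time series that is meant to preserve seasonality should show a similar profile to the original.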

Read more about generating time-series data in this article and check this quickstart guide to get started with time-series data synthesization.

Multi-Table data or databases exhibit referential behaviour between tables and a database schema, both of which the generated synthetic data is expected to replicate and respect. Read more about database synthetic data generation in this article and check this quickstart guide for Multi-Table synthetic data generation.
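One simple way to check that synthetic tables respect referential behaviour is to look for child rows whose foreign key has no matching parent row. The sketch below assumes two made-up tables, `customers` and `orders`, and a hypothetical helper `orphaned_foreign_keys`; none of these names come from ydata-synthetic or ydata-sdk.

```python
def orphaned_foreign_keys(parent_rows, child_rows, parent_key, foreign_key):
    """Return child rows whose foreign key does not exist in the parent table.

    Illustrative helper: tables are plain lists of dicts.
    """
    parent_ids = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] not in parent_ids]


# Made-up synthetic tables: orders.customer_id should reference customers.customer_id.
customers = [
    {"customer_id": 1, "name": "A"},
    {"customer_id": 2, "name": "B"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 3},  # no such customer: breaks referential integrity
]

orphans = orphaned_foreign_keys(customers, orders, "customer_id", "customer_id")
print("orphaned orders:", [row["order_id"] for row in orphans])
```

An empty result means every synthetic child row points at an existing parent row, which is the minimum a multi-table synthesizer needs to guarantee.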

Validate the quality of your generated synthetic data

Validating the quality of synthetic data is essential to ensure its usefulness and privacy. YData Fabric provides tools for comprehensive synthetic data evaluation through:

  1. Profile Comparison Visualization: Fabric delivers side-by-side visual comparisons of key data properties (e.g., distributions, correlations, and outliers) between synthetic and original datasets, allowing users to assess fidelity at a glance.

  2. PDF Report with Metrics: Fabric generates a PDF report that includes key metrics to evaluate:

     • Fidelity: How closely synthetic data matches the original.
     • Utility: How well it performs in real-world tasks.
     • Privacy: Risk assessment of data leakage and re-identification.

These tools ensure a thorough validation of synthetic data quality, making it reliable for real-world use.
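To give a feel for what fidelity and privacy metrics can look like, here is a minimal sketch of two simple checks: the total variation distance between category frequencies (lower means closer distributions) and the share of synthetic rows that are exact copies of original rows (lower means less direct leakage). These are generic illustrations chosen for this example, not the metrics that YData Fabric actually computes, and the data is made up.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """Half the sum of absolute differences between category frequencies (0 = identical)."""
    real_freq = Counter(real_values)
    synth_freq = Counter(synthetic_values)
    categories = set(real_freq) | set(synth_freq)
    # Counter returns 0 for categories that are missing on one side.
    return 0.5 * sum(
        abs(real_freq[c] / len(real_values) - synth_freq[c] / len(synthetic_values))
        for c in categories
    )


def exact_match_rate(real_rows, synthetic_rows):
    """Fraction of synthetic rows that also appear verbatim in the original data."""
    real_set = set(real_rows)
    return sum(row in real_set for row in synthetic_rows) / len(synthetic_rows)


# Made-up rows: (category, value) tuples.
real = [("A", 1), ("A", 2), ("B", 3), ("B", 4)]
synthetic = [("A", 1), ("A", 5), ("A", 6), ("B", 7)]

tvd = total_variation_distance([r[0] for r in real], [s[0] for s in synthetic])
leak = exact_match_rate(real, synthetic)

print(f"fidelity (TVD on category): {tvd:.2f}")
print(f"privacy (exact-match rate): {leak:.2f}")
```

Utility is usually assessed differently, for example by training a model on synthetic data and evaluating it on held-out original data, which is beyond the scope of this sketch.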

Supported Generative AI Models

With the upcoming update of ydata-synthetic to ydata-sdk, users will have access to a single API that automatically selects and optimizes the best generative model for their data. This streamlined approach eliminates the need to choose between various models manually, as the API identifies the optimal model based on the specific dataset and use case.

Instead of having to manually select from models such as:

  • GAN
  • CGAN (Conditional GAN)
  • WGAN (Wasserstein GAN)
  • WGAN-GP (Wasserstein GAN with Gradient Penalty)
  • DRAGAN (Deep Regret Analytic GAN)
  • Cramer GAN (Cramer Distance Solution to Biased Wasserstein Gradients)
  • CWGAN-GP (Conditional Wasserstein GAN with Gradient Penalty)
  • CTGAN (Conditional Tabular GAN)
  • TimeGAN (specifically for time-series data)
  • DoppelGANger (specifically for time-series data)

The new API handles model selection automatically, optimizing for the best performance in fidelity, utility, and privacy. This significantly simplifies the synthetic data generation process, so that users get high-quality output without manual intervention or tedious hyperparameter tuning.