
sdv's Introduction


This repository is part of The Synthetic Data Vault Project, a project from DataCebo.

Overview

The Synthetic Data Vault (SDV) is a Python library designed to be your one-stop shop for creating tabular synthetic data. The SDV uses a variety of machine learning algorithms to learn patterns from your real data and emulate them in synthetic data.

Features

🧠 Create synthetic data using machine learning. The SDV offers multiple models, ranging from classical statistical methods (GaussianCopula) to deep learning methods (CTGAN). Generate data for single tables, multiple connected tables or sequential tables.

📊 Evaluate and visualize data. Compare the synthetic data to the real data against a variety of measures. Diagnose problems and generate a quality report to get more insights.

🔄 Preprocess, anonymize and define constraints. Control data processing to improve the quality of synthetic data, choose from different types of anonymization, and define business rules in the form of logical constraints.

Important Links
Tutorials Get some hands-on experience with the SDV. Launch the tutorial notebooks and run the code yourself.
📖 Docs Learn how to use the SDV library with user guides and API references.
📙 Blog Get more insights about using the SDV, deploying models and our synthetic data community.
Community Join our Slack workspace for announcements and discussions.
💻 Website Check out the SDV website for more information about the project.

Install

The SDV is publicly available under the Business Source License. Install SDV using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.

pip install sdv
conda install -c pytorch -c conda-forge sdv

Getting Started

Load a demo dataset to get started. This dataset is a single table describing guests staying at a fictional hotel.

from sdv.datasets.demo import download_demo

real_data, metadata = download_demo(
    modality='single_table',
    dataset_name='fake_hotel_guests')

Single Table Metadata Example

The demo also includes metadata: a description of the dataset, including the data types of each column and the primary key (guest_email).
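As an illustration, dumping the metadata to a dict might look roughly like this (an abridged, hypothetical sketch; the exact columns and fields depend on your SDV version):

metadata.to_dict()
# {
#     'primary_key': 'guest_email',
#     'columns': {
#         'guest_email': {'sdtype': 'email', 'pii': True},
#         'room_type': {'sdtype': 'categorical'},
#         'amenities_fee': {'sdtype': 'numerical'},
#         'checkin_date': {'sdtype': 'datetime'},
#         ...
#     }
# }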

Synthesizing Data

Next, we can create an SDV synthesizer, an object that you can use to create synthetic data. It learns patterns from the real data and replicates them to generate synthetic data. Let's use the GaussianCopulaSynthesizer.

from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=real_data)

And now the synthesizer is ready to create synthetic data!

synthetic_data = synthesizer.sample(num_rows=500)

The synthetic data will have the following properties:

  • Sensitive columns are fully anonymized. The email, billing address and credit card number columns contain new data, so you don't expose the real values.
  • Other columns follow statistical patterns. For example, the proportion of room types, the distribution of check-in dates and the correlations between room rate and room type are preserved.
  • Keys and other relationships are intact. The primary key (guest email) is unique for each row. If you have multiple tables, the connections between primary and foreign keys make sense.

Evaluating Synthetic Data

The SDV library allows you to evaluate the synthetic data by comparing it to the real data. Get started by generating a quality report.

from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    real_data,
    synthetic_data,
    metadata)
Generating report ...

(1/2) Evaluating Column Shapes: |████████████████| 9/9 [00:00<00:00, 1133.09it/s]|
Column Shapes Score: 89.11%

(2/2) Evaluating Column Pair Trends: |██████████████████████████████████████████| 36/36 [00:00<00:00, 502.88it/s]|
Column Pair Trends Score: 88.3%

Overall Score (Average): 88.7%

This object computes an overall quality score on a scale of 0 to 100% (100 being the best), as well as detailed breakdowns.
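As a sketch of one way to drill down (recent SDV versions expose a get_details method on the report object; the exact API may vary by version):

# Inspect a single property of the report; this returns a DataFrame
# with one score per column.
details = quality_report.get_details(property_name='Column Shapes')
print(details.head())

For more insights, you can also visualize the synthetic vs. real data.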

from sdv.evaluation.single_table import get_column_plot

fig = get_column_plot(
    real_data=real_data,
    synthetic_data=synthetic_data,
    column_name='amenities_fee',
    metadata=metadata
)
    
fig.show()

[Figure: Real vs. Synthetic Data]

What's Next?

Using the SDV library, you can synthesize single table, multi table and sequential data. You can also customize the full synthetic data workflow, including preprocessing, anonymization and adding constraints.

To learn more, visit the SDV Demo page.

Credits

Thank you to our team of contributors who have built and maintained the SDV ecosystem over the years!

View Contributors

Citation

If you use SDV for your research, please cite the following paper:

Neha Patki, Roy Wedge, Kalyan Veeramachaneni. The Synthetic Data Vault. IEEE DSAA 2016.

@inproceedings{
    SDV,
    title={The Synthetic data vault},
    author={Patki, Neha and Wedge, Roy and Veeramachaneni, Kalyan},
    booktitle={IEEE International Conference on Data Science and Advanced Analytics (DSAA)},
    year={2016},
    pages={399-410},
    doi={10.1109/DSAA.2016.49},
    month={Oct}
}



The Synthetic Data Vault Project was first created at MIT's Data to AI Lab in 2016. After 4 years of research and traction with enterprise, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data, including:

  • 🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
  • 🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular, multi table and time series data.
  • 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data generation models.

Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.

sdv's People

Contributors

amontanez24, arashakhgari, aylr, csala, deathn0t, dyuliu, fealho, frances-h, github-actions[bot], gsheni, jdtheripperpc, katxiao, kveerama, lajohn4747, ludovicc, manuelalvarezc, npatki, pvk-developer, r-palazzo, rollervan, rwedge, sarahmish, sdv-team, srinify, tssbas, xamm

sdv's Issues

Separate Data Loading into another class

Description

  • Create another class responsible for loading data and returning a DataNavigator object (see the sketch below).
  • Remove all loading logic from DataNavigator and move it into the new class.
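A minimal sketch of the proposed split (the meta.json layout shown here, a 'tables' list with 'name' and 'path' keys, is assumed; DataNavigator is the existing class):

import json
import pandas as pd

class CSVDataLoader:
    """Owns all the I/O and hands back a ready-to-use DataNavigator."""

    def __init__(self, meta_filename):
        self.meta_filename = meta_filename

    def load_data(self):
        with open(self.meta_filename) as meta_file:
            meta = json.load(meta_file)
        # Read every table listed in the metadata into a dataframe.
        tables = {
            table['name']: pd.read_csv(table['path'])
            for table in meta['tables']
        }
        # DataNavigator keeps the navigation logic only, no loading.
        return DataNavigator(meta, tables)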

Rename Variables

Description

In DataNavigator:
transformed -> transformed_data
_parse_data -> _parse_meta_data

In Modeler:
model_type -> tuple of the overall model type name and a list of parameters, e.g. ('GaussianCopula', ['GaussianUnivariate'])
sets -> conditional_data

SDV Modeler Index issue

  • Lines 54, 63 and 64 of Modeler should be changed to use iloc, because the index of a dataframe doesn't always start at 0 and increase sequentially.
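A quick illustration of the difference in plain pandas:

import pandas as pd

# The label at position 2 is not the label 2, so .loc and .iloc disagree:
df = pd.DataFrame({'a': [10, 20, 30]}, index=[7, 2, 9])
print(df.loc[2, 'a'])     # 20 -- label-based lookup
print(df.iloc[2]['a'])    # 30 -- position-based lookup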

Use PyPI versions of Copulas and RDT

This week two of the project's dependencies have released new versions. We should check that everything works fine with the new versions and update the project dependencies accordingly.

Add support for modelling multi-parent tables

Currently SDV is not able to model or sample a table with multiple foreign keys, whether they point to different tables or to the same table repeated.

We should find a way to model and sample such tables.

README fixes

After running the README step by step I found some issues that need to be fixed:

  • Format Python snippets as such, instead of as bash.

  • In the install instructions, replace the conda instructions with vanilla venv if an environment is really needed (we can just include the normal install-from-sources instructions).

  • In the code examples, replace import * with the concrete modules to import.

  • When showing the values of a dataframe:
    · Avoid using print, as it is redundant.
    · Don't print the whole dataframe; a transposed head (df.head(3).T) is more readable.

  • When users_meta (a nested dict) is obtained, it is displayed using print, which flattens it and makes its structure harder to understand. It would be better to display it without print, or to use pprint instead.

  • In save_model, create the models folder if it doesn't exist.

Even though nothing crashes, warnings arise at some points of the execution; solving them would be a plus:

>>> modeler.model_database()
/home/xino/.virtualenvs/sdv_mit/lib/python3.6/site-packages/numpy/lib/function_base.py:3082: RuntimeWarning: invalid value encountered in subtract
  X -= avg[:, None]
/home/xino/.virtualenvs/sdv_mit/lib/python3.6/site-packages/pandas/core/frame.py:5550: RuntimeWarning: Degrees of freedom <= 0 for slice
  baseCov = np.cov(mat.T)
/home/xino/.virtualenvs/sdv_mit/lib/python3.6/site-packages/numpy/lib/function_base.py:3088: RuntimeWarning: divide by zero encountered in double_scalars
  c *= 1. / np.float64(fact)
/home/xino/.virtualenvs/sdv_mit/lib/python3.6/site-packages/numpy/lib/function_base.py:3088: RuntimeWarning: invalid value encountered in multiply
  c *= 1. / np.float64(fact)
/home/xino/Pythia/MIT/SDV/sdv/Modeler.py:83: RuntimeWarning: '>' not supported between instances of 'str' and 'int', sort order is undefined for incomparable objects
  extended_table = extended_table.append(row, ignore_index=True)
>>> sampler.sample_all()
/home/xino/.virtualenvs/sdv_mit/src/copulas/copulas/multivariate/GaussianCopula.py:88: RuntimeWarning: covariance is not positive-semidefinite.
  samples = np.random.multivariate_normal(clean_mean, clean_cov, size=s)
/home/xino/.virtualenvs/sdv_mit/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:1907: RuntimeWarning: invalid value encountered in add
  lower_bound = self.a * scale + loc

Enforce data constraints

After this issue is solved, we should be ready to enforce data constraints on sampled data.

To implement them, constraints should be checked after the data is sampled and reverse-transformed, but before it is returned. This should happen in sdv.Sampler.sample_rows, as it is the common entry point to the sampling process for the three public methods. The roadmap should be as follows (a sketch comes after the list):

1. Create a method sdv.Sampler.check_constraints that takes a sampled and reverse-transformed dataframe and returns an array of indices corresponding to the rows that fulfill the constraints.

2. Modify the method sdv.Sampler.sample_rows so that, before returning the result, it checks that the data fulfill the constraints, discards the rows that fail, and samples again until it reaches the desired number of rows.
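A minimal sketch of that flow (check_constraints and the retry loop are hypothetical; _sample_batch stands in for the existing sample-and-reverse-transform step, and each constraint is assumed to be a callable returning a boolean mask):

import pandas as pd

def check_constraints(dataframe, constraints):
    """Return the indices of the rows that fulfill every constraint."""
    mask = pd.Series(True, index=dataframe.index)
    for constraint in constraints:
        mask &= constraint(dataframe)
    return dataframe.index[mask]

def sample_rows(_sample_batch, constraints, num_rows):
    """Sample, drop the rows that fail, and retry until num_rows are valid."""
    valid = []
    total = 0
    while total < num_rows:
        batch = _sample_batch(num_rows)
        batch = batch.loc[check_constraints(batch, constraints)]
        valid.append(batch)
        total += len(batch)
    return pd.concat(valid).head(num_rows).reset_index(drop=True)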

Remove Primary Key Requirement

Description

SDV currently requires that a primary key be defined for every table in the meta. This is not actually necessary, and the requirement should be removed.

Extended parameters not being passed up

Description

If a table has both a parent and a child, it currently isn't passing its added parameters up to its parent during modeling.

The fix should be on line 147.

Fix hyper_fit_transform call

RDT changed the method in hyper transformer from hyper_fit_transform to fit_transform. SDV still calls the old method on line 108 of DataNavigator.py. This should be changed to call the new method.

Remove hyper transformer dependency from Modeler

Description

  • In Modeler.py, hyper_transformer is used to clean tables before modeling (remove added NaNs). This dependency should be removed, and the imputing should be done within Modeler.py itself (see the sketch below).
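A minimal sketch of doing the imputing inside Modeler itself (column-mean imputation is just one option):

import pandas as pd

def impute_table(table):
    """Fill NaNs with the column means before modeling."""
    return table.fillna(table.mean(numeric_only=True))

# impute_table(pd.DataFrame({'a': [1.0, None, 3.0]}))  ->  a = [1.0, 2.0, 3.0]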

Improve copula parameters sampling

During the modeling of the database in sdv.Modeler, extensions are created for each row of the parent tables, containing the parameters to model the children tables.

At sampling time, these extensions are sampled too, and later the parameters are extracted and used to create the models that sample the children rows.

When creating new models from the sampled parameters, sometimes the models are created with inconsistent values. So far the following have been found:

  1. The sampled covariance matrix may not be positive-semidefinite, which is a requirement for the copulas.multivariate.GaussianMultivariate copula, and raises this warning:

    sdv_mit/lib/python3.6/site-packages/copulas/multivariate/gaussian.py:199: RuntimeWarning: covariance is not positive-semidefinite.
       samples = np.random.multivariate_normal(means, clean_cov, size=size)
    
  2. If by any chance the sampled value for the std of the copulas.univariate.GaussianUnivariate distribution is negative or zero, the generated samples will be np.nan.

Sample Parents Using Copulas

Description

Using the models generated by the modeler, we want to sample rows for parents. Every time a new row is sampled, the primary key and row should be stored, so that children can generate models for the primary keys.

Add Evaluation Metrics

Find a way to evaluate the output of SDV.

  • Time
  • Accuracy for numeric columns
  • Accuracy for categorical columns

Change the modeler save/load API.

Right now, the functionality to save/load a model is invoked like this:

modeler.save_model('demo_model')
modeler = sdv.utils.load_model('sdv/models/demo_model.pkl')

Here we are saving and loading the same model.
A few problems arise:

1. The input value of both functions should be the same, to avoid confusion.

2. The saving builds a path relative to the modeler file, while the loading uses the path as it comes. This can cause unexpected behavior for the end user. Could we take this value from a configuration file?

3. It makes little sense for a function that loads a class instance to live as a standalone function in a separate module, when it could be a classmethod on the Modeler class (see the sketch below).
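A sketch of the proposed symmetric API (hypothetical code, not the current implementation):

import pickle

class Modeler:

    def save(self, path):
        """Serialize this instance to the given path."""
        with open(path, 'wb') as output:
            pickle.dump(self, output)

    @classmethod
    def load(cls, path):
        """Deserialize a previously saved instance from the same path."""
        with open(path, 'rb') as stream:
            return pickle.load(stream)

# The same value goes in and comes out:
#   modeler.save('models/demo_model.pkl')
#   modeler = Modeler.load('models/demo_model.pkl')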

Ensure primary key uniqueness across different calls

Currently, primary keys are generated using the exrex module and the regex from the meta.json file. As implemented, if we sample a single time we are guaranteed that the primary keys will be unique; however, if we sample more than once, it is possible to obtain keys that were already returned in a previous call.

Should we ensure uniqueness in this scenario?
Note that if we do this, we will only be able to sample as many rows as the regex has distinct matches; afterwards we'll need a way to reset the database before sampling anything else.

For example, if we had a dataset consisting of a single table, with a single column, which is the primary key with regex [1-5]{1}, then the following could happen:

>>> ...
>>> first_samples = sampler.sample_all(num_rows=3)
>>> first_samples.T
   primary_key
0            1
1            2
2            3

# Then it is not guaranteed that, if we sample one more row, its primary key will be either 4 or 5
>>> second_sample = sampler.sample_all(num_rows=1)
>>> second_sample
   primary_key
0            3
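One possible sketch of uniqueness across calls, keeping track of every key already handed out (PrimaryKeyGenerator is hypothetical; exrex.generate enumerates the matches of a regex):

import exrex

class PrimaryKeyGenerator:

    def __init__(self, regex):
        self.regex = regex
        self.issued = set()

    def generate(self, num_keys):
        """Return num_keys regex matches that were never issued before."""
        keys = []
        for candidate in exrex.generate(self.regex):
            if candidate not in self.issued:
                self.issued.add(candidate)
                keys.append(candidate)
                if len(keys) == num_keys:
                    return keys
        raise ValueError('No unused matches left; the generator must be reset.')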

Add testing datasets with more complex relationships

The current dataset we are using for unit testing is quite simple, and I'm afraid some issues may arise when working with datasets with more complex relations.

My proposal is to add datasets with:

  • A single parent and multiple children
  • A child with multiple parents
  • Multiple multi-level relations (a child whose parents are also parents of some of the parents of the child,...)

This would help us catch edge cases we may not have considered.

Add copula models to modeler.py

Description

The modeler should now store copula models as it runs RCPA. RCPA should now add the flattened models to tables.

Generate multiple rows

Description

Generate more than one row at a time.

Add get_dataframe and get_metadata functions to DataNavigator

Currently, it is unclear how a user can access the dataframes or metadata for a specific table. Functions should be added to DataNavigator to make this easier (see the sketch below):

  • get_dataframe(table_name): returns the dataframe for the specified table
  • get_meta_data(table_name): returns the meta information for the specified table
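A minimal sketch, assuming DataNavigator keeps a tables dict that maps each table name to an object with data and meta attributes (the internal layout is an assumption):

def get_dataframe(self, table_name):
    """Return the dataframe for the specified table."""
    return self.tables[table_name].data

def get_meta_data(self, table_name):
    """Return the meta information for the specified table."""
    return self.tables[table_name].meta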

Fix NaN problem that happens when modelling

Description

Covariance matrices are being filled with NaNs. This is likely because the foreign key column is being modeled (all of its values are the same, which causes the column to become all NaNs when creating a copula from it).

What I Did

Running RCPA causes many of the copula models to end up with covariance matrices filled with NaNs.

Add support for Vine Copulas as a modeler.

It could be useful to add support for different models. To achieve that we should:

  1. Wait until this issue of Copulas is done and released.

  2. Update our requirements to work with the latest version of copulas.

  3. Add a new method sdv.Modeler.flatten_dict that gets a nested dictionary and returns it flattened (see the sketch after this list):

     >>> nested_dict
     {
         'one_attribute': 0,
         'nested_attribute': {
             'foo': 'bar'
         }
     }

     >>> sdv.Modeler.flatten_dict(nested_dict)
     {
         'one_attribute': 0,
         'nested_attribute__foo': 'bar'
     }
  4. Add a new method sdv.Sampler.unflatten_dict that does the exact opposite, that is:

     >>> assert nested_dict == sdv.Sampler.unflatten_dict(sdv.Modeler.flatten_dict(nested_dict))
     >>> assert flattened_dict == sdv.Modeler.flatten_dict(sdv.Sampler.unflatten_dict(flattened_dict))
  5. Change the behavior of sdv.Modeler.flatten_model so that it receives a model as input and returns a pandas.Series with the flattened model dict.

  6. Rename the distribution keyword of sdv.Modeler.__init__ to model_kwargs, defaulting to None; when present, it is passed to the model when instances are created.

  7. Change the behavior of sdv.Sampler._make_model_from_params so that, after the parameters have been retrieved from the parent_row, they are transformed into a dictionary, passed to sdv.Sampler.unflatten_dict, and the result passed to model.from_dict.
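A minimal sketch of the two helpers (assuming string keys and that '__' never appears inside a key name):

def flatten_dict(nested, prefix=''):
    """Collapse a nested dict into a single level, joining keys with '__'."""
    flat = {}
    for key, value in nested.items():
        new_key = f'{prefix}__{key}' if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_dict(value, new_key))
        else:
            flat[new_key] = value
    return flat

def unflatten_dict(flat):
    """Rebuild the nested dict by splitting keys on '__'."""
    nested = {}
    for key, value in flat.items():
        *parents, leaf = key.split('__')
        target = nested
        for part in parents:
            target = target.setdefault(part, {})
        target[leaf] = value
    return nested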

Update SDV to work with Copula updates

  • Copulas changed the names of the classes that are referenced by SDV.
  • Currently we get the following error when running SDV:
    ModuleNotFoundError: No module named 'copulas.multivariate.GaussianCopula'

Enforce coding standards.

Fix Python standards violations in the project, such as:

  • Invalid file names.
  • Docstrings improperly formatted.

Also:

  • Remove unused files.
  • Delete unused variables
  • Refactor repeated chunks of code
  • Make sure make test-all passes without issues

Sample Children Using Copulas

Description

Users should be able to generate rows for child tables. These tables should have foreign keys that refer to primary keys actually generated by parents.

Create sampling logic w/ dummy values

Description

Add the ability to sample tables recursively, using random values instead of having the model generate them.

Add TravisCI

Add TravisCI to run builds after each commit, merge and PR.

Set Copulas as dependency

Description

The Copulas library needs to be a dependency. We should be able to use Copulas in SDV.

Modeler parameter not being used (?)

  • SDV version: 0.1.0
  • Python version: 3.6
  • Operating System: Fedora release 28 (Twenty Eight)

Description

I was trying to use the univariate KDE. To do that, I set the distribution parameter of the sdv.Modeler constructor to copulas.univariate.KDEUnivariate. The fitted modeler still uses copulas.univariate.GaussianUnivariate.

What I Did

I ran the following code:

from copulas.univariate import KDEUnivariate

from sdv import CSVDataLoader
from sdv import Modeler
from sdv import Sampler

# Load and transform the demo data, then fit a modeler that should be
# using KDE for the univariate distributions.
data_loader = CSVDataLoader('boston.json')
dn = data_loader.load_data()
dn.transform_data()
modeler = Modeler(dn, distribution=KDEUnivariate)
modeler.model_database()
sampler = Sampler(dn, modeler)

I checked the distribution of the TAX feature: in the synthetic data it follows a Gaussian distribution, while in the original data it isn't Gaussian. To check that, I looked into both the modeler and the following KDE plots:

[KDE plots of the TAX distribution in the original and synthetic data]

If you want to run the code, you can use the attached CSV and JSON files.

boston-data.zip

NaNs in Covariance matrix for parent models

Description

If you model the database, the parent models receive data with NaNs, and then end up with covariance matrices that have NaNs. This makes sampling impossible.

Two possible causes:

  1. Some parent primary keys are never referenced, so the extension is null.
  2. The covariance matrix for different conditional data may have different sizes. This is likely a bug in copulas where numpy.cov is taking the rows as variables instead of the columns (see the snippet below).
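The suspected numpy.cov behavior is easy to demonstrate; by default it treats each row as a variable, so rowvar=False is needed when the variables are the columns:

import numpy as np

data = np.random.randn(100, 3)           # 100 observations of 3 variables
print(np.cov(data).shape)                # (100, 100) -- rows taken as variables
print(np.cov(data, rowvar=False).shape)  # (3, 3) -- one entry per column pair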

Minor issues after code review

  • Repeated string values should be defined as module-level constants (e.g. 'GENERATED_PRIMARY_KEY' in sdv.Modeler).

  • In sdv.DataNavigator.DataNavigator: delete the getter methods and access the attributes directly.

  • Change dict lookups to dict.get calls where possible.

  • Delete sdv.DataNavigator.DataNavigator.__init__: it simply calls super, so it does nothing by itself, and the call to super happens through inheritance anyway.

  • Delete the repeated methods sdv.Modeler.get_model and sdv.Modeler._get_model.

  • In if statements, change comparisons against an empty set to a check on the object itself: if self.attribute instead of if self.attribute == set().

Prepare 0.1.0 release

This issue includes all the tasks that need to be done before the release of the 0.1.0 version:

  1. Installation works on a clean environment using make dist and installing the resulting tarball.
  2. Build passes with make test-all.
  3. Documentation includes the necessary steps for installation and usage, a minimal API reference and a contributing guide.
  4. The README examples work perfectly and reflect the latest changes made to the project.

Synthesize rows given some restrictions

  • SDV version: 0.1.0
  • Python version: 3.6
  • Operating System: Fedora release 28 (Twenty Eight)

Description

In the master's thesis and the old documentation, it is stated that users are able to sample from tables according to arbitrary conditions on certain features. In the current version, I can't find anything like this in the documentation.

What I Did

I looked into the documentation and the source code.

Notes

I might be wrong, but the SDV team is probably waiting on a PR for this: sdv-dev/Copulas#47

Add documentation/reference for meta.json

Currently there is no documentation about what a meta.json file should contain for a given dataset. Minimal documentation should contain:

  • Reference for datasets
  • Reference for fields
  • Examples for simple cases
  • Links to external sources (RDT and Dataset Manager)

Ignore foreign key when modelling

Description

When creating a copula model to get the conditional data, the foreign key column should be ignored: all of its values are the same, which would mess up the copula model (see the sketch below).
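A minimal sketch of the fix (the column name and data are made up for illustration):

import pandas as pd
from copulas.multivariate import GaussianMultivariate

# 'parent_id' stands in for the constant foreign key column.
conditional_data = pd.DataFrame({
    'parent_id': [7, 7, 7],
    'amount': [1.0, 2.5, 3.2],
    'count': [4, 1, 2],
})

# Drop the constant column before fitting so it cannot distort the model.
model = GaussianMultivariate()
model.fit(conditional_data.drop(columns=['parent_id']))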

Create Flatten Model Function

Description

Given a Copula model, there should be a function that converts its parameters into an array (flattens the model).
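A rough sketch of what that could look like; the covariance and means attribute names are assumptions, since the real parameter layout depends on the copulas version:

import numpy as np

def flatten_model(model):
    """Concatenate the copula parameters into a single flat array."""
    cov = np.asarray(model.covariance).ravel()   # assumed attribute name
    means = np.asarray(model.means)              # assumed attribute name
    return np.concatenate([cov, means])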
