
sdv's Introduction


This repository is part of The Synthetic Data Vault Project, a project from DataCebo.

Overview

The Synthetic Data Vault (SDV) is a Python library designed to be your one-stop shop for creating tabular synthetic data. The SDV uses a variety of machine learning algorithms to learn patterns from your real data and emulate them in synthetic data.

Features

🧠 Create synthetic data using machine learning. The SDV offers multiple models, ranging from classical statistical methods (GaussianCopula) to deep learning methods (CTGAN). Generate data for single tables, multiple connected tables or sequential tables.

📊 Evaluate and visualize data. Compare the synthetic data to the real data against a variety of measures. Diagnose problems and generate a quality report to get more insights.

🔄 Preprocess, anonymize and define constraints. Control data processing to improve the quality of synthetic data, choose from different types of anonymization, and define business rules in the form of logical constraints.

Important Links
Tutorials Get some hands-on experience with the SDV. Launch the tutorial notebooks and run the code yourself.
📖 Docs Learn how to use the SDV library with user guides and API references.
📙 Blog Get more insights about using the SDV, deploying models and our synthetic data community.
Community Join our Slack workspace for announcements and discussions.
💻 Website Check out the SDV website for more information about the project.

Install

The SDV is publicly available under the Business Source License. Install SDV using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.

pip install sdv
conda install -c pytorch -c conda-forge sdv

Getting Started

Load a demo dataset to get started. This dataset is a single table describing guests staying at a fictional hotel.

from sdv.datasets.demo import download_demo

real_data, metadata = download_demo(
    modality='single_table',
    dataset_name='fake_hotel_guests')

Single Table Metadata Example

The demo also includes metadata: a description of the dataset, including the data types of each column and the primary key (guest_email).
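As an illustration, dumping the metadata to a dict might look roughly like this (an abridged, hypothetical sketch; the exact columns and fields depend on your SDV version):

metadata.to_dict()
# {
#     'primary_key': 'guest_email',
#     'columns': {
#         'guest_email': {'sdtype': 'email', 'pii': True},
#         'room_type': {'sdtype': 'categorical'},
#         'amenities_fee': {'sdtype': 'numerical'},
#         'checkin_date': {'sdtype': 'datetime'},
#         ...
#     }
# }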

Synthesizing Data

Next, we can create an SDV synthesizer, an object that you can use to create synthetic data. It learns patterns from the real data and replicates them to generate synthetic data. Let's use the GaussianCopulaSynthesizer.

from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=real_data)

And now the synthesizer is ready to create synthetic data!

synthetic_data = synthesizer.sample(num_rows=500)

The synthetic data will have the following properties:

  • Sensitive columns are fully anonymized. The email, billing address and credit card number columns contain new data, so you don't expose the real values.
  • Other columns follow statistical patterns. For example, the proportion of room types, the distribution of check-in dates and the correlations between room rate and room type are preserved.
  • Keys and other relationships are intact. The primary key (guest email) is unique for each row. If you have multiple tables, the connections between primary and foreign keys make sense.

Evaluating Synthetic Data

The SDV library allows you to evaluate the synthetic data by comparing it to the real data. Get started by generating a quality report.

from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    real_data,
    synthetic_data,
    metadata)
Generating report ...

(1/2) Evaluating Column Shapes: |████████████████| 9/9 [00:00<00:00, 1133.09it/s]|
Column Shapes Score: 89.11%

(2/2) Evaluating Column Pair Trends: |██████████████████████████████████████████| 36/36 [00:00<00:00, 502.88it/s]|
Column Pair Trends Score: 88.3%

Overall Score (Average): 88.7%

This object computes an overall quality score on a scale of 0 to 100% (100 being the best), as well as detailed breakdowns.
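As a sketch of one way to drill down (recent SDV versions expose a get_details method on the report object; the exact API may vary by version):

# Inspect a single property of the report; this returns a DataFrame
# with one score per column.
details = quality_report.get_details(property_name='Column Shapes')
print(details.head())

For more insights, you can also visualize the synthetic vs. real data.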

from sdv.evaluation.single_table import get_column_plot

fig = get_column_plot(
    real_data=real_data,
    synthetic_data=synthetic_data,
    column_name='amenities_fee',
    metadata=metadata
)
    
fig.show()

[Figure: Real vs. Synthetic Data]

What's Next?

Using the SDV library, you can synthesize single table, multi table and sequential data. You can also customize the full synthetic data workflow, including preprocessing, anonymization and adding constraints.

To learn more, visit the SDV Demo page.

Credits

Thank you to our team of contributors who have built and maintained the SDV ecosystem over the years!

View Contributors

Citation

If you use SDV for your research, please cite the following paper:

Neha Patki, Roy Wedge, Kalyan Veeramachaneni. The Synthetic Data Vault. IEEE DSAA 2016.

@inproceedings{
    SDV,
    title={The Synthetic data vault},
    author={Patki, Neha and Wedge, Roy and Veeramachaneni, Kalyan},
    booktitle={IEEE International Conference on Data Science and Advanced Analytics (DSAA)},
    year={2016},
    pages={399-410},
    doi={10.1109/DSAA.2016.49},
    month={Oct}
}



The Synthetic Data Vault Project was first created at MIT's Data to AI Lab in 2016. After 4 years of research and traction with enterprise, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data, including:

  • 🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
  • 🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular, multi table and time series data.
  • 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data generation models.

Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.

sdv's People

Contributors

amontanez24, arashakhgari, aylr, csala, deathn0t, dyuliu, fealho, frances-h, github-actions[bot], gsheni, jdtheripperpc, katxiao, kveerama, lajohn4747, ludovicc, manuelalvarezc, npatki, pvk-developer, r-palazzo, rollervan, rwedge, sarahmish, sdv-team, srinify, tssbas, xamm

sdv's Issues

Separate Data Loading into another class

Description

  • Create another class responsible for loading data and returning a DataNavigator object (see the sketch below).
  • Remove all loading logic from DataNavigator and move it into the new class.
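A minimal sketch of the proposed split (the meta.json layout shown here, a 'tables' list with 'name' and 'path' keys, is assumed; DataNavigator is the existing class):

import json
import pandas as pd

class CSVDataLoader:
    """Owns all the I/O and hands back a ready-to-use DataNavigator."""

    def __init__(self, meta_filename):
        self.meta_filename = meta_filename

    def load_data(self):
        with open(self.meta_filename) as meta_file:
            meta = json.load(meta_file)
        # Read every table listed in the metadata into a dataframe.
        tables = {
            table['name']: pd.read_csv(table['path'])
            for table in meta['tables']
        }
        # DataNavigator keeps the navigation logic only, no loading.
        return DataNavigator(meta, tables)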

Rename Variables

Description

In DataNavigator:
transformed -> transformed_data
_parse_data -> _parse_meta_data

In Modeler:
model_type -> tuple of the overall model type name and a list of parameters, e.g. ('GaussianCopula', ['GaussianUnivariate'])
sets -> conditional_data

SDV Modeler Index issue

  • Lines 54, 63 and 64 of Modeler should be changed to use iloc, because the index of a dataframe doesn't always start at 0 and increase sequentially.
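A quick illustration of the difference in plain pandas:

import pandas as pd

# The label at position 2 is not the label 2, so .loc and .iloc disagree:
df = pd.DataFrame({'a': [10, 20, 30]}, index=[7, 2, 9])
print(df.loc[2, 'a'])     # 20 -- label-based lookup
print(df.iloc[2]['a'])    # 30 -- position-based lookup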

Use PyPI versions of Copulas and RDT

This week two of the project's dependencies have released new versions. We should check that everything works fine with the new versions and update the project dependencies accordingly.

Add support for modelling multi-parent tables

Currently SDV is not able to model or sample a table with multiple foreign keys, whether they point to different tables or to the same table repeated.

We should find a way to model and sample such tables.

README fixes

After running the README step by step I found some issues that need to be fixed:

  • Format Python snippets as such, instead of as bash.

  • In the install instructions, replace the conda instructions with vanilla venv if an environment is really needed (we can just include the normal install-from-sources instructions).

  • In the code examples, replace import * with the concrete modules to import.

  • When showing the values of a dataframe:
    · Avoid using print, as it is redundant.
    · Don't print the whole dataframe; a transposed head (df.head(3).T) is more readable.

  • When users_meta (a nested dict) is obtained, it is displayed using print, which flattens it and makes its structure harder to understand. It would be better to display it without print, or to use pprint instead.

  • In save_model, create the models folder if it doesn't exist.

Even though nothing crashes, warnings arise at some points of the execution; solving them would be a plus:

>>> modeler.model_database()
/home/xino/.virtualenvs/sdv_mit/lib/python3.6/site-packages/numpy/lib/function_base.py:3082: RuntimeWarning: invalid value encountered in subtract
  X -= avg[:, None]
/home/xino/.virtualenvs/sdv_mit/lib/python3.6/site-packages/pandas/core/frame.py:5550: RuntimeWarning: Degrees of freedom <= 0 for slice
  baseCov = np.cov(mat.T)
/home/xino/.virtualenvs/sdv_mit/lib/python3.6/site-packages/numpy/lib/function_base.py:3088: RuntimeWarning: divide by zero encountered in double_scalars
  c *= 1. / np.float64(fact)
/home/xino/.virtualenvs/sdv_mit/lib/python3.6/site-packages/numpy/lib/function_base.py:3088: RuntimeWarning: invalid value encountered in multiply
  c *= 1. / np.float64(fact)
/home/xino/Pythia/MIT/SDV/sdv/Modeler.py:83: RuntimeWarning: '>' not supported between instances of 'str' and 'int', sort order is undefined for incomparable objects
  extended_table = extended_table.append(row, ignore_index=True)
>>> sampler.sample_all()
/home/xino/.virtualenvs/sdv_mit/src/copulas/copulas/multivariate/GaussianCopula.py:88: RuntimeWarning: covariance is not positive-semidefinite.
  samples = np.random.multivariate_normal(clean_mean, clean_cov, size=s)
/home/xino/.virtualenvs/sdv_mit/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:1907: RuntimeWarning: invalid value encountered in add
  lower_bound = self.a * scale + loc

Enforce data constraints

After this issue is solved, we should be ready to enforce data constraints on sampled data.

To implement them, constraints should be checked after the data is sampled and reverse-transformed, but before it is returned. This should happen in sdv.Sampler.sample_rows, as it is the common entry point to the sampling process for the three public methods. The roadmap should be as follows (a sketch comes after the list):

1. Create a method sdv.Sampler.check_constraints that takes a sampled and reverse-transformed dataframe and returns an array of indices corresponding to the rows that fulfill the constraints.

2. Modify the method sdv.Sampler.sample_rows so that, before returning the result, it checks that the data fulfill the constraints, discards the rows that fail, and samples again until it reaches the desired number of rows.
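A minimal sketch of that flow (check_constraints and the retry loop are hypothetical; _sample_batch stands in for the existing sample-and-reverse-transform step, and each constraint is assumed to be a callable returning a boolean mask):

import pandas as pd

def check_constraints(dataframe, constraints):
    """Return the indices of the rows that fulfill every constraint."""
    mask = pd.Series(True, index=dataframe.index)
    for constraint in constraints:
        mask &= constraint(dataframe)
    return dataframe.index[mask]

def sample_rows(_sample_batch, constraints, num_rows):
    """Sample, drop the rows that fail, and retry until num_rows are valid."""
    valid = []
    total = 0
    while total < num_rows:
        batch = _sample_batch(num_rows)
        batch = batch.loc[check_constraints(batch, constraints)]
        valid.append(batch)
        total += len(batch)
    return pd.concat(valid).head(num_rows).reset_index(drop=True)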

Remove Primary Key Requirement

Description

SDV currently requires that a primary key be defined for every table in the meta. This is not actually necessary, and the requirement should be removed.

Extended parameters not being passed up

Description

If a table has both a parent and a child, it currently isn't passing its added parameters up to its parent during modeling.

The fix should be on line 147.

Fix hyper_fit_transform call

RDT changed the method in hyper transformer from hyper_fit_transform to fit_transform. SDV still calls the old method on line 108 of DataNavigator.py. This should be changed to call the new method.

Remove hyper transformer dependency from Modeler

Description

  • In Modeler.py, hyper_transformer is used to clean tables before modeling (remove added NaNs). This dependency should be removed, and the imputing should be done within Modeler.py itself (see the sketch below).
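A minimal sketch of doing the imputing inside Modeler itself (column-mean imputation is just one option):

import pandas as pd

def impute_table(table):
    """Fill NaNs with the column means before modeling."""
    return table.fillna(table.mean(numeric_only=True))

# impute_table(pd.DataFrame({'a': [1.0, None, 3.0]}))  ->  a = [1.0, 2.0, 3.0]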

Improve copula parameters sampling

During the modeling of the database in sdv.Modeler, extensions are created for each row of the parent tables, containing the parameters to model the children tables.

At sampling time, these extensions are sampled too, and later the parameters are extracted and used to create the models that sample the children rows.

When creating new models from the sampled parameters, sometimes the models are created with inconsistent values. So far the following have been found:

  1. The sampled covariance matrix may not be positive-semidefinite, which is a requirement for the copulas.multivariate.GaussianMultivariate copula, and raises this warning:

    sdv_mit/lib/python3.6/site-packages/copulas/multivariate/gaussian.py:199: RuntimeWarning: covariance is not positive-semidefinite.
       samples = np.random.multivariate_normal(means, clean_cov, size=size)
    
  2. If by any chance the sampled value for the std of the copulas.univariate.GaussianUnivariate distribution is negative or zero, the generated samples will be np.nan.

Sample Parents Using Copulas

Description

Using the models generated by the modeler, we want to sample rows for parents. Every time a new row is sampled, the primary key and row should be stored, so that children can generate models for the primary keys.

Add Evaluation Metrics

Find a way to evaluate the output of SDV.

  • Time
  • Accuracy for numeric columns
  • Accuracy for categorical columns

Change the modeler save/load API.

Right now, the functionality to save/load a model is invoked like this:

modeler.save_model('demo_model')
modeler = sdv.utils.load_model('sdv/models/demo_model.pkl')

Here we are saving and loading the same model.
A few problems arise:

1. The input value of both functions should be the same, to avoid confusion.

2. The saving builds a path relative to the modeler file, while the loading uses the path as it comes. This can cause unexpected behavior for the end user. Could we take this value from a configuration file?

3. It makes little sense for a function that loads a class instance to live as a standalone function in a separate module, when it could be a classmethod on the Modeler class (see the sketch below).
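A sketch of the proposed symmetric API (hypothetical code, not the current implementation):

import pickle

class Modeler:

    def save(self, path):
        """Serialize this instance to the given path."""
        with open(path, 'wb') as output:
            pickle.dump(self, output)

    @classmethod
    def load(cls, path):
        """Deserialize a previously saved instance from the same path."""
        with open(path, 'rb') as stream:
            return pickle.load(stream)

# The same value goes in and comes out:
#   modeler.save('models/demo_model.pkl')
#   modeler = Modeler.load('models/demo_model.pkl')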

Ensure primary key uniqueness across different calls

Currently, primary keys are generated using the exrex module and the regex from the meta.json file. As implemented, if we sample a single time we are guaranteed that the primary keys will be unique; however, if we sample more than once, it is possible to obtain keys that were already returned in a previous call.

Should we ensure uniqueness in this scenario?
Note that if we do this, we will only be able to sample as many rows as the regex has distinct matches; afterwards we'll need a way to reset the database before sampling anything else.

For example, if we had a dataset consisting of a single table, with a single column, which is the primary key with regex [1-5]{1}, then the following could happen:

>>> ...
>>> first_samples = sampler.sample_all(num_rows=3)
>>> first_samples.T
   primary_key
0            1
1            2
2            3

# Then it is not guaranteed that, if we sample one more row, its primary key will be either 4 or 5
>>> second_sample = sampler.sample_all(num_rows=1)
>>> second_sample
   primary_key
0            3
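One possible sketch of uniqueness across calls, keeping track of every key already handed out (PrimaryKeyGenerator is hypothetical; exrex.generate enumerates the matches of a regex):

import exrex

class PrimaryKeyGenerator:

    def __init__(self, regex):
        self.regex = regex
        self.issued = set()

    def generate(self, num_keys):
        """Return num_keys regex matches that were never issued before."""
        keys = []
        for candidate in exrex.generate(self.regex):
            if candidate not in self.issued:
                self.issued.add(candidate)
                keys.append(candidate)
                if len(keys) == num_keys:
                    return keys
        raise ValueError('No unused matches left; the generator must be reset.')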

Add testing datasets with more complex relationships

The current dataset we are using for unit testing is quite simple, and I'm afraid some issues may arise when working with datasets with more complex relations.

My proposal is to add datasets with:

  • A single parent and multiple children
  • A child with multiple parents
  • Multiple multi-level relations (a child whose parents are also parents of some of the parents of the child,...)

This would help us catch edge cases we may not have considered.

Add copula models to modeler.py

Description

The modeler should now store copula models as it runs RCPA. RCPA should now add the flattened models to tables.

Generate multiple rows

Description

Generate more than one row at a time.

Add get_dataframe and get_metadata functions to DataNavigator

Currently, it is unclear how a user can access the dataframes or metadata for a specific table. Functions should be added to DataNavigator to make this easier (see the sketch below):

  • get_dataframe(table_name): returns the dataframe for the specified table
  • get_meta_data(table_name): returns the meta information for the specified table
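A minimal sketch, assuming DataNavigator keeps a tables dict that maps each table name to an object with data and meta attributes (the internal layout is an assumption):

def get_dataframe(self, table_name):
    """Return the dataframe for the specified table."""
    return self.tables[table_name].data

def get_meta_data(self, table_name):
    """Return the meta information for the specified table."""
    return self.tables[table_name].meta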

Fix NaN problem that happens when modelling

Description

Covariance matrices are being filled with NaNs. This is likely because the foreign key column is being modeled (all of its values are the same, which causes the column to become all NaNs when creating a copula from it).

What I Did

Running RCPA causes many of the copula models to end up with covariance matrices filled with NaNs.

Add support for Vine Copulas as a modeler.

It could be useful to add support for different models. To achieve that we should:

  1. Wait until this issue of Copulas is done and released.

  2. Update our requirements to work with the latest version of copulas.

  3. Add a new method sdv.Modeler.flatten_dict that gets a nested dictionary and returns it flattened (see the sketch after this list):

     >>> nested_dict
     {
         'one_attribute': 0,
         'nested_attribute': {
             'foo': 'bar'
         }
     }

     >>> sdv.Modeler.flatten_dict(nested_dict)
     {
         'one_attribute': 0,
         'nested_attribute__foo': 'bar'
     }
  4. Add a new method sdv.Sampler.unflatten_dict that does the exact opposite, that is:

     >>> assert nested_dict == sdv.Sampler.unflatten_dict(sdv.Modeler.flatten_dict(nested_dict))
     >>> assert flattened_dict == sdv.Modeler.flatten_dict(sdv.Sampler.unflatten_dict(flattened_dict))
  5. Change the behavior of sdv.Modeler.flatten_model so that it receives a model as input and returns a pandas.Series with the flattened model dict.

  6. Rename the distribution keyword of sdv.Modeler.__init__ to model_kwargs, defaulting to None; when present, it is passed to the model when instances are created.

  7. Change the behavior of sdv.Sampler._make_model_from_params so that, after the parameters have been retrieved from the parent_row, they are transformed into a dictionary, passed to sdv.Sampler.unflatten_dict, and the result passed to model.from_dict.
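A minimal sketch of the two helpers (assuming string keys and that '__' never appears inside a key name):

def flatten_dict(nested, prefix=''):
    """Collapse a nested dict into a single level, joining keys with '__'."""
    flat = {}
    for key, value in nested.items():
        new_key = f'{prefix}__{key}' if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_dict(value, new_key))
        else:
            flat[new_key] = value
    return flat

def unflatten_dict(flat):
    """Rebuild the nested dict by splitting keys on '__'."""
    nested = {}
    for key, value in flat.items():
        *parents, leaf = key.split('__')
        target = nested
        for part in parents:
            target = target.setdefault(part, {})
        target[leaf] = value
    return nested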

Update SDV to work with Copula updates

  • Copulas changed the names of the classes that are referenced by SDV.
  • Currently we get the following error when running SDV:
    ModuleNotFoundError: No module named 'copulas.multivariate.GaussianCopula'

Enforce coding standards.

Fix Python standards violations in the project, such as:

  • Invalid file names.
  • Docstrings improperly formatted.

Also:

  • Remove unused files.
  • Delete unused variables
  • Refactor repeated chunks of code
  • Make sure make test-all passes without issues

Sample Children Using Copulas

Description

Users should be able to generate rows for child tables. These tables should have foreign keys that refer to primary keys actually generated by parents.

Create sampling logic w/ dummy values

Description

Add the ability to sample tables recursively, using random values instead of having the model generate them.

Add TravisCI

Add TravisCI to run builds after each commit, merge and PR.

Set Copulas as dependency

Description

The Copulas library needs to be a dependency. We should be able to use Copulas in SDV.

Modeler parameter not being used (?)

  • SDV version: 0.1.0
  • Python version: 3.6
  • Operating System: Fedora release 28 (Twenty Eight)

Description

I was trying to use the univariate KDE. To do that, I set the distribution parameter of the sdv.Modeler constructor to copulas.univariate.KDEUnivariate. The fitted modeler still uses copulas.univariate.GaussianUnivariate.

What I Did

I ran the following code:

from copulas.univariate import KDEUnivariate

from sdv import CSVDataLoader
from sdv import Modeler
from sdv import Sampler

# Load and transform the demo data, then fit a modeler that should be
# using KDE for the univariate distributions.
data_loader = CSVDataLoader('boston.json')
dn = data_loader.load_data()
dn.transform_data()
modeler = Modeler(dn, distribution=KDEUnivariate)
modeler.model_database()
sampler = Sampler(dn, modeler)

I checked the distribution of the TAX feature: in the synthetic data it follows a Gaussian distribution, while in the original data it isn't Gaussian. To check that, I looked into both the modeler and the following KDE plots:

[KDE plots of the TAX distribution in the original and synthetic data]

If you want to run the code, you can use the attached CSV and JSON files.

boston-data.zip

NaNs in Covariance matrix for parent models

Description

If you model the database, the parent models receive data with NaNs, and then end up with covariance matrices that have NaNs. This makes sampling impossible.

Two possible causes:

  1. Some parent primary keys are never referenced, so the extension is null.
  2. The covariance matrix for different conditional data may have different sizes. This is likely a bug in copulas where numpy.cov is taking the rows as variables instead of the columns (see the snippet below).
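The suspected numpy.cov behavior is easy to demonstrate; by default it treats each row as a variable, so rowvar=False is needed when the variables are the columns:

import numpy as np

data = np.random.randn(100, 3)           # 100 observations of 3 variables
print(np.cov(data).shape)                # (100, 100) -- rows taken as variables
print(np.cov(data, rowvar=False).shape)  # (3, 3) -- one entry per column pair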

Minor issues after code review

  • Repeated string values should be defined as module-level constants (e.g. 'GENERATED_PRIMARY_KEY' in sdv.Modeler).

  • In sdv.DataNavigator.DataNavigator: delete the getter methods and access the attributes directly.

  • Change dict lookups to dict.get calls where possible.

  • Delete sdv.DataNavigator.DataNavigator.__init__: it simply calls super, so it does nothing by itself, and the call to super happens through inheritance anyway.

  • Delete the repeated methods sdv.Modeler.get_model and sdv.Modeler._get_model.

  • In if statements, change comparisons against an empty set to a check on the object itself: if self.attribute instead of if self.attribute == set().

Prepare 0.1.0 release

This issue includes all the tasks that need to be done before the release of the 0.1.0 version:

  1. Installation works on a clean environment using make dist and installing the resulting tarball.
  2. Build passes with make test-all.
  3. Documentation includes the necessary steps for installation and usage, a minimal API reference and a contributing guide.
  4. The README examples work perfectly and reflect the latest changes made to the project.

Synthesize rows given some restrictions

  • SDV version: 0.1.0
  • Python version: 3.6
  • Operating System: Fedora release 28 (Twenty Eight)

Description

In the master's thesis and the old documentation, it is stated that users are able to sample from tables according to arbitrary conditions on certain features. In the current version, I can't find anything like this in the documentation.

What I Did

I looked into the documentation and the source code.

Notes

I might be wrong, but the SDV team is probably waiting on a PR for this: sdv-dev/Copulas#47

Add documentation/reference for meta.json

Currently there is no documentation about what a meta.json file should contain for a given dataset. Minimal documentation should contain:

  • Reference for datasets
  • Reference for fields
  • Examples for simple cases
  • Links to external sources (RDT and Dataset Manager)

Ignore foreign key when modelling

Description

When creating a copula model to get the conditional data, the foreign key column should be ignored: all of its values are the same, which would mess up the copula model (see the sketch below).
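A minimal sketch of the fix (the column name and data are made up for illustration):

import pandas as pd
from copulas.multivariate import GaussianMultivariate

# 'parent_id' stands in for the constant foreign key column.
conditional_data = pd.DataFrame({
    'parent_id': [7, 7, 7],
    'amount': [1.0, 2.5, 3.2],
    'count': [4, 1, 2],
})

# Drop the constant column before fitting so it cannot distort the model.
model = GaussianMultivariate()
model.fit(conditional_data.drop(columns=['parent_id']))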

Create Flatten Model Function

Description

Given a Copula model, there should be a function that converts its parameters into an array (flattens the model).
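A rough sketch of what that could look like; the covariance and means attribute names are assumptions, since the real parameter layout depends on the copulas version:

import numpy as np

def flatten_model(model):
    """Concatenate the copula parameters into a single flat array."""
    cov = np.asarray(model.covariance).ravel()   # assumed attribute name
    means = np.asarray(model.means)              # assumed attribute name
    return np.concatenate([cov, means])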
