featurecloud-synthetic-data-app's Introduction

Synthetic Data FeatureCloud App

Description

A FeatureCloud app that generates synthetic data with the Synthetic Data Vault (SDV) Python library [1].

Input

  • data.txt containing the original dataset (columns: features; rows: samples)

Output

  • synthetic_data.csv containing the synthetic dataset generated with the given parameters.

Workflow

This app can be combined with the following apps:

  • Post:
    • Preprocessing apps (e.g. Cross-validation, Normalization ...)
    • Various analysis apps (e.g. Logistic Regression, Linear Regression ...)

Config

Use the config file to set the parameters for synthetic data generation. Upload it together with the data to be synthesized.

fc_synthetic_data: 
  local_dataset:
    data: data.txt
    sep: ","
  synthetic_data_vault:
    model: GaussianCopula
    number_of_rows: 300
    synthetize_fields:
      - age
      - workclass
      - education
      - education-num 
      - marital-status
      - occupation 
      - relationship
      - race
      - sex
      - capital-gain
      - capital-loss
      - hours-per-week
      - native-country
      - prediction
    categorical_fields:
      - workclass
      - education
      - education-num 
      - marital-status
      - relationship
      - race
      - sex
      - native-country
      - prediction
    anonymize_fields:
      - occupation : job 
  result:
    file: synthetic_data.csv

The config file lets you specify the following:

  • the model for generating synthetic data. The options are GaussianCopula, CTGAN, TVAE, and CopulaGAN. The default model is GaussianCopula.
  • the number of rows to generate. If not specified, the default value is the number of rows in the original dataset.
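The two defaults described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the app's actual code: the function name `resolve_sdv_settings`, the constants, and the idea that the config arrives as an already parsed dictionary are all hypothetical.

```python
from typing import Any, Dict

# Model names as listed in this README; the constant names are illustrative.
SUPPORTED_MODELS = ("GaussianCopula", "CTGAN", "TVAE", "CopulaGAN")
DEFAULT_MODEL = "GaussianCopula"


def resolve_sdv_settings(config: Dict[str, Any], n_original_rows: int) -> Dict[str, Any]:
    """Apply the documented defaults to the synthetic_data_vault section."""
    section = (config.get("fc_synthetic_data") or {}).get("synthetic_data_vault") or {}

    # Use GaussianCopula when no model is given.
    model = section.get("model") or DEFAULT_MODEL
    if model not in SUPPORTED_MODELS:
        raise ValueError(f"Unsupported model {model!r}; expected one of {SUPPORTED_MODELS}")

    # Fall back to the size of the original dataset when number_of_rows is missing.
    number_of_rows = section.get("number_of_rows")
    if number_of_rows is None:
        number_of_rows = n_original_rows

    return {"model": model, "number_of_rows": int(number_of_rows)}


# Hypothetical parsed config with both options omitted.
settings = resolve_sdv_settings(
    {"fc_synthetic_data": {"synthetic_data_vault": {}}}, n_original_rows=120
)
print(settings)
```

With both options omitted, the sketch falls back to GaussianCopula and to the row count of the original dataset (120 in this made-up call).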

Similarly, under the option synthetize_fields, the user can specify the columns to be synthesized, and under the option categorical_fields, the user can specify which columns are categorical. The data types of the other fields are inferred automatically.
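As a rough illustration of restricting the input to the listed columns, the sketch below reads delimited text and keeps only the requested fields. It assumes the data file has a header row with the column names and uses the `sep` value from the config as the delimiter; the helper `select_fields` and the sample rows are hypothetical and not taken from the app.

```python
import csv
import io
from typing import Dict, List, Sequence


def select_fields(text: str, sep: str, fields: Sequence[str]) -> List[Dict[str, str]]:
    """Read delimited text with a header row and keep only the listed columns."""
    reader = csv.DictReader(io.StringIO(text), delimiter=sep)
    available = reader.fieldnames or []
    missing = [name for name in fields if name not in available]
    if missing:
        raise KeyError(f"Columns not found in the dataset: {missing}")
    # Values stay as strings here; type inference is left to the modelling step.
    return [{name: row[name] for name in fields} for row in reader]


# Hypothetical three-column sample; only two columns are kept.
sample = "age,workclass,fnlwgt\n39,State-gov,77516\n50,Private,83311\n"
rows = select_fields(sample, sep=",", fields=["age", "workclass"])
print(rows)
```

Failing early on a missing column name is a design choice of this sketch; it surfaces typos in the field lists before any model is fitted.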

Furthermore, under the option anonymize_fields, the user can generate fake data, with the same statistical properties, for fields that contain Personally Identifiable Information (PII). To do this, give the name of the field and its category, as shown in the configuration example (occupation : job). For the available categories, see the Python Faker documentation.
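In the example config, anonymize_fields is a list of one-key entries (field name, then category). A YAML parser would typically load such a list as a list of small dictionaries. A minimal sketch of flattening that list into a single field-to-category mapping follows; the helper name `anonymize_mapping` is hypothetical, and how the app hands the mapping to SDV is not shown in this README.

```python
from typing import Dict, Iterable, Mapping


def anonymize_mapping(entries: Iterable[Mapping[str, str]]) -> Dict[str, str]:
    """Merge a list of one-key mappings into a single field -> category dict."""
    merged: Dict[str, str] = {}
    for entry in entries:
        for field, category in entry.items():
            # Strip stray whitespace around names and categories.
            merged[str(field).strip()] = str(category).strip()
    return merged


# Parsed form of the example entry "- occupation : job" (assumed shape).
mapping = anonymize_mapping([{"occupation": "job"}])
print(mapping)
```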

For more information, we refer the reader to the SDV Documentation.

Resources

[1] N. Patki, R. Wedge, and K. Veeramachaneni, "The Synthetic Data Vault," IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2016, pp. 399-410, doi: 10.1109/DSAA.2016.49.
