conjurer's Introduction

Documentation

The official documentation for conjurer is at foyi

Feedback

Please share your feedback and feature requests by filling in this two-question survey.

🔔 If you are looking for an easy-to-use GUI for generating synthetic data, please check out the app UnReal 🎉

Author

Sidharth Macherla

License

This project is licensed under the MIT License - see the LICENSE file for details.

Statement of Need

Data science applications need data to prototype and demonstrate to potential clients. For such purposes, using production data is a possibility. However, it is not always feasible due to legal and/or ethical considerations. This resulted in a need for generating synthetic data. This need is the key motivator for the package conjurer.

Data across multiple domains are known to exhibit some form of seasonality, cyclicality and trend. Although synthetic data generation packages are currently available, they focus primarily on synthetic versions of microdata containing confidential information, or on machine learning purposes. There is a need for a more generic synthetic data generation package that serves multiple purposes such as forecasting, customer segmentation and insight generation. The package conjurer helps in generating such synthetic data.
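
To make the idea of trend and seasonality concrete, here is a minimal base R sketch of a synthetic daily series. It does not use conjurer's functions; the variable names and all parameter values are illustrative assumptions.

```r
# Sketch: a daily series built from trend + yearly seasonality + noise.
# Names and parameter values are assumptions chosen for illustration only.
set.seed(42)
n_days <- 730
day <- seq_len(n_days)

trend <- 100 + 0.05 * day                     # slow linear growth
seasonality <- 10 * sin(2 * pi * day / 365)   # one full cycle per year
noise <- rnorm(n_days, mean = 0, sd = 2)      # random day-to-day variation

synthetic <- data.frame(day = day, sales = trend + seasonality + noise)
head(synthetic)
```

Plotting synthetic$sales against synthetic$day should show an upward drift with a repeating yearly wave on top of it.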

Installation instructions

First, install R from here.

From the R console, install the package using the following code:

install.packages('conjurer')
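
After installing, a quick check such as the following sketch confirms that the package can be found before it is loaded. It uses only base R functions; the variable name has_conjurer is an illustrative assumption.

```r
# Check whether conjurer is available in the current library paths.
has_conjurer <- requireNamespace("conjurer", quietly = TRUE)

if (has_conjurer) {
  library(conjurer)
  message("conjurer version: ", as.character(packageVersion("conjurer")))
} else {
  message("conjurer is not installed; run install.packages('conjurer') first.")
}
```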

Example usage

The package page on CRAN (Comprehensive R Archive Network) is here. The reference manual is here. The package vignette, with detailed usage documentation and illustrative examples, is here.

Community guidelines

For guidelines regarding code contributions, refer to CONTRIBUTING. For guidelines on reporting security vulnerabilities, refer to SECURITY.

conjurer's Issues

New feature: Promotional data

Build promotional information. Businesses commonly promote products, often by offering a discounted price, and they track the impact of these promotions on sales. A new feature is needed to mark each transaction as on or off promotion. Promotional data could also serve as treatment data in the medical domain.
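
One possible shape for this feature is sketched below in base R. This is not an existing conjurer function; the column names, the promotion share and the discount rate are all assumptions made for illustration.

```r
# Sketch: flag transactions as on or off promotion and apply a discount.
# promo_share, discount and base_price are hypothetical parameters.
set.seed(1)
n_txn <- 1000
base_price <- 20
promo_share <- 0.3   # assumed share of transactions on promotion
discount <- 0.25     # assumed discount rate while on promotion

on_promo <- rbinom(n_txn, size = 1, prob = promo_share) == 1

txn <- data.frame(
  txn_id = seq_len(n_txn),
  on_promo = on_promo,
  price = ifelse(on_promo, base_price * (1 - discount), base_price)
)
table(txn$on_promo)
```

The same on/off flag could be read as a treatment indicator in a medical setting, with the price column replaced by an outcome variable.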

New feature: Reconstruct matrix

This feature reconstructs the matrix/dataframe based on the number of clusters.
💡 Given eigenvalues and eigenvectors, build the matrix.
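
The underlying linear algebra can be sketched in base R as follows. For a symmetric matrix S with eigenvectors V and eigenvalues lambda, S equals V diag(lambda) V^T, and keeping only the first k terms gives a rank-k approximation. Treating k as the quantity the issue calls "number of clusters" is an assumption here, as are all variable names.

```r
# Sketch: rebuild a symmetric matrix from its eigenvalues and eigenvectors.
set.seed(7)
x <- matrix(rnorm(200), nrow = 50, ncol = 4)
s <- cov(x)                          # symmetric 4 x 4 covariance matrix

e <- eigen(s, symmetric = TRUE)
v <- e$vectors
lambda <- e$values

# Full reconstruction: V diag(lambda) V^T
full <- v %*% diag(lambda) %*% t(v)

# Rank-k reconstruction using only the k largest eigenvalues
k <- 2
v_k <- v[, 1:k, drop = FALSE]
approx_k <- v_k %*% diag(lambda[1:k], nrow = k) %*% t(v_k)

max(abs(full - s))                   # numerically close to zero
```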

New feature: Build String and Numeric data

Currently, a customer is identified by a customer ID. Add more details such as name, email ID, age, gender, marital status, phone number and national identity number (e.g. SSN). A sequence generator could be built to generate phone numbers, and the same generator could be used in the medical sciences domain to generate gene sequences. Such capabilities will make the package generic enough to be used in multiple domains.
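
A generic sequence generator of the kind described could look like the base R sketch below. The function gen_sequence and its arguments are hypothetical and not part of conjurer; the only idea illustrated is that one sampler can serve both use cases by swapping the alphabet.

```r
# Sketch: draw n random sequences of a fixed length from a given alphabet.
# gen_sequence is a hypothetical helper, not a conjurer function.
gen_sequence <- function(n, len, alphabet) {
  vapply(seq_len(n), function(i) {
    paste(sample(alphabet, size = len, replace = TRUE), collapse = "")
  }, character(1))
}

set.seed(3)
phones <- gen_sequence(3, 10, as.character(0:9))     # digit strings
genes <- gen_sequence(2, 12, c("A", "C", "G", "T"))  # nucleotide strings
phones
genes
```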

New feature: Graph data

Generating graphs is helpful in multiple use cases, e.g. neural network problems and route optimization problems. Currently, the genTree function generates a complete m-ary undirected graph.
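
For readers unfamiliar with the term, a complete m-ary tree can be written down as an edge list, as in this base R sketch. It is not the genTree implementation; the function name and the breadth-first node numbering are assumptions.

```r
# Sketch: edge list of a complete m-ary tree with the given depth.
# Nodes are numbered breadth-first starting at 1 (the root).
complete_mary_edges <- function(m, depth) {
  stopifnot(m >= 1, depth >= 1)
  # Nodes that have children: every node on levels 0 .. depth - 1.
  n_internal <- sum(m^(0:(depth - 1)))
  data.frame(
    parent = rep(seq_len(n_internal), each = m),
    child = seq_len(n_internal * m) + 1
  )
}

edges <- complete_mary_edges(m = 2, depth = 3)
edges
```

With m = 2 and depth = 3 this yields a binary tree of 15 nodes and 14 edges. Other graph types would need a different edge-generation rule.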

New feature: Spatial data

A transaction currently does not have any store or online details. Add details such as store ID or online details.

New feature: Pattern

Currently, this package can generate string data (buildNames), numeric data (buildNum) and alphanumeric data (buildCust, buildProd). This new feature should be able to generate a pattern, i.e. data that includes special characters. Some use cases for this are phone numbers and passwords.
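
One way such a pattern feature might behave is sketched below in base R. The function gen_pattern and its placeholder convention ("#" for a digit, "@" for an uppercase letter, everything else copied literally) are assumptions made for this example, not an existing conjurer interface.

```r
# Sketch: expand a pattern string into random values.
# "#" becomes a random digit, "@" a random uppercase letter;
# any other character (including special characters) is kept as-is.
gen_pattern <- function(pattern, n = 1) {
  tokens <- strsplit(pattern, "", fixed = TRUE)[[1]]
  one <- function() {
    out <- vapply(tokens, function(tok) {
      switch(tok,
        "#" = sample(as.character(0:9), 1),
        "@" = sample(LETTERS, 1),
        tok)
    }, character(1), USE.NAMES = FALSE)
    paste(out, collapse = "")
  }
  vapply(seq_len(n), function(i) one(), character(1))
}

set.seed(5)
phone_like <- gen_pattern("(###) ###-####", n = 4)
phone_like
```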

Model outcome based generation

This package currently generates data based on descriptive statistics. A new approach could be to generate data based on a model performance measure.
Example:
Given a model type (such as logistic regression), a model performance measure (such as R2), and the variable types, distributions etc., the package should generate the corresponding data. If the generated data is then used to build the same model type, its performance should be similar to what was requested.
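
To illustrate the idea in its simplest form, the base R sketch below targets a given R2 for a one-variable linear regression rather than the logistic regression named above. The function gen_for_r2 and its arguments are hypothetical. With Var(x) = 1, the population R2 is beta^2 / (beta^2 + sigma^2), which can be solved for the noise standard deviation sigma.

```r
# Sketch: generate (x, y) so that lm(y ~ x) has roughly the requested R^2.
# gen_for_r2 is a hypothetical helper; linear regression is used here
# as a simpler stand-in for the model types mentioned in the issue.
gen_for_r2 <- function(n, target_r2, beta = 1) {
  stopifnot(target_r2 > 0, target_r2 < 1)
  x <- rnorm(n)
  # Solve R^2 = beta^2 / (beta^2 + sigma^2) for sigma, assuming Var(x) = 1.
  sigma <- sqrt(beta^2 * (1 - target_r2) / target_r2)
  y <- beta * x + rnorm(n, sd = sigma)
  data.frame(x = x, y = y)
}

set.seed(11)
d <- gen_for_r2(5000, target_r2 = 0.7)
fit <- lm(y ~ x, data = d)
summary(fit)$r.squared   # close to 0.7, up to sampling error
```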

Protect master branch

As the package is maturing 🚀 and the number of installations has crossed 5K 🥳, the author would like to work towards allowing contributions via pull requests in the future. This means that the necessary checks must be in place to ensure that the master branch is always protected. Currently, the administrator can commit to master without any review. Add more protection to the master branch.

Start a pythonic implementation of conjurer

The R package installations have crossed 5K 🥳 on CRAN. The author believes that there could be a wider adoption of this approach if it is scaled to other languages. Start building a Python package that is a replica of the R package.

Methodology

Add a methodology vignette to explain the methodology for each function in a more detailed way. Add a DOI for that document.

License information is hidden on GitHub

GitHub usually shows the license type in the About section of the repository. For this repository, there is only a View license link, pointing to https://github.com/SidharthMacherla/conjurer/blob/0d303273aa60fb7fe2791cfa7ee15d02cd9e9f67/LICENSE.

Looking at its history (https://github.com/SidharthMacherla/conjurer/commits/master/LICENSE), af925b5#diff-c693279643b8cd5d248172d9c22cb7cf4ed163a3c98c8a3f69c2717edd3eacb7 basically erased all valuable license information from this file.

The license is still mentioned inside the README (although referring to the aforementioned LICENSE file for details, with the file being rather sparse on details).

It would be nice if the LICENSE file actually contained the corresponding information, with GitHub being able to correctly show the license (again) automatically.

New feature: Hierarchy

Currently, product details have SKU number and price. Add more details such as product category, sub category and sub sub category.

Add Citation

After the methodology document is published, add a citation that points to that document instead of the standard R package citation information.

Semi supervised approach

This package currently uses a supervised generation approach. Plan for a semi-supervised approach.

New feature: add options to buildHierarchy

Currently, buildHierarchy uses the type equalSplit, so only complete m-ary graphs (e.g. binary, ternary) are generated. This function needs to be enhanced with the options manual and automatic, so that trees with unequal splits can be generated. The changes need to be made at the gen function level, which hands its output over to the build function level.
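
As a rough illustration of what a manual option could accept, the base R sketch below builds a tree from user-supplied child counts per level. The function build_unequal_tree and its input format are assumptions and do not reflect how buildHierarchy or the gen functions are actually implemented.

```r
# Sketch: tree with unequal splits, returned as an edge list.
# splits[[k]] lists, for each node on level k (left to right),
# how many children that node has. The root has id 1.
build_unequal_tree <- function(splits) {
  edges <- list()
  current <- 1L
  next_id <- 2L
  for (level_splits in splits) {
    stopifnot(length(level_splits) == length(current))
    new_level <- integer(0)
    for (i in seq_along(current)) {
      n_child <- level_splits[i]
      kids <- seq_len(n_child) + next_id - 1L
      edges[[length(edges) + 1L]] <- data.frame(
        parent = rep(current[i], n_child),
        child = kids
      )
      new_level <- c(new_level, kids)
      next_id <- next_id + n_child
    }
    current <- new_level
  }
  do.call(rbind, edges)
}

# Root has 2 children; the first of them has 3 children, the second has 1.
tree <- build_unequal_tree(list(c(2), c(3, 1)))
tree
```

An automatic option could generate the splits list randomly and pass it to the same builder.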

Bug: Generating names based on custom training data doesn’t work

Generating names with buildNames() results in an error message if one specifies a custom data frame of names. Here’s a reprex:

library(conjurer)
df_names = data.frame(names = c("Oliver", "Jack", "Harry"),
                      stringsAsFactors = FALSE)
new_names = buildNames(df_names, numOfNames = 3, minLength = 5, maxLength = 7)
#> Error in unlist(alphaList, use.names = FALSE) : 
#> object 'alphaList' not found

R version 3.6.3 (2020-02-29)
Platform: x86_64-suse-linux-gnu (64-bit)
Running under: openSUSE Tumbleweed

Matrix products: default
BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so

locale:
[1] C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] conjurer_1.1.0

loaded via a namespace (and not attached):
[1] compiler_3.6.3
