sidharthmacherla / conjurer

R Package to generate synthetic data.

Home Page: https://foyi.co.nz/documentation-of-r-package-conjurer/

License: MIT License

R 85.83% TeX 14.17%

dummy-data-generator r synthetic-dataset-generation rpackage synthetic-data synthetic-data-generation synthetic-tabular-data

conjurer's Issues

New feature: Hierarchy

Currently, product details include only the SKU number and price. Add more details such as product category, subcategory and sub-subcategory.

Bug: Generating names based on custom training data doesn’t work

Generating names with buildNames() results in an error message if one specifies a custom data frame of names. Here’s a reprex:

library(conjurer)
df_names = data.frame(names = c("Oliver", "Jack", "Harry"),
                      stringsAsFactors = FALSE)
new_names = buildNames(df_names, numOfNames = 3, minLength = 5, maxLength = 7)
#> Error in unlist(alphaList, use.names = FALSE) : 
#> object 'alphaList' not found

R version 3.6.3 (2020-02-29)
Platform: x86_64-suse-linux-gnu (64-bit)
Running under: openSUSE Tumbleweed

Matrix products: default
BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so

locale:
[1] C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] conjurer_1.1.0

loaded via a namespace (and not attached):
[1] compiler_3.6.3

New feature: Spectral data

Generate spectral data to be used for signals such as EEG.

Currently, this package can generate string data (buildNames), numeric data (buildNum) and alphanumeric data (buildCust, buildProd). This new feature should be able to generate data that follows a pattern, including special characters. Some use cases for this are phone numbers and passwords.
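One way to picture pattern-based generation is a small template expander. The sketch below is only an illustration: genPattern and its template alphabet ('d' for a digit, 'a' for a lowercase letter, 'A' for an uppercase letter) are hypothetical and not part of conjurer.

```r
# Hypothetical helper, not a conjurer function.
# Template alphabet (an assumption for this sketch):
#   'd' = random digit, 'a' = random lowercase letter, 'A' = random uppercase letter.
# Any other character (brackets, dashes, spaces, ...) is copied literally.
genPattern <- function(template, n = 1) {
  chars <- strsplit(template, "")[[1]]
  vapply(seq_len(n), function(i) {
    out <- vapply(chars, function(ch) {
      switch(ch,
             d = sample(as.character(0:9), 1),
             a = sample(letters, 1),
             A = sample(LETTERS, 1),
             ch)  # default: keep the literal character
    }, character(1))
    paste(out, collapse = "")
  }, character(1))
}

set.seed(1)
phones <- genPattern("(ddd) ddd-dddd", n = 3)
passwords <- genPattern("Aaaa#dddd", n = 2)
```

The same helper covers both use cases mentioned above: the phone number template keeps its brackets and dash, and the password template keeps the literal '#'.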

Protect master branch

As the package is maturing 🚀 and the number of installations has crossed 5K 🥳, the author would like to work towards allowing contributions via pull requests in the future. This means that the necessary checks must be in place to ensure that the master branch is always protected. Currently, the administrator can commit to master without any review. Add more protection to the master branch.

Model outcome based generation

This package currently enables generating data based on the descriptive statistics. A new approach could be to generate data based on a model performance measure.
Example:
Given a model type such as Logistic Regression, a model performance measure such as R², and the variable types, distributions etc., the corresponding data must be generated. This means that if the generated data is used to build the same model type, its performance must be close to the requested value.
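As a rough illustration of the idea, the sketch below generates data for a simple linear regression so that the fitted model reproduces a requested R². It uses ordinary linear regression as the simplest case rather than the logistic model named above, and the values of target_r2 and slope are hypothetical. The noise is made orthogonal to the predictor and then rescaled so that the ratio of explained to total variance equals the target.

```r
set.seed(123)
n <- 500
target_r2 <- 0.7  # hypothetical requested model performance
slope <- 2        # hypothetical slope of the underlying relationship

x <- rnorm(n)
signal <- slope * x

# Draw raw noise and keep only the part that x (and the intercept) cannot explain.
raw <- rnorm(n)
noise <- unname(resid(lm(raw ~ x)))

# Rescale so that var(signal) / (var(signal) + var(noise)) equals target_r2.
noise <- noise * sqrt(var(signal) * (1 - target_r2) / (target_r2 * var(noise)))

y <- signal + noise

# Fitting the same model type on the generated data recovers the requested R².
fit <- lm(y ~ x)
achieved_r2 <- summary(fit)$r.squared
```

Because the noise is orthogonal to x, achieved_r2 matches target_r2 up to floating point error. Drawing independent noise with the same variance instead would give an R² that only approximates the target.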

Code: Add slope and independent variable ranges to indepAndDep

Currently, buildModelData only sources slopes from the model object. Add the intercept and the range information of the independent variable.

Publish article

Publish an article to share the details of this package.

New feature: Reconstruct matrix

This feature allows reconstructing the matrix/data frame based on the number of clusters.
💡 Given eigenvalues and eigenvectors, build the matrix.
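A minimal sketch of the eigen-based reconstruction in base R, assuming a symmetric matrix so that the eigenvectors are orthonormal. The example matrix and the number of retained components k are hypothetical. How k would relate to the number of clusters is left open here.

```r
# Small symmetric example matrix (hypothetical values).
A <- matrix(c(4, 1, 2,
              1, 3, 0,
              2, 0, 5), nrow = 3, byrow = TRUE)

# eigen() returns the eigenvalues in $values and the eigenvectors as columns of $vectors.
e <- eigen(A, symmetric = TRUE)

# Full reconstruction: A = V %*% diag(lambda) %*% t(V)
A_rebuilt <- e$vectors %*% diag(e$values) %*% t(e$vectors)

# Approximate reconstruction from the k leading eigenpairs only.
k <- 2
V_k <- e$vectors[, 1:k, drop = FALSE]
A_k <- V_k %*% diag(e$values[1:k], nrow = k) %*% t(V_k)

# The full reconstruction matches A up to floating point error.
all.equal(A, A_rebuilt)
```

For a non-symmetric matrix the transpose would have to be replaced by the inverse of the eigenvector matrix, so this sketch does not carry over unchanged.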

Allocate products to hierarchy based on probability

The documentation suggests that the products can be mapped to the hierarchy randomly, i.e. evenly. This needs to change to Pareto-based mapping.
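The difference between even and skewed mapping can be sketched with base R's sample() and its prob argument. This is not how conjurer allocates products. The category names, the power-law weights (used here as a stand-in for a Pareto-style skew) and the exponent alpha are all assumptions made for illustration.

```r
set.seed(42)
products <- paste0("sku", 1:20)
categories <- c("cat1", "cat2", "cat3", "cat4")

# Hypothetical skewed weights: the weight of a category falls off as 1 / rank^alpha,
# so a few categories receive most of the products.
alpha <- 1.5
weights <- 1 / seq_along(categories)^alpha
prob <- weights / sum(weights)

# Allocate each product to one category using the skewed probabilities.
allocation <- data.frame(
  product = products,
  category = sample(categories, size = length(products), replace = TRUE, prob = prob),
  stringsAsFactors = FALSE
)

# Count how many products landed in each category.
table(allocation$category)
```

Leaving out the prob argument gives the even mapping described above, which makes the two behaviours easy to compare.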

Start a pythonic implementation of conjurer

The R package installations have crossed 5K 🥳 on CRAN. The author believes that there could be a wider adoption of this approach if it is scaled to other languages. Start building a Python package that is a replica of the R package.

Add Citation

After the methodology document is published, add a citation to point to that document instead of the standard R package citation information.

New feature: Graph data

Generating graph data is helpful in multiple use cases, e.g. neural network problems and route optimization problems. Currently, the genTree function generates a complete m-ary undirected graph.

License information is hidden on GitHub

GitHub usually shows the license type in the About section of the repository. For this repository, there is just View license, pointing to https://github.com/SidharthMacherla/conjurer/blob/0d303273aa60fb7fe2791cfa7ee15d02cd9e9f67/LICENSE.

Looking at its history (https://github.com/SidharthMacherla/conjurer/commits/master/LICENSE), af925b5#diff-c693279643b8cd5d248172d9c22cb7cf4ed163a3c98c8a3f69c2717edd3eacb7 basically erased all valuable license information from this file.

The license is still mentioned inside the README (although referring to the aforementioned LICENSE file for details, with the file being rather sparse on details).

It would be nice if the LICENSE file actually contained the corresponding information, with GitHub being able to correctly show the license (again) automatically.

New feature: Build String and Numeric data

Currently, a customer is identified by a customer ID. Add more details such as name, email ID, age, gender, marital status, phone number and national identity number (e.g. SSN). A sequence generator could be built so that it generates phone numbers but can also be used in the medical sciences domain to generate gene sequences. Such capabilities will make the package generic enough to be used in multiple domains.

Add medical/biological use case in documentation

Although the package claims to be useful in multiple domains, the documentation speaks only about a retail use case. Add a use case from medical/biological sciences as well.

New feature: Spatial data

A transaction currently does not have any store or online details. Add details such as store ID or online details.

Semi supervised approach

This package currently uses a supervised generation approach. Plan for a semi-supervised approach.

Add Quantity to the transaction

Maybe use the base price as a mean and allow the user to set the standard deviation. I'll work on this as time permits.

Add pricing information in vignette

Currently, product price is generated but is not used as part of the use case in the vignette. This needs to be updated.

New feature: Promotional data

Build promotional information. It is common for businesses to promote products, and promotions usually offer a discounted price for the product. Businesses also track the impact of these promotions on sales. A new feature is needed to mark each transaction as on or off promotion. Promotional data could also be used as treatment data in the medical domain.

New feature: Temporal data

Generate time data. In conjunction with spatial data, this could enable spatio-temporal data.

Methodology

Add a methodology vignette to explain the methodology for each function in a more detailed way. Add a DOI for that document.

New feature: Natural language

This feature request is to generate natural language. An example use case is customer review data for products.

New feature: add options to buildHierarchy

Currently, buildHierarchy uses the type equalSplit. Only complete m-ary graphs (e.g. binary, ternary) are generated. This function needs to be enhanced with the options manual and automatic, so that trees with unequal splits can be generated. The changes need to be made at the gen function level, which hands its output over to the build function level.
