borealisai / private-data-generation

A toolbox for differentially private data generation

License: Other

Shell 0.97% Python 99.03%

differential-privacy generative-adversarial-network generative-models graphical-models

private-data-generation's Issues

RuntimeError: Sizes of tensors must match except in dimension 0.

Hi,

I got this error message when running pate-gan with the breast cancer dataset.

Traceback (most recent call last):
  File "evaluate.py", line 144, in <module>
    lap_scale=opt.lap_scale, class_ratios=class_ratios, lr=1e-4))
  File "/content/private-data-generation/models/pate_gan.py", line 104, in train
    fake = self.generator(torch.cat([z.double(), category], dim=1))
RuntimeError: Sizes of tensors must match except in dimension 0. Got 45 and 64 (The offending index is 0)

I have made sure to drop all NaN values, and all values in the dataset are continuous.
Could you please shed some light on the issue?

Here is the command I used to run pate-gan:

python evaluate.py --target-variable='target' \
  --train-data-path=./data/breast_processed_train.csv \
  --test-data-path=./data/breast_processed_test.csv \
  --normalize-data pate-gan --enable-privacy \
  --target-epsilon=1

Is the example output real?

Hi there,

First, I wanted to say this is fantastic work. I'm looking forward to using it on some projects.

I've just run your example code:
python evaluate.py --target-variable='income' --train-data-path=./data/adult_processed_train.csv --test-data-path=./data/adult_processed_test.csv --normalize-data dp-wgan --enable-privacy --sigma=0.8 --target-epsilon=8

but my results are much lower than your example output.

AUC scores of downstream classifiers on test data :
LR: 0.3808226623159139
Random Forest: 0.501662624031914
Neural Network: 0.43066009020256046
GaussianNB: 0.5190902722941861
GradientBoostingClassifier: 0.5755160128038637

Results were obtained on epoch 243, here's the final console output before training stopped:

Epoch : 283 Loss D real : 0.011110783401113983 Loss D fake : 0.010858841290446964 Loss G : 0.010988074410009374 Epsilon spent : 8.001855949312862

Any ideas why my output results are much lower and how I can fix this?

I also ran into another issue: the parser failed to pass the target variable to the pandas DataFrames that hold the train and test data in evaluate.py. I worked around it by replacing all instances of opt.target_variable with 'income'. I'm not sure whether the two issues are linked, so I thought I would mention it.
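
One possible explanation for the target-variable problem, offered as an assumption rather than a diagnosis: shells such as Windows cmd.exe do not treat single quotes as quoting characters, so --target-variable='income' would reach argparse as the literal string 'income', quotes included, and a column lookup under that name would fail. The sketch below reproduces this with a stand-in parser (the parser setup and the clean_column_name helper are hypothetical, not the code in evaluate.py) and strips stray quotes before the value is used.

```python
import argparse


def build_parser():
    # Stand-in parser with only the option discussed here (hypothetical setup).
    parser = argparse.ArgumentParser()
    parser.add_argument("--target-variable", required=True)
    return parser


def clean_column_name(raw):
    # Remove surrounding whitespace and quotes that the shell left in place.
    return raw.strip().strip("'\"")


# Simulate a shell that passes the single quotes through unchanged.
opt = build_parser().parse_args(["--target-variable='income'"])
raw_value = opt.target_variable  # argparse turns dashes into underscores
target = clean_column_name(raw_value)
print(repr(raw_value), "->", repr(target))
```

If the value printed on the left still carries quotes in a real run, cleaning it once after parsing avoids hard-coding the column name throughout the script.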

Problems when installing the required packages

I'm installing the requirements from requirements.txt in a conda environment that contains the following packages:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
_openmp_mutex             5.1                       1_gnu
bzip2                     1.0.8                h7b6447c_0
ca-certificates           2023.08.22           h06a4308_0
ld_impl_linux-64          2.38                 h1181459_1
libffi                    3.4.4                h6a678d5_0
libgcc-ng                 11.2.0               h1234567_1
libgomp                   11.2.0               h1234567_1
libstdcxx-ng              11.2.0               h1234567_1
libuuid                   1.41.5               h5eee18b_0
ncurses                   6.4                  h6a678d5_0
openssl                   3.0.12               h7f8727e_0
pip                       23.3            py310h06a4308_0
python                    3.10.13              h955ad1f_0
readline                  8.2                  h5eee18b_0
setuptools                68.0.0          py310h06a4308_0
sqlite                    3.41.2               h5eee18b_0
tk                        8.6.12               h1ccaba5_0
tzdata                    2023c                h04d1e81_0
wheel                     0.41.2          py310h06a4308_0
xz                        5.4.2                h5eee18b_0
zlib                      1.2.13               h5eee18b_0

However, it always encounters problems when trying to install pandas.

Collecting pandas==0.20.3 (from -r requirements.txt (line 6))
  Using cached pandas-0.20.3.tar.gz (10.4 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [15 lines of output]
      /tmp/pip-install-rm3spd3q/pandas_066c7fe344da4499ae83440107a7e8eb/setup.py:39: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
        import pkg_resources
      /home/quan-tran/anaconda3/envs/private-data-gen/lib/python3.10/site-packages/setuptools/__init__.py:84: _DeprecatedInstaller: setuptools.installer and fetch_build_eggs are deprecated.
      !!

              ********************************************************************************
              Requirements should be satisfied by a PEP 517 installer.
              If you are using pip, you can try `pip install --use-pep517`.
              ********************************************************************************

      !!
        dist.fetch_build_eggs(dist.setup_requires)
      error in pandas setup command: 'install_requires' must be a string or list of strings containing valid project/version requirement specifiers; Expected end or semicolon (after version specifier)
          pytz >= 2011k
               ~~~~~~~^
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Has anyone encountered the same problem? If so, how did you solve it?

I wanted to ask whether the steps counter in dp_wgan.py should be reset to 0 after every epoch before the RDP is computed. Right now, steps keeps increasing across epochs, so the RDP computed after each epoch also includes the sampled Gaussian mechanism (SGM) steps from all previous epochs. Is this intentional?
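
As background for this question: under RDP, privacy costs add up across composed steps, so an accountant that reports the total epsilon spent so far needs the cumulative step count. The sketch below illustrates this with the plain (non-subsampled) Gaussian mechanism, which is a deliberate simplification and not the accountant used in dp_wgan.py. The function names and the values of sigma, delta and steps_per_epoch are assumptions chosen for illustration.

```python
import math


def gaussian_rdp(sigma, steps, alpha):
    # RDP of `steps` composed Gaussian mechanisms (sensitivity 1, noise std sigma).
    # Each step costs alpha / (2 * sigma**2); composition adds the costs.
    return steps * alpha / (2.0 * sigma ** 2)


def epsilon_from_rdp(sigma, steps, delta, alphas):
    # Convert RDP to (epsilon, delta)-DP and keep the best order alpha.
    return min(
        gaussian_rdp(sigma, steps, a) + math.log(1.0 / delta) / (a - 1.0)
        for a in alphas
    )


sigma, delta = 0.8, 1e-5   # illustrative values
steps_per_epoch = 100      # hypothetical
alphas = [1.0 + x / 10.0 for x in range(1, 1000)]

# Accounting with a running step count versus a count reset every epoch.
cumulative = [
    epsilon_from_rdp(sigma, steps_per_epoch * epoch, delta, alphas)
    for epoch in (1, 2, 3)
]
reset_each_epoch = [
    epsilon_from_rdp(sigma, steps_per_epoch, delta, alphas) for _ in (1, 2, 3)
]
print(cumulative)
print(reset_each_epoch)
```

With a running count the reported epsilon grows from epoch to epoch. With a reset it stays constant and describes a single epoch only, so it could not be compared against a total budget such as --target-epsilon.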

category in model input

Hi, why do the generator and discriminator inputs include the category tensor? I don't understand why torch.multinomial is used here. If category represents labels, why is it passed to the generator and discriminator?
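
A common reason for this pattern, stated here as an assumption about the design rather than a confirmed answer, is label conditioning: the generator receives a sampled class label alongside the noise so that each synthetic row comes with a label, and the discriminator receives the same label to judge the row in that context. The pure-Python sketch below mimics the idea, with random.choices standing in for torch.multinomial. The function name, the one-hot encoding, and all sizes are hypothetical.

```python
import random


def sample_conditional_inputs(class_ratios, batch_size, z_dim, rng):
    # Draw one class label per row, with probability given by class_ratios.
    n_classes = len(class_ratios)
    labels = rng.choices(range(n_classes), weights=class_ratios, k=batch_size)
    rows = []
    for label in labels:
        z = [rng.gauss(0.0, 1.0) for _ in range(z_dim)]  # noise vector
        one_hot = [1.0 if i == label else 0.0 for i in range(n_classes)]
        rows.append(z + one_hot)  # concatenate along the feature dimension
    return labels, rows


rng = random.Random(0)
labels, gen_input = sample_conditional_inputs(
    [0.75, 0.25], batch_size=64, z_dim=8, rng=rng
)
print(len(gen_input), len(gen_input[0]))
```

Noise and labels are built with the same number of rows here, which concatenation along the feature dimension requires.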

Regarding the hyper-parameter sigma when computing epsilon

In the RDP accountant function, the noise multiplier is defined as the scale of the Gaussian noise added to perturb the gradients, divided by the sensitivity. That gives (sigma * C) / (2C), where sigma is the noise hyper-parameter and C is the clipping coefficient. However, your RDP function receives only sigma as the noise multiplier. Should this be corrected, or am I missing something?
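
To make the arithmetic in the question concrete, the sketch below computes the noise multiplier under two different sensitivity assumptions. Which one applies depends on the neighbouring-dataset definition and on how gradients are clipped in the repository, which this sketch does not settle. The helper name and the numeric values are illustrative assumptions.

```python
def noise_multiplier(noise_std, sensitivity):
    # Ratio of the Gaussian noise standard deviation to the L2 sensitivity.
    return noise_std / sensitivity


sigma, clip = 0.8, 1.0  # illustrative values; `clip` plays the role of C

# Assumption A: the sensitivity equals the clipping bound C.
m_a = noise_multiplier(sigma * clip, clip)
# Assumption B: the sensitivity equals 2C, as argued in the question.
m_b = noise_multiplier(sigma * clip, 2 * clip)
print(m_a, m_b)
```

Under assumption A, passing sigma directly to the accountant is consistent. Under assumption B, the value passed would need to be sigma / 2.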

AUC differs from the listed results

Hi, I ran the demo:
python evaluate.py --target-variable='income' --train-data-path=./data/adult_processed_train.csv --test-data-path=./data/adult_processed_test.csv --normalize-data dp-wgan --enable-privacy --sigma=0.8 --target-epsilon=8

But I get very low AUC scores:
AUC scores of downstream classifiers on test data :
LR: 0.3407119900995038
Random Forest: 0.2610576777879843
Neural Network: 0.3578788366713348
GaussianNB: 0.49315881768882325
GradientBoostingClassifier: 0.2606982987000339

May I know how to run the code to get your listed result?