differential-privacy-library's Introduction

Diffprivlib v0.6

Diffprivlib is a general-purpose library for experimenting with, investigating and developing applications in, differential privacy.

Use diffprivlib if you are looking to:

  • Experiment with differential privacy
  • Explore the impact of differential privacy on machine learning accuracy using classification and clustering models
  • Build your own differential privacy applications, using our extensive collection of mechanisms

Diffprivlib supports Python versions 3.8 to 3.12.

We're using the Iris dataset, so let's load it and perform an 80/20 train/test split.

from sklearn import datasets
from sklearn.model_selection import train_test_split

dataset = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.2)

Now, let's train a differentially private naive Bayes classifier. Our classifier runs just like an sklearn classifier, so you can get up and running quickly.

diffprivlib.models.GaussianNB can be run without any parameters, although this will throw a warning (we need to specify the bounds parameter to avoid this). The privacy level is controlled by the parameter epsilon, which is passed to the classifier at initialisation (e.g. GaussianNB(epsilon=0.1)). The default is epsilon = 1.0.

from diffprivlib.models import GaussianNB

clf = GaussianNB()
clf.fit(X_train, y_train)

We can now classify unseen examples, knowing that the trained model is differentially private and preserves the privacy of the 'individuals' in the training set (flowers are entitled to their privacy too!).

clf.predict(X_test)

Every time the model is trained with .fit(), a different model is produced due to the randomness of differential privacy. The accuracy will therefore change, even if it's re-trained with the same training data. Try it for yourself to find out!

print("Test accuracy: %f" % clf.score(X_test, y_test))

We can easily evaluate the accuracy of the model for various epsilon values and plot it with matplotlib.

import numpy as np
import matplotlib.pyplot as plt

epsilons = np.logspace(-2, 2, 50)
bounds = ([4.3, 2.0, 1.1, 0.1], [7.9, 4.4, 6.9, 2.5])
accuracy = list()

for epsilon in epsilons:
    clf = GaussianNB(bounds=bounds, epsilon=epsilon)
    clf.fit(X_train, y_train)
    
    accuracy.append(clf.score(X_test, y_test))

plt.semilogx(epsilons, accuracy)
plt.title("Differentially private Naive Bayes accuracy")
plt.xlabel("epsilon")
plt.ylabel("Accuracy")
plt.show()

[Figure: Differentially private naive Bayes]

Congratulations, you've completed your first differentially private machine learning task with the Differential Privacy Library! Check out more examples in the notebooks directory, or dive straight in.

Contents

Diffprivlib comprises four major components:

  1. Mechanisms: These are the building blocks of differential privacy, and are used in all models that implement differential privacy. Mechanisms have little or no default settings, and are intended for use by experts implementing their own models. They can, however, be used outside models for separate investigations, etc.
  2. Models: This module includes machine learning models with differential privacy. Diffprivlib currently has models for clustering, classification, regression, dimensionality reduction and pre-processing.
  3. Tools: Diffprivlib comes with a number of generic tools for differentially private data analysis. This includes differentially private histograms, following the same format as Numpy's histogram function.
  4. Accountant: The BudgetAccountant class can be used to track privacy budget and calculate total privacy loss using advanced composition techniques.

Setup

Installation with pip

The library is designed to run with Python 3 and can be installed from PyPI using pip (or pip3):

pip install diffprivlib

Manual installation

For the most recent version of the library, either download the source code or clone the repository in your directory of choice:

git clone https://github.com/IBM/differential-privacy-library

To install diffprivlib, do the following in the project folder (alternatively, you can run python3 -m pip install .):

pip install .

The library comes with a basic set of unit tests for pytest. To check your install, you can run all the unit tests by calling pytest in the install folder:

pytest

Citing diffprivlib

If you use diffprivlib for research, please consider citing the following reference paper:

@article{diffprivlib,
  title={Diffprivlib: the {IBM} differential privacy library},
  author={Holohan, Naoise and Braghin, Stefano and Mac Aonghusa, P{\'o}l and Levacher, Killian},
  year={2019},
  journal = {ArXiv e-prints},
  archivePrefix = "arXiv",
  volume = {1907.02444 [cs.CR]},
  primaryClass = "cs.CR",
  month = jul
}

Acknowledgement

Work in this repository was partially supported by the European Union's Horizon 2020 research and innovation programme under grant number 951911 – AI4Media.

differential-privacy-library's People

Contributors

danrr, dgorelik, franziskuskiefer, imgbotapp, justanotherlad, miicheyang, mismayil, naoise-h, stefano81, stevemar

differential-privacy-library's Issues

Is the Wishart mechanism (and the LinearRegression model built on top of it) really private?

The documentation of the Wishart mechanism references the paper "Symmetric matrix perturbation for differentially-private principal component analysis" by Imtiaz and Sarwate.

Yet, two subsequent papers (at least) note that the analysis of this mechanism is incorrect and that it does not in fact guarantee differential privacy:

For some linear regression applications, I've tried replacing the Wishart noise with Laplace noise (as proposed in https://arxiv.org/pdf/1705.10829.pdf), and got consistently worse results, thus further suggesting that the noise added by the Wishart mechanism may be too optimistic.

Clarification needed: Is this library to be used strictly on lists of elements?

Hello writers and maintainers:

The documentation pointing to dp.tools has functions that act on some kind of aggregation of array or array_like "vectors". I wonder if there is a way to have the private outcome applied to individual elements of such a vector rather than some kind of aggregation like std, sum etc.

I am more than happy to contribute such a feature if you can advise whether it belongs here. My suggestion is along the lines of a utility function in tools/utils.py:

def individual_application(array, epsilon=1.0, accountant=None, ...):
  """works on each individual numeric item in the `array`.
  this could also take a dict instead of an `array`.
  """
  # implementation ...

Thank you,
Aman

Handling of NaNs

Currently, the presence of NaNs in a dataset produces a distinguishing event, as for most functions (other than nanmean, nanvar, nanstd) the output will always be NaN. Is the best solution for all functions to just ignore NaNs (like nanmean, etc)?

For single-dimensional problems, there should be no issue, as removing NaNs is a simple deterministic pre-processing step. For multi-dimensional problems, removing data rows may prove problematic for utility. Is there justification here for doing something fancier, like mapping to a value within the range?

inf may also need special consideration, although these can usually be overcome when clipping the data, as inf will clip to the upper bound, and -inf to the lower bound. NaN has no obvious value to map to. When the algorithm requires the norm of a row to be clipped (like LogisticRegression), mapping from inf to a value is no longer trivial. Do we map inf to a value that ensures the row's norm matches the clip, or do we also scale the rest of the row?
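To make the inf/NaN distinction concrete, here is a minimal standard-library sketch. It is not diffprivlib code: `clip_value` and `drop_nans` are hypothetical helper names, and the bounds and data are made up. It shows that clipping maps inf and -inf onto the bounds while NaN passes through unchanged, and that dropping NaNs first is a deterministic pre-processing step for one-dimensional data.

```python
import math

def clip_value(x, lower, upper):
    """Clip a single float to [lower, upper] using plain comparisons."""
    if x < lower:
        return lower
    if x > upper:
        return upper
    return x  # NaN ends up here: every comparison with NaN is False

def drop_nans(values):
    """Deterministic pre-processing: remove NaN entries from a 1-D list."""
    return [v for v in values if not math.isnan(v)]

lower, upper = 0.0, 10.0  # hypothetical bounds
data = [3.0, math.inf, -math.inf, math.nan, 12.5]

# Clipping alone: inf -> upper, -inf -> lower, NaN stays NaN.
clipped = [clip_value(v, lower, upper) for v in data]

# Dropping NaNs first leaves only values that clipping can handle.
cleaned = [clip_value(v, lower, upper) for v in drop_nans(data)]
```

For multi-dimensional data the same filter would discard whole rows, which is the utility concern raised above.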

Array index out of bounds in https://github.com/IBM/differential-privacy-library/blob/main/diffprivlib/tools/quantiles.py#L134

#48
However, in that case, len(probabilities) = len(array) - 1.

That still means that if rand becomes greater than all self._probabilities in line 169, i.e. when idx = len(self._probabilities) as of line 174 of the same file, then line 134 of https://github.com/IBM/differential-privacy-library/blob/main/diffprivlib/tools/quantiles.py#L134 should still give an "array index out of bounds" error because of the array[idx+1] reference, right? With idx = len(probabilities) and len(probabilities) = len(array) - 1, array[idx+1] is array[len(array)].
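The index arithmetic in the question can be checked in isolation. The sketch below does not reproduce the code in quantiles.py; the two lists are hypothetical and chosen only to match the lengths described above.

```python
# Hypothetical data, chosen only to match the lengths described above.
array = [0.0, 1.0, 2.0, 3.0, 4.0]
probabilities = [0.1, 0.2, 0.3, 0.4]   # len(array) - 1 entries

assert len(probabilities) == len(array) - 1

# Largest index for which array[idx + 1] is still valid.
last_safe_idx = len(probabilities) - 1
upper_edge = array[last_safe_idx + 1]  # last element of array

# The case raised in the question: idx == len(probabilities).
idx = len(probabilities)
try:
    array[idx + 1]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True
```

Whether idx can actually reach len(probabilities) depends on how the library selects it, which this sketch does not model.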

Implementation of Random forest model

The current version of the library only supports two types of classification models: Naive Bayes and Logistic Regression. It would be great to extend this to other popular models such as random forests and decision trees.

Factor of 2 in objective for DP logistic regression

Hi, I wanted to check about a factor of 2 in the implementation regarding the conversion \Lambda = 1 / (2 n C) as discussed in a previous PR #10

In Corollary 11 of [CMS18] their regularization term N(f) = 1/2 ||f||^2 so their proof takes into account the 1/2 factor in the objective formulation (that is, J = 1/n sum(loss) + 1/2 \Lambda ||f||^2 ), just like in the sklearn objective which is J= C*sum(loss) + 1/2 w^t w . So if this is the case then the conversion should be \Lambda = 1 / (n C), and \Lambda = alpha / n, which implies in your code 0.5 * alpha would just be alpha
e.g.

epsilon_p = self.epsilon - 2 * np.log(1 + self.function_sensitivity * self.data_sensitivity /
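The algebra behind the question can be written out as follows. This only restates the argument above, using the two objectives exactly as they are quoted in this issue; whether the paper's regulariser carries the factor of 1/2 should be checked against the paper itself. Dividing the sklearn objective by nC does not change its minimiser, and matching terms then gives the proposed conversion.

```latex
\begin{align*}
J_{\text{sklearn}}(w) &= C \sum_{i=1}^{n} \ell_i(w) + \tfrac{1}{2}\, w^\top w \\
\frac{1}{nC}\, J_{\text{sklearn}}(w) &= \frac{1}{n} \sum_{i=1}^{n} \ell_i(w) + \tfrac{1}{2} \cdot \frac{1}{nC}\, \lVert w \rVert^2 \\
J_{\text{CMS}}(f) &= \frac{1}{n} \sum_{i=1}^{n} \ell_i(f) + \tfrac{1}{2}\, \Lambda\, \lVert f \rVert^2
\quad\Longrightarrow\quad \Lambda = \frac{1}{nC}
\end{align*}
```

If the 1/2 were instead absent from the paper's regulariser, the same matching would give \Lambda = 1 / (2 n C).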

Thank you

[question] About the gamma distribution noise used in the vector mechanism

Hello and thank you for providing such a wonderful privacy preserving library.

It seems that the gamma distribution is used for producing noise to add for the vector mechanism, but reading the paper of Differentially Private Empirical Risk Minimization I could not find such logic. Is there a proof in past papers for gamma distribution noise providing differential privacy?

I looked at other issues posting questions, and thought this was the best way to ask you a question.
Excuse me if I am misusing the GitHub issues for this project, and thank you in advance.

Difficulty when running in parallel

Hello,

I'm trying to use mechanisms from diffprivlib and call their randomise methods in parallel, but I'm getting an error related to the random library. Below is a simple example using the joblib library (the same one used in diffprivlib.LogisticRegression) to execute it in parallel:

from joblib import Parallel, delayed
from diffprivlib.mechanisms import GeometricTruncated

mec = GeometricTruncated(epsilon=2, sensitivity=1, lower=1, upper=10)
path_func = delayed(mec.randomise)
res = list(Parallel(n_jobs=2, prefer='processes')(
    path_func(value) for value in range(10))
)

print(res)

I get the following error:

joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/ramon/anaconda3/lib/python3.8/site-packages/joblib/externals/loky/backend/queues.py", line 153, in _feed
    obj_ = dumps(obj, reducers=reducers)
  File "/home/ramon/anaconda3/lib/python3.8/site-packages/joblib/externals/loky/backend/reduction.py", line 271, in dumps
    dump(obj, buf, reducers=reducers, protocol=protocol)
  File "/home/ramon/anaconda3/lib/python3.8/site-packages/joblib/externals/loky/backend/reduction.py", line 264, in dump
    _LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
  File "/home/ramon/anaconda3/lib/python3.8/site-packages/joblib/externals/cloudpickle/cloudpickle_fast.py", line 563, in dump
    return Pickler.dump(self, obj)
  File "/home/ramon/anaconda3/lib/python3.8/random.py", line 196, in __reduce__
    return self.__class__, (), self.getstate()
  File "/home/ramon/anaconda3/lib/python3.8/random.py", line 735, in _notimplemented
    raise NotImplementedError('System entropy source does not have state.')
NotImplementedError: System entropy source does not have state.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "a.py", line 6, in <module>
    res = list(Parallel(n_jobs=2, prefer='processes')(
  File "/home/ramon/anaconda3/lib/python3.8/site-packages/joblib/parallel.py", line 1061, in __call__
    self.retrieve()
  File "/home/ramon/anaconda3/lib/python3.8/site-packages/joblib/parallel.py", line 940, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/ramon/anaconda3/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/home/ramon/anaconda3/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/ramon/anaconda3/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
_pickle.PicklingError: Could not pickle the task to send it to the workers.

I've already tried to use python multiprocessing.Pool but I also get the error System entropy source does not have state.
I've already tried to change the Parallel backend to "multiprocessing" instead of loky, but also get the same error.

I couldn't find solutions or related issues. Could you please help?

System settings used in the example:

  • OS: Ubuntu 21.10
  • python: 3.8.5
  • diffprivlib: 0.4.1
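The traceback points at random.SystemRandom, which has no internal state to serialise. One possible workaround, sketched here with the standard library only, is to keep the generator out of the pickled state and recreate it on the receiving side. `PicklableSecureRandom` and its `scale` attribute are hypothetical names used for illustration; this is not how diffprivlib mechanisms are implemented.

```python
import pickle
import random

class PicklableSecureRandom:
    """Hypothetical wrapper: holds a SystemRandom but never pickles it."""

    def __init__(self, scale=1.0):
        self.scale = scale
        self._rng = random.SystemRandom()

    def __getstate__(self):
        # Drop the OS-backed generator; it has no state to serialise.
        state = self.__dict__.copy()
        state.pop("_rng", None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Recreate a fresh OS-backed generator in the receiving process.
        self._rng = random.SystemRandom()

    def sample(self):
        return self.scale * self._rng.random()

# Pickling the raw generator is expected to fail, as in the traceback above.
try:
    pickle.dumps(random.SystemRandom())
    raw_pickle_failed = False
except (NotImplementedError, pickle.PicklingError, TypeError):
    raw_pickle_failed = True

# The wrapper survives a round trip and keeps its ordinary attributes.
restored = pickle.loads(pickle.dumps(PicklableSecureRandom(scale=2.0)))
sample = restored.sample()
```

Another option that avoids pickling altogether is to construct the object inside the worker function, so that nothing holding a SystemRandom crosses the process boundary.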

Privacy Proof for DP Logistic Regression

It appears as though the differentially private logistic regression algorithm differs slightly from the algorithm it is based on, proposed in the paper by Chaudhuri, Monteleoni and Sarwate.

Are the two algorithms equivalent? If not, is there a proof that shows the adapted algorithm is still differentially private?

Computing multiple quantiles always uses a new random state

When giving a list / array for quant in the diffprivlib.tools.quantile function, the given random_state is not passed along to the calls that the function makes under the hood, which means results vary between calls with the same random seed.

Reproduce

from diffprivlib.tools import quantile

import numpy as np

quantiles_1 = quantile([0, 1, 2, 3, 4], quant=[0.33, 0.66], random_state=0)
quantiles_2 = quantile([0, 1, 2, 3, 4], quant=[0.33, 0.66], random_state=0)
assert np.all(quantiles_1 == quantiles_2)

Currently this assertion fails, but the results should of course be identical between calls. I'm on version diffprivlib==0.6.2.

Fix
Although it's easy to fix this (pass random_state=random_state along with the call at line 117 of quantile.py) the current method uses more epsilon budget than is required when computing multiple quantiles. E.g. the algorithm in 'Differentially Private Quantiles' by Jennifer Gillenwater, Matthew Joseph, Alex Kulesza appears to score significantly better than independent exponential mechanism calls. Maybe this is something to consider?
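The reproducibility part of the report can be illustrated without the library. In this standard-library sketch, `noisy_value`, `noisy_values_unseeded` and `noisy_values_seeded` are hypothetical names, and the Gaussian noise is only a placeholder, not a differentially private mechanism. The point is the difference between accepting a seed and actually forwarding a seeded generator to every inner call.

```python
import random

def noisy_value(value, rng):
    """Hypothetical stand-in for one randomised call; uses the given generator."""
    return value + rng.gauss(0.0, 1.0)

def noisy_values_unseeded(values, seed=None):
    # Mirrors the reported behaviour: the seed is accepted but never
    # forwarded, so every inner call draws from a fresh generator.
    return [noisy_value(v, random.Random()) for v in values]

def noisy_values_seeded(values, seed=None):
    # One generator is built from the seed and shared by all inner calls.
    rng = random.Random(seed)
    return [noisy_value(v, rng) for v in values]

first = noisy_values_seeded([1.0, 2.0], seed=0)
second = noisy_values_seeded([1.0, 2.0], seed=0)

unseeded_a = noisy_values_unseeded([1.0, 2.0], seed=0)
unseeded_b = noisy_values_unseeded([1.0, 2.0], seed=0)
```

Sharing one generator, rather than re-seeding each inner call with the same integer, also keeps the noise added to different quantiles from being identical.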

Wishart mechanism should be removed as non-private

It seems diffprivlib still includes code and documentation for the Wishart mechanism which is known to be invalid. This was apparently previously addressed for linear regression in #23 and #25, but the invalid mechanism is still part of the code and listed in the documentation without any warnings.

diffprivlib failed to be installed in Windows 10 due to the crlibm dependency

Describe the bug
Cannot install diffprivlib on Windows because of the dependency on crlibm, which requires compilation on Windows.

To Reproduce
On Windows, pip install diffprivlib.
Make sure that gcc/g++ is installed and accessible in PATH (e.g., using MSYS2).

Expected behavior
The packaged diffprivlib is installed.

Screenshots

PS C:\projects\test> pip install crlibm --no-cache
Collecting crlibm
  Downloading crlibm-1.0.3.tar.gz (1.6 MB)
     |████████████████████████████████| 1.6 MB 6.4 MB/s
Building wheels for collected packages: crlibm
  Building wheel for crlibm (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: 'C:\Users\Shlomi\miniconda3\envs\synthflow\python.exe' -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\Shlomi\\AppData\\Local\\Temp\\pip-install-rc0mt5z4\\crlibm_02a44f0616a14f67b308a667aff2cb0d\\setup.py'"'"'; __file__='"'"'C:\\Users\\Shlomi\\AppData\\Local\\Temp\\pip-install-rc0mt5z4\\crlibm_02a44f0616a14f67b308a667aff2cb0d\\setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\Shlomi\AppData\Local\Temp\pip-wheel-c3zxz8jw'
       cwd: C:\Users\Shlomi\AppData\Local\Temp\pip-install-rc0mt5z4\crlibm_02a44f0616a14f67b308a667aff2cb0d\
  Complete output (42 lines):
  running bdist_wheel
  running build
  running build_ext
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "C:\Users\Shlomi\AppData\Local\Temp\pip-install-rc0mt5z4\crlibm_02a44f0616a14f67b308a667aff2cb0d\setup.py", line 189, in <module>
      setup(**metadata)
    File "C:\Users\Shlomi\miniconda3\envs\synthflow\lib\site-packages\setuptools\__init__.py", line 87, in setup
      return distutils.core.setup(**attrs)
    File "C:\Users\Shlomi\miniconda3\envs\synthflow\lib\site-packages\setuptools\_distutils\core.py", line 148, in setup
      return run_commands(dist)
    File "C:\Users\Shlomi\miniconda3\envs\synthflow\lib\site-packages\setuptools\_distutils\core.py", line 163, in run_commands
      dist.run_commands()
    File "C:\Users\Shlomi\miniconda3\envs\synthflow\lib\site-packages\setuptools\_distutils\dist.py", line 967, in run_commands
      self.run_command(cmd)
    File "C:\Users\Shlomi\miniconda3\envs\synthflow\lib\site-packages\setuptools\dist.py", line 1214, in run_command
      super().run_command(command)
    File "C:\Users\Shlomi\miniconda3\envs\synthflow\lib\site-packages\setuptools\_distutils\dist.py", line 986, in run_command
      cmd_obj.run()
    File "C:\Users\Shlomi\miniconda3\envs\synthflow\lib\site-packages\wheel\bdist_wheel.py", line 299, in run
      self.run_command('build')
    File "C:\Users\Shlomi\miniconda3\envs\synthflow\lib\site-packages\setuptools\_distutils\cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "C:\Users\Shlomi\miniconda3\envs\synthflow\lib\site-packages\setuptools\dist.py", line 1214, in run_command
      super().run_command(command)
    File "C:\Users\Shlomi\miniconda3\envs\synthflow\lib\site-packages\setuptools\_distutils\dist.py", line 986, in run_command
      cmd_obj.run()
    File "C:\Users\Shlomi\miniconda3\envs\synthflow\lib\site-packages\setuptools\_distutils\command\build.py", line 135, in run
      self.run_command(cmd_name)
    File "C:\Users\Shlomi\miniconda3\envs\synthflow\lib\site-packages\setuptools\_distutils\cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "C:\Users\Shlomi\miniconda3\envs\synthflow\lib\site-packages\setuptools\dist.py", line 1214, in run_command
      super().run_command(command)
    File "C:\Users\Shlomi\miniconda3\envs\synthflow\lib\site-packages\setuptools\_distutils\dist.py", line 986, in run_command
      cmd_obj.run()
    File "C:\Users\Shlomi\miniconda3\envs\synthflow\lib\site-packages\setuptools\_distutils\command\build_ext.py", line 305, in run
      self.compiler = new_compiler(compiler=self.compiler,
    File "C:\Users\Shlomi\miniconda3\envs\synthflow\lib\site-packages\setuptools\_distutils\ccompiler.py", line 1039, in new_compiler
      return klass(None, dry_run, force)
    File "C:\Users\Shlomi\AppData\Local\Temp\pip-install-rc0mt5z4\crlibm_02a44f0616a14f67b308a667aff2cb0d\setup.py", line 59, in __init__
      shared_option = "-shared" if self.ld_version >= "2.13" else "-mdll -static"
  AttributeError: 'Msys2CCompiler' object has no attribute 'ld_version'
  ----------------------------------------
  ERROR: Failed building wheel for crlibm

Root cause of the bug
I believe that the setup.py file of crlibm uses an old setuptools API. In particular, it looks for the attribute ld_version (reference), which no longer exists on distutils.cygwinccompiler.CygwinCCompiler (I couldn't track down the exact commit where this change was made, but this is the version I have: ref).

System information (please complete the following information):

  • OS: Windows 10 Pro
  • Python version: 3.8 (using miniconda environment but installation with pip)
  • Shell: Powershell
  • diffprivlib version or commit number

Privacy-preserving parametrisation

At present, when parameters such as bounds and data_norm are not specified by the user, the values are simply read from the data, a clear violation of differential privacy (and flagged as such with a PrivacyLeakWarning). It is possible, however, to estimate these parameters using histograms, thanks to their low sensitivity.

With clip_to_norm and clip_to_bounds available since diffprivlib 0.3, these estimations can be implemented more easily.

Similar implementations to this can be found here and here.

predict_proba method

The predict_proba method of logistic regression (and I think maybe for all models) simply outputs 1 for the predicted label and 0 otherwise.

Upper Bound of _var() in diffprivlib/tools/utils.py

Line 422 of diffprivlib/tools/utils.py sets the output of _var() function as:
output = np.minimum(dp_mech.randomise(actual_var), (upper - lower) ** 2)

This basically means, the upper bound of variance is set to (upper - lower) ** 2 .

However, Popoviciu's inequality on variances states that var<= ((upper - lower) ** 2)/4 .

Is there a reason we are avoiding this division by a factor of 4, or is this a bug?

During one of my runs, I got the differentially private variance as 54.3917, whereas the actual variance was 9.1667.
This seems a little off, with too much noise added.
Perhaps we can set the upper bound as ((upper - lower) ** 2)/4 with the guarantees from Popoviciu's inequality?
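Popoviciu's inequality is easy to check numerically. The sketch below uses only the standard library with made-up bounds and data; `popoviciu_bound` is a hypothetical helper name. It applies to the population variance (no degrees-of-freedom correction) of values lying in [lower, upper], and the bound is attained when half the values sit at each end.

```python
from statistics import pvariance

def popoviciu_bound(lower, upper):
    """Largest possible population variance for data in [lower, upper]."""
    return (upper - lower) ** 2 / 4

lower, upper = 0.0, 10.0                 # hypothetical bounds
tight_bound = popoviciu_bound(lower, upper)
loose_bound = (upper - lower) ** 2       # the cap quoted from _var() above

extreme = [lower] * 5 + [upper] * 5      # half the mass at each end
typical = [1.0, 4.0, 4.5, 6.0, 9.0]

var_extreme = pvariance(extreme)         # attains the tight bound
var_typical = pvariance(typical)
```

With these numbers the tight bound is a quarter of the loose one, so capping at the tight bound removes noisy outputs that no dataset within the bounds could produce.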

If this is a bug, I can send a PR to fix this. Awaiting response. Thanks!

Bug of LinearRegression

There is a bug at line 115 of the /diffprivlib/models/linear_regression.py

95 def _construct_regression_obj(X, y, bounds_X, bounds_y, epsilon, alpha):
          ...
114    for i in range(n_targets):
115        sensitivity = np.abs(bounds_y[i]).max() ** 2

The bounds_y here is a tuple of size 2.
I guess this was perhaps meant to be bounds_y[0][i], bounds_y[1][i].

Here is the output of the

[three screenshots]

Budget accounting for parallel composition

The BudgetAccountant class that is being introduced in the upcoming v0.3.0 currently only supports sequential composition. If we want to be able to use BudgetAccountant to track budget spend on the mechanism level (where the differential privacy operations actually take place), then we need to be able to support parallel composition (whereby the budget is determined as a max of the spent budgets, rather than the sum).
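The two composition rules mentioned above can be sketched for pure epsilon-differential privacy as follows. This is a standard-library illustration with hypothetical function names and spend values, not the BudgetAccountant API.

```python
def sequential_composition(epsilons):
    """Queries on the same data: privacy losses add up."""
    return sum(epsilons)

def parallel_composition(epsilons):
    """Queries on disjoint partitions of the data: the largest loss dominates."""
    return max(epsilons, default=0.0)

spends = [0.1, 0.5, 0.25]  # hypothetical per-query epsilon spends

total_sequential = sequential_composition(spends)
total_parallel = parallel_composition(spends)
```

The parallel rule only holds when each query touches a disjoint subset of the records, which is why it has to be tracked separately from sequential spends.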

Random Number Generation

It looks like the default numpy RNG is being used to generate random values in the primitive mechanisms (e.g. the Laplace mechanism and its derivatives). As Aaron Roth et al. point out in their recent paper Guidelines for Implementing and Auditing Differentially Private Systems (Section 5.1):

“Random numbers” should be generated by a cryptographically secure pseudo-random number generator (CSPRNG). Mersenne Twister is a commonly used pseudo-random number generator for statistical applications (in many libraries, it is the default) but is considered insecure for differential privacy.

The secrets module (described in detail in PEP 506) makes it relatively easy to access such sources of randomness in a system-agnostic way. Would it be possible to either change the default RNG used in this module or add an option that allows a user to specify which RNG they would like to use?

For what it's worth, the following snippet is a minimal working example of some code that can be used to generate samples from the Uniform(0,1) distribution via the secrets module:

import secrets

def gen_unifs(n):
  rng = secrets.SystemRandom()
  unifs = [rng.uniform(0,1) for _ in range(n)]
  return unifs

In fact, the secrets module isn't really necessary here. The same functionality can be achieved via the random module:

import random

def gen_unifs2(n):
  rng = random.SystemRandom()
  unifs = [rng.uniform(0,1) for _ in range(n)]
  return unifs

The secrets module may, however, provide other functions that make it easier and/or cleaner to securely generate large numbers of random samples efficiently.
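Building on the snippets above, here is a sketch of how Laplace noise could be drawn from the same OS-backed source using only the standard library. `secure_laplace` is a hypothetical name and this is not the library's implementation. It relies on the fact that the difference of two independent exponential variables with mean `scale` follows a Laplace(0, scale) distribution, and it addresses only the source of randomness, not floating-point subtleties of the sampled values.

```python
import random

def secure_laplace(scale, n, rng=None):
    """Draw n Laplace(0, scale) samples from an OS-backed generator.

    Uses the difference of two independent exponential draws, each
    with mean `scale` (rate 1 / scale).
    """
    rng = rng or random.SystemRandom()
    rate = 1.0 / scale
    return [rng.expovariate(rate) - rng.expovariate(rate) for _ in range(n)]

samples = secure_laplace(scale=2.0, n=20000)

# For Laplace(0, b), the expected absolute value equals b.
mean_abs = sum(abs(s) for s in samples) / len(samples)
```

Accepting an optional `rng` argument keeps the door open for the user-selectable generator requested above.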

Add a Differentially-Private AdaBoost model

Is your feature request related to a problem? Please describe.
I've been trying to understand the paper, "Efficient, Noise-Tolerant, and Private Learning via Boosting" and it is difficult as the work is purely theoretical.

Describe the solution you'd like
I want to have an actual implementation of the boosting framework presented in the paper above, for the purpose of furthering my own understanding.

Describe alternatives you've considered
So far it looks like TensorFlow Decision Forests has a module similar to what I'm looking for - but it looks like overkill to use it, subclass it, all to use a new kind of optimization algorithm (based on the paper I'm reading).

Additional context

  • I'm interested in understanding this library better, so I've actually started trying to implement this on my fork. So far it includes (what I think is) the code for the model and a demo notebook.
  • The main blockers I think I have right now are the following:
  1. reproducibility - when trying to run my current implementation (i.e., in adaboost.py) in the notebook (see cell 12 of plot_adaboost_twoclass.ipynb), it always seems to "bounce around" in training accuracy. E.g. 43% for the first run, then 47%, and eventually 0.00% within ~5 runs. I'm trying to understand if this is an expected result, and if not what can be done to fix it?
  2. testing - how should I go about writing tests for the adaboost.py?

Any other feedback/questions are appreciated!

Differential Private Percentiles

Hi,
I would like to know if there are any plans to support differentially private percentile calculations, or whether there is a current implementation that can be used.

Thanks

DP Random Forest Performance Issues

Describe the bug

I encountered two issues while using the diffprivlib random forest classifier:

  1. I want to compare the influence of epsilon on the performance of a DP random forest classifier with a non-DP random forest classifier. For the non-DP random forest I use scikit-learn's RandomForestClassifier. The issue is that even for extremely high epsilons (100 to 10,000) the diffprivlib random forest does not approach the scikit-learn random forest's performance.

  2. The diffprivlib random forest takes very long to train, even when setting 'n_jobs=-1'. The scikit-learn random forest trains in about 13 to 15 s, whereas the diffprivlib random forest takes about 25 to 30 min. Shouldn't both trainings take about the same time?

To Reproduce

Please find my code below to reproduce the described behaviour. I use the Kaggle dataset from here: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset

# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import diffprivlib as dpl

from sklearn import ensemble
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import preprocessing

# Data handling
dataset = pd.read_csv('../cardio_train.csv', sep=';')
dataset.head()

X_raw = dataset.drop(['id', 'cardio'], axis=1)
X_raw_cat = X_raw.drop(['age', 'height', 'weight', 'ap_hi', 'ap_lo'], axis=1)
X_raw_cont = X_raw[['age', 'height', 'weight', 'ap_hi', 'ap_lo']]
y_raw = dataset['cardio']

normalize = preprocessing.StandardScaler()
ordinal = preprocessing.OrdinalEncoder()

X_cont = normalize.fit_transform(X_raw_cont)
X_cat = ordinal.fit_transform(X_raw_cat)

X = np.concatenate([X_cont, X_cat], axis=1)
y = y_raw

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Non-DP RF
model = ensemble.RandomForestClassifier(n_estimators=1000, oob_score=True, n_jobs=-1, random_state=1302)
model.fit(X_train, y_train)

y_train_pred = model.predict(X_train)
y_train_pred_oob = np.argmax(model.oob_decision_function_, axis=1)
y_test_pred = model.predict(X_test)

# DP RF
epsilon = 10000.0
model_dp = dpl.models.RandomForestClassifier(n_estimators=1000, epsilon=epsilon, n_jobs=-1).fit(X_train, y_train)

y_dp_train_pred = model_dp.predict(X_train)
y_dp_test_pred = model_dp.predict(X_test)

# Comparison
print(
metrics.accuracy_score(y_train, y_train_pred),
model.oob_score_,
metrics.accuracy_score(y_test, y_test_pred),
metrics.accuracy_score(y_train, y_dp_train_pred),
metrics.accuracy_score(y_test, y_dp_test_pred))

Expected behavior

The expected behaviour would be for the DP model to approach the non-DP model's performance for high epsilons.

Screenshots

The models' training times: [screenshot]

The models' performances: [screenshot]

System information:

  • Linux 4.18.0 (x86-64)
  • Python 3.8.12
  • numpy 1.21.2
  • pandas 1.2.3
  • diffprivlib 0.5.1
  • sklearn 1.0.2

random forest bug

Hi, when I test the random forest on a small dataset using your library, I get no response, even after a full day. Could you help test or check?
[screenshot]

DP Random Forest Classifier failed when applying predict function

Describe the bug
When using a differentially private Random Forest Classifier, the predict and predict_proba functions fail with the following error: "ValueError: can only convert an array of size 1 to a Python scalar"

I'm using a dataset with both numeric and categorical columns that were already pre-processed; when the same process is applied to the Logistic Regression classifier, it works.

Expected behavior
Given an X_test dataset, I expected a 1-D array with the model predictions to be returned.

Screenshots
Works fine: [screenshot]

Fails with the error "ValueError: can only convert an array of size 1 to a Python scalar": [two screenshots]

System information (please complete the following information):

  • Windows
  • Python version 3.9.7
  • diffprivlib version or commit number 0.5.0
  • numpy version 1.21.5 / scikit-learn version 1.0.2

Why is bounded Laplace not used for all libraries?

I have gone through the paper (very nice research, congratulations), however, I still have two questions.

I understand that the variance must be non-negative by definition, and that this is why you apply the Bounded Laplace mechanism.
What I do not get is that you illustrate the problem with an example of a count on Mars (cool haha), and then only use Bounded Laplace for the variance.

Why do you not use Bounded Laplace for all the queries in your library? Any query can return a value below zero, even when that does not make sense.

With respect to the count_non_zero:

I imagine your library functions implemented in a lambda function in AWS, and a user would use it e.g. to get a count. Why limit the user to count only > 0 values?
Your function does not output an array, right? Then the user would not be able to do "array.size"; for that matter, in line with your logic, a user can also use "array.mean" in numpy.

Thank you again for your time and consideration.

nanmean is broken

Describe the bug

The nan* functions do not work due to a bug in the bounds calculation.
At this line, bounds are erroneously set to nan.
Thus, at this line, the condition becomes False, because np.allclose returns False by default when comparing nan values.
This causes this exception to be raised erroneously.

To fix: replace np.min and np.max with np.nanmin and np.nanmax
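
The behaviour described above can be checked with NumPy alone. This is a minimal sketch of the nan propagation, not the library's code; the variable names are illustrative.

```python
import numpy as np

array = np.array([0, 1, 2, 3, 4, np.nan])

# np.min / np.max propagate nan, so bounds inferred this way are (nan, nan).
naive_bounds = (np.min(array), np.max(array))

# np.nanmin / np.nanmax skip nan entries and recover the finite range.
nan_safe_bounds = (np.nanmin(array), np.nanmax(array))

# np.allclose treats nan as unequal to nan unless equal_nan=True,
# so a self-comparison of nan bounds fails by default.
nan_check_default = np.allclose(naive_bounds[0], naive_bounds[0])
nan_check_equal_nan = np.allclose(naive_bounds[0], naive_bounds[0], equal_nan=True)
```

With the nan-aware reductions the inferred bounds are finite, so a subsequent np.allclose comparison on them behaves as intended.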

To Reproduce

Just run the code:

import numpy as np
import diffprivlib

diffprivlib.tools.nanmean(np.array([0, 1, 2, 3, 4, np.nan]))

It raises:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/diffprivlib/tools/utils.py", line 267, in nanmean
    return _mean(array, epsilon=epsilon, bounds=bounds, axis=axis, dtype=dtype, keepdims=keepdims,
  File "/usr/local/lib/python3.8/dist-packages/diffprivlib/tools/utils.py", line 290, in _mean
    array = clip_to_bounds(np.ravel(array), bounds)
  File "/usr/local/lib/python3.8/dist-packages/diffprivlib/validation.py", line 195, in clip_to_bounds
    raise ValueError(f"For non-scalar bounds, input array must be 2-dimensional. Got {array.ndim} dimensions.")
ValueError: For non-scalar bounds, input array must be 2-dimensional. Got 1 dimensions.

Implementation of Decision Tree

Hi

I have a question about the implementation of private decision trees. You have a predict_proba function that should return the probabilities of each class. However, the paper you cite for constructing differentially private decision trees, Differentially Private Random Decision Forests using Smooth Sensitivity, only describes querying for the majority class label. This is seen, for instance, in Algorithm 1 of the paper and in the abstract. Is the predict_proba method differentially private?

Sorry for just posing a question in the Issues area, but I could not find anywhere to ask about it.

percentile function raises a RuntimeError when given a list of identical values

Hi,

I've come across this issue (at least I think it's an issue).
When I try to calculate a percentile over a list of data that contains only one distinct value (regardless of how many times it occurs), a RuntimeError is raised with the following message:

RuntimeError: Can't find a candidate to return. Debugging info: Rand: <random number between 0 and 1>, Probabilities: [nan, nan, nan, ... (depends on the length of the list)]

Is this behaviour expected?
If so, is there a workaround, or a flag I can pass to the function, to receive something other than an error?

Thank you :)
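
One plausible source of the nan values is sketched below. It assumes that candidate intervals between sorted values are weighted by their widths and then normalised, and that no wider bounds are supplied. Both are assumptions for illustration, not a confirmed description of the library. With a single distinct value every width is zero, so the normalisation divides zero by zero.

```python
import numpy as np

values = np.sort(np.array([3.0, 3.0, 3.0, 3.0]))  # one distinct value, repeated
quant = 0.5
k = values.size

# Widths of the gaps between neighbouring sorted values: all zero here.
widths = np.diff(values)

# Illustrative exponential-style scores for each gap, weighted by its width.
scores = np.exp(-np.abs(np.arange(widths.size) - quant * k) / 2)
weights = scores * widths

# Normalising divides 0 by 0, which NumPy evaluates to nan.
with np.errstate(invalid="ignore"):
    probabilities = weights / weights.sum()
```

If this reading is right, passing explicit bounds that are wider than the data might give the intervals non-zero width, though that is untested here.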

Different lengths of probabilities and measure arrays in /mechanisms/exponential.py#L130

In line 130 of /mechanisms/exponential.py,

probabilities *= np.array(measure) if measure else 1

the probabilities and measure arrays seem to have different lengths.

This is because, as per line 132 of https://github.com/IBM/differential-privacy-library/blob/main/diffprivlib/tools/quantiles.py#L132, measure=list(interval_sizes), and as per line 125, interval_sizes = np.diff(array). That makes len(measure)=len(array)-1.

On the other hand, as per line 131 of https://github.com/IBM/differential-privacy-library/blob/main/diffprivlib/tools/quantiles.py#L131, utility=list(-np.abs(np.arange(0, k + 1) - quant * k)), which makes len(utility)=k+1 (where k=len(array) as per line 121 of https://github.com/IBM/differential-privacy-library/blob/main/diffprivlib/tools/quantiles.py#L121).
Further, since the probabilities array is constructed from the utility array as per line 128,

probabilities = np.exp(scale * utility / 2)

that makes len(probabilities)=len(array)+1.

In that case, how can line 130,

probabilities *= np.array(measure) if measure else 1

be consistent, if ultimately len(probabilities)=len(array)+1 while len(measure)=len(array)-1?

Is this a bug?
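
The length arithmetic in this question can be reproduced with NumPy alone. The first part follows the reading given in the issue, where k is the length of the same array that is passed to np.diff. The second part is a purely hypothetical alternative in which two bound values are appended after k is computed. It is included only to show how the lengths could line up and is not a statement about the actual source.

```python
import numpy as np

data = np.array([1.0, 2.0, 4.0, 7.0])  # illustrative data
quant = 0.5

# Reading used in the issue: k is the length of the array passed to np.diff.
k = len(data)
utility = -np.abs(np.arange(0, k + 1) - quant * k)    # k + 1 entries
measure = np.diff(data)                               # k - 1 entries
lengths_as_read = (len(utility), len(measure))

# Hypothetical alternative: two bounds are appended after k is computed.
lower, upper = 0.0, 10.0                              # illustrative bounds
padded = np.sort(np.append(data, [lower, upper]))     # k + 2 entries
measure_padded = np.diff(padded)                      # k + 1 entries
lengths_padded = (len(utility), len(measure_padded))
```

Under the first reading the two arrays differ in length by two and the in-place multiplication could not broadcast. Under the second they match.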

using mechanisms individually

I have checked the documentation and could not find how to use mechanisms on a dataset without ML. Is there such a feature? For instance, I just want to apply the Gaussian mechanism to my dataset and then check the rows.
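
To make the question concrete, here is a minimal NumPy-only sketch of adding Gaussian noise to one column of a dataset. It does not use diffprivlib's mechanisms. It uses the classic Gaussian calibration, sigma = sqrt(2 ln(1.25/delta)) * sensitivity / epsilon, which assumes 0 < epsilon < 1. The helper names gaussian_sigma and add_gaussian_noise and the sample data are hypothetical.

```python
import math

import numpy as np


def gaussian_sigma(epsilon, delta, sensitivity):
    """Noise scale of the classic Gaussian mechanism (assumes 0 < epsilon < 1)."""
    if not 0 < epsilon < 1:
        raise ValueError("the classic calibration assumes 0 < epsilon < 1")
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    return math.sqrt(2 * math.log(1.25 / delta)) * sensitivity / epsilon


def add_gaussian_noise(values, epsilon, delta, sensitivity, rng):
    """Return a copy of values with independent Gaussian noise added to each entry."""
    sigma = gaussian_sigma(epsilon, delta, sensitivity)
    values = np.asarray(values, dtype=float)
    return values + rng.normal(0.0, sigma, size=values.shape)


rng = np.random.default_rng(0)
column = np.array([23.0, 35.0, 41.0, 29.0])  # illustrative data
noisy_column = add_gaussian_noise(
    column, epsilon=0.5, delta=1e-3, sensitivity=1.0, rng=rng
)
```

The sensitivity has to reflect how much one individual can change each released value, so for raw rows it would normally be tied to the range the values are clipped to.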

Pythonify mechanisms

In its current (v0.2) format, mechanisms are parametrised using setter methods, resulting in clunky code and implementations. A more pythonic approach using properties and initialisation variables would simplify the code while allowing for the same functionality.

Current parametrisation:
Gaussian().set_epsilon_delta(0.5, 0.001).set_sensitivity(1)

Proposed parametrisation:
Gaussian(epsilon=0.5, delta=0.001, sensitivity=1)

This will require a significant amount of refactoring wherever mechanisms are used (i.e., all tools and models), and will break backwards compatibility. In the long run however, this should be a welcome change.
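
A rough sketch of what the proposed style could look like, using keyword arguments at initialisation plus validating properties. GaussianSketch is a hypothetical stand-in written for illustration. It is not diffprivlib's implementation, and it performs no randomisation.

```python
class GaussianSketch:
    """Hypothetical stand-in showing the proposed parametrisation style."""

    def __init__(self, *, epsilon, delta, sensitivity):
        # Assigning through the properties runs validation at construction time.
        self.epsilon = epsilon
        self.delta = delta
        self.sensitivity = sensitivity

    @property
    def epsilon(self):
        return self._epsilon

    @epsilon.setter
    def epsilon(self, value):
        if value <= 0:
            raise ValueError("epsilon must be strictly positive")
        self._epsilon = float(value)

    @property
    def delta(self):
        return self._delta

    @delta.setter
    def delta(self, value):
        if not 0 <= value <= 1:
            raise ValueError("delta must lie in [0, 1]")
        self._delta = float(value)

    @property
    def sensitivity(self):
        return self._sensitivity

    @sensitivity.setter
    def sensitivity(self, value):
        if value < 0:
            raise ValueError("sensitivity must be non-negative")
        self._sensitivity = float(value)


mech = GaussianSketch(epsilon=0.5, delta=0.001, sensitivity=1)
```

Because validation lives in the setters, the same checks apply whether a parameter is set at construction or changed later.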

ML Support

I only see a few ML algorithms in the examples. Do you support other tree-based algorithms, such as XGBoost?
