Comments (6)
Hi @tr0p1x, thank you for opening this issue. I will look into the nonmember
dataset initialization and get back to you.
I will also try to reproduce the issue with the intersection
method. The tool was built using Python 3.6. Would it be possible for you to check whether this error still occurs after switching your Python version to 3.6?
from ml_privacy_meter.
Hi @amad-person, thanks for your reply.
I confirm I was able to reproduce the issue with python-3.6.13, numpy-1.18.1 and tensorflow-2.1.2 as well:
Python 3.6.13 (default, Mar 10 2021, 10:46:47)
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.version.version
'1.18.1'
>>> import tensorflow as tf
>>> tf.version.VERSION
'2.1.2'
>>>
>>> A = tf.constant([1,2,3,4])
>>> B = tf.constant([0])
>>> C = tf.constant([0])
>>>
>>> hash(bytes(np.array(B)))
3369305363185413356
>>> hash(bytes(np.array(C)))
3369305363185413356
>>>
>>> hash(bytes(np.array((A, B))))
-160109343909451668
>>> hash(bytes(np.array((A, C))))
-8002582249467291080
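A guess at what is happening in the snippet above (my own illustration, not part of the original exchange): since A and B have different shapes, np.array((A, B)) falls back to a dtype=object array, and the raw bytes of an object array are the stored PyObject pointers rather than the element values. The sketch below reproduces this with plain numpy arrays, constructing the object arrays explicitly because newer numpy versions refuse ragged np.array calls:

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([0])
c = np.array([0])  # equal in content to b, but a distinct object

# Build the object arrays explicitly (equivalent to np.array((a, b)) on
# ragged input in older numpy versions).
ab = np.empty(2, dtype=object)
ab[:] = [a, b]
ac = np.empty(2, dtype=object)
ac[:] = [a, c]

# The raw buffer of an object array holds pointers, so two arrays with
# equal contents serialize to different bytes and hash differently.
print(ab.dtype, ac.dtype)              # object object
print(ab.tobytes() == ac.tobytes())    # False: b and c are distinct objects
```

This would explain why hashing bytes of the combined array is unstable even within one process, independently of any hash-seed randomization.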
The problem with the datasets not being mutually exclusive also still seems to be present.
To help with this, I am attaching the full dataset and the member set I am using as input:
@tr0p1x thanks for checking, I will get back to you asap.
Hi @amad-person,
I'm back to this issue: I was able to show that the problem I've experienced is due to the same np.arrays hashing differently in different parts of the program. Specifically, the hashes calculated by the compute_hashes function differ from those calculated by load_train for the same arrays. Below is the output of a test I ran that shows some of the hashes being computed differently:
['3.4,8.45,-0.47,-27.97,-24.89,-22.79,35.21,60.77,17.11,5.93,7.8,4.14,4.54,7.71,-0.03,6.84,11.49,4.16,-0.15,-1.47,0.02,34.0,32.0,73.0,-0.73,0.02,-0.36,39.0,24.0,58.0,46.75,132.06,17.31,0.51,-0.19,0.22,12.59,132.41,132.41,15.25,-40.98,-40.98,-39.37,87.7,87.7,91.29,9.36,9.36,9.55,132.36,132.36,16.69,132.74,132.74,17.99,-0.45,-0.45,-1.0,0.0,0.0,2.0,-0.35,-0.35,-0.1,22.0,22.0,18.0,17619.75,17619.75,323.74,1.0,-0.44,-0.44,188.14,3.4,8.45,-0.47,-27.97,-24.89,-22.79,35.21,60.77,17.11,5.93,7.8,4.14,4.54,7.71,-0.03,6.84,11.49,4.16,-0.15,-1.47,0.02,34.0,32.0,73.0,-0.73,0.02,-0.36,39.0,24.0,58.0,46.75,132.06,17.31,0.51,-0.19,0.22,12.59'
'2']
compute_hashes -> 726796544415044867
['3.4,8.45,-0.47,-27.97,-24.89,-22.79,35.21,60.77,17.11,5.93,7.8,4.14,4.54,7.71,-0.03,6.84,11.49,4.16,-0.15,-1.47,0.02,34.0,32.0,73.0,-0.73,0.02,-0.36,39.0,24.0,58.0,46.75,132.06,17.31,0.51,-0.19,0.22,12.59,132.41,132.41,15.25,-40.98,-40.98,-39.37,87.7,87.7,91.29,9.36,9.36,9.55,132.36,132.36,16.69,132.74,132.74,17.99,-0.45,-0.45,-1.0,0.0,0.0,2.0,-0.35,-0.35,-0.1,22.0,22.0,18.0,17619.75,17619.75,323.74,1.0,-0.44,-0.44,188.14,3.4,8.45,-0.47,-27.97,-24.89,-22.79,35.21,60.77,17.11,5.93,7.8,4.14,4.54,7.71,-0.03,6.84,11.49,4.16,-0.15,-1.47,0.02,34.0,32.0,73.0,-0.73,0.02,-0.36,39.0,24.0,58.0,46.75,132.06,17.31,0.51,-0.19,0.22,12.59'
'2']
load_train -> 8011021597571281556
['2.1,6.39,1.11,-17.02,-22.31,-13.07,11.49,65.01,6.7,3.39,8.06,2.59,0.98,1.83,0.6,3.99,10.28,2.82,1.12,-1.27,0.59,40.0,30.0,50.0,1.25,0.69,0.99,33.0,25.0,44.0,15.9,105.75,7.94,0.85,0.77,0.68,8.0,114.39,114.39,10.55,-10.99,-10.99,-12.63,5.46,5.46,7.77,2.34,2.34,2.79,114.49,114.49,10.26,114.41,114.41,10.91,-0.62,-0.62,-0.42,0.0,0.0,0.0,-0.16,-0.16,0.33,26.0,26.0,15.0,13090.7,13090.7,119.02,1.0,-0.16,-0.16,162.14,2.1,6.39,1.11,-17.02,-22.31,-13.07,11.49,65.01,6.7,3.39,8.06,2.59,0.98,1.83,0.6,3.99,10.28,2.82,1.12,-1.27,0.59,40.0,30.0,50.0,1.25,0.69,0.99,33.0,25.0,44.0,15.9,105.75,7.94,0.85,0.77,0.68,8.0'
'3']
compute_hashes -> 2258577747523314709
['2.1,6.39,1.11,-17.02,-22.31,-13.07,11.49,65.01,6.7,3.39,8.06,2.59,0.98,1.83,0.6,3.99,10.28,2.82,1.12,-1.27,0.59,40.0,30.0,50.0,1.25,0.69,0.99,33.0,25.0,44.0,15.9,105.75,7.94,0.85,0.77,0.68,8.0,114.39,114.39,10.55,-10.99,-10.99,-12.63,5.46,5.46,7.77,2.34,2.34,2.79,114.49,114.49,10.26,114.41,114.41,10.91,-0.62,-0.62,-0.42,0.0,0.0,0.0,-0.16,-0.16,0.33,26.0,26.0,15.0,13090.7,13090.7,119.02,1.0,-0.16,-0.16,162.14,2.1,6.39,1.11,-17.02,-22.31,-13.07,11.49,65.01,6.7,3.39,8.06,2.59,0.98,1.83,0.6,3.99,10.28,2.82,1.12,-1.27,0.59,40.0,30.0,50.0,1.25,0.69,0.99,33.0,25.0,44.0,15.9,105.75,7.94,0.85,0.77,0.68,8.0'
'3']
load_train -> 7696395958901921247
['1.08,9.57,2.32,-0.36,-0.19,-0.27,0.01,0.0,0.0,0.08,0.03,0.05,1.09,9.57,2.33,1.09,9.57,2.33,-0.48,1.27,-0.06,0.0,0.0,0.0,-0.38,0.47,-0.24,13.0,5.0,8.0,1.18,91.49,5.41,0.04,-0.06,-0.56,9.9,294.63,294.63,6.33,-1.26,-1.26,-1.18,0.07,0.07,0.07,0.26,0.26,0.26,294.61,294.61,6.32,294.63,294.63,6.34,0.35,0.35,-0.08,0.0,0.0,0.0,0.31,0.31,0.12,6.0,6.0,5.0,86805.89,86805.89,40.17,1.0,-0.28,-0.28,416.72,1.08,9.57,2.32,-0.36,-0.19,-0.27,0.01,0.0,0.0,0.08,0.03,0.05,1.09,9.57,2.33,1.09,9.57,2.33,-0.48,1.27,-0.06,0.0,0.0,0.0,-0.38,0.47,-0.24,13.0,5.0,8.0,1.18,91.49,5.41,0.04,-0.06,-0.56,9.9'
'0']
compute_hashes -> 344646876975753846
['1.08,9.57,2.32,-0.36,-0.19,-0.27,0.01,0.0,0.0,0.08,0.03,0.05,1.09,9.57,2.33,1.09,9.57,2.33,-0.48,1.27,-0.06,0.0,0.0,0.0,-0.38,0.47,-0.24,13.0,5.0,8.0,1.18,91.49,5.41,0.04,-0.06,-0.56,9.9,294.63,294.63,6.33,-1.26,-1.26,-1.18,0.07,0.07,0.07,0.26,0.26,0.26,294.61,294.61,6.32,294.63,294.63,6.34,0.35,0.35,-0.08,0.0,0.0,0.0,0.31,0.31,0.12,6.0,6.0,5.0,86805.89,86805.89,40.17,1.0,-0.28,-0.28,416.72,1.08,9.57,2.32,-0.36,-0.19,-0.27,0.01,0.0,0.0,0.08,0.03,0.05,1.09,9.57,2.33,1.09,9.57,2.33,-0.48,1.27,-0.06,0.0,0.0,0.0,-0.38,0.47,-0.24,13.0,5.0,8.0,1.18,91.49,5.41,0.04,-0.06,-0.56,9.9'
'0']
load_train -> -1813543256683344701
What I do not understand is why I cannot observe this behavior with every combination of full_dataset / memberset: given the same full_dataset, the outcome depends on the memberset. Below I show two cases with two different membersets: in the first I always get intersections, while in the second I never get any. All the details follow:
The dataset has 5 output classes and samples with a shape of (111,).
The datahandler was configured with batch_size=100 and attack_percentage=70.
Case 1: with the following memberset I always get samples in the intersections of the computed datasets:
member_trainset samples: 7669
nonmember_trainset samples: 7669
member_testset samples: 3287
nonmember_testset samples: 3287
member_trainset ^ member_testset: 0
nonmember_trainset ^ nonmember_testset: 0
member_trainset ^ nonmember_trainset: ~2550
member_trainset ^ nonmember_testset: ~1080
member_testset ^ nonmember_testset: ~450
member_testset ^ nonmember_trainset: ~1080
Case 2: with the following memberset I never get any samples in the intersections of the computed datasets:
member_trainset samples: 7544
nonmember_trainset samples: 7544
member_testset samples: 3234
nonmember_testset samples: 3234
member_trainset ^ member_testset: 0
nonmember_trainset ^ nonmember_testset: 0
member_trainset ^ nonmember_trainset: 0
member_trainset ^ nonmember_testset: 0
member_testset ^ nonmember_testset: 0
member_testset ^ nonmember_trainset: 0
Both memberset.npy files are built in exactly the same way, and the only difference between them should be the samples they contain.
it seems that the hashes calculated by the compute_hashes function are different from the hashes calculated by the load_train, for the same arrays.
Hi @tr0p1x, could you try setting the Python environment variable PYTHONHASHSEED
to an integer value?
You can set it in the main file like this:
import os
os.environ['PYTHONHASHSEED'] = '0'
Documentation for this environment variable is here.
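One caveat worth noting (an observation added here, not part of the original exchange): hash randomization for str and bytes objects is fixed when the interpreter starts, so PYTHONHASHSEED has to be set in the environment before launching Python; assigning os.environ['PYTHONHASHSEED'] inside an already-running script only affects child processes it spawns. A quick standalone check:

```shell
# Two separate interpreter runs with the same seed agree on the hash of
# identical bytes; without PYTHONHASHSEED they would generally differ.
PYTHONHASHSEED=0 python3 -c 'print(hash(b"sample"))'
PYTHONHASHSEED=0 python3 -c 'print(hash(b"sample"))'
```

Note that within a single run all parts of the program share the same salt, so PYTHONHASHSEED only matters when comparing hashes across runs.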
Could you also post your script (if any) to find the intersection between the datasets? Thanks.
Hi @amad-person,
I was finally able to reproduce the issue.
It seems to be related to a different dtype of the ndarrays that are hashed by the compute_hashes and load_train methods.
Indeed, equal ndarrays with different dtypes hash differently:
>>> a = np.array(['1.0, 2.0, 3.0', '4'])
>>> a
array(['1.0, 2.0, 3.0', '4'], dtype='<U13')
>>> b = np.array(['1.0, 2.0, 3.0', '4'], dtype='<U14')
>>> b
array(['1.0, 2.0, 3.0', '4'], dtype='<U14')
>>> hash(bytes(a))
442301614551029001
>>> hash(bytes(b))
-8675837360597484159
The issue probably occurs because numpy infers the dtype automatically most of the time, and the inferred dtype may differ between the dataset and the memberset. I think it is computed based on the longest array in the set (longest meaning the one with the most characters, as the arrays are passed as strings). In my use case each array may have a different number of characters, which is why the issue does not always appear.
That would also explain why I could not observe the issue with every combination of full_dataset / memberset, but only with some. If the full_dataset and the memberset happen to have the same dtype, all the ndarrays hash consistently and the issue is not observable. Otherwise, if the full_dataset and the memberset have different dtypes, equal ndarrays hash differently and the final computed sets are no longer mutually exclusive.
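This dtype-inference behavior is easy to confirm in isolation (a standalone check with made-up sample strings, not data from the tool): numpy sizes an inferred unicode dtype to the longest string present, so the same sample gets a different fixed-width representation, and hence different raw bytes, depending on which other samples it is loaded alongside:

```python
import numpy as np

sample = '1.0,2.0,3.0'                       # 11 characters
a = np.array([sample, '4'])                  # longest string: 11 -> dtype <U11
b = np.array([sample, 'a-much-longer-one'])  # longest string: 17 -> dtype <U17

print(a.dtype, b.dtype)                      # <U11 <U17
print(a[0] == b[0])                          # True: same value...
print(a[:1].tobytes() == b[:1].tobytes())    # False: different zero-padding
```

The element values compare equal, but the fixed-width buffers do not, which is exactly the condition for bytes-based hashing to disagree.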
One easy way to fix the issue may be to stop computing the non-memberset within this tool and instead require it to be passed as input together with the memberset, already computed.
Another fix may be to change how the hashes are computed and avoid hashing the ndarray directly.
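As a sketch of that second option (my own illustration, not code from the tool): converting elements to plain Python strings before hashing removes the fixed-width dtype from the picture, so equal contents hash equally regardless of how the dtype was inferred. This assumes 1-D arrays of strings, as in the examples above:

```python
import numpy as np

def content_hash(arr):
    # tolist() decodes every element to a plain Python str, discarding the
    # fixed-width unicode dtype, so only the values influence the hash.
    return hash(tuple(np.asarray(arr).tolist()))

a = np.array(['1.0, 2.0, 3.0', '4'])                 # inferred dtype <U13
b = np.array(['1.0, 2.0, 3.0', '4'], dtype='<U14')
print(content_hash(a) == content_hash(b))            # True, despite the dtypes
```

The result is still salted per process (like any str hash), but it is consistent within a single run, which is what matters when compute_hashes and load_train run in the same process.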
As for your question, the function I use to find the intersection between the datasets is the following:
def count_intersect(self, first, second):
    first = first.unbatch()
    second = second.unbatch()
    m1, m2 = set(), set()
    for example in first:
        m1.add(np.array(example[0]).tostring())
        # m1.add(np.array2string(np.array(example[0])))
    for example in second:
        m2.add(np.array(example[0]).tostring())
        # m2.add(np.array2string(np.array(example[0])))
    return str(len(m1.intersection(m2)))
and finally I add the following right before the line where the main training procedure begins:
print('mtrains ^ mtest: ' + self.count_intersect(mtrainset, mtestset))
print('nmtrains ^ nmtest: ' + self.count_intersect(nmtrainset, nmtestset))
print('mtrains ^ nmtrain: ' + self.count_intersect(mtrainset, nmtrainset))
print('mtest ^ nmtest: ' + self.count_intersect(mtestset, nmtestset))
print('mtrains ^ nmtest: ' + self.count_intersect(mtrainset, nmtestset))
print('nmtrains ^ mtest: ' + self.count_intersect(nmtrainset, mtestset))
Note that in the count_intersect method I am adding only example[0] to the m1 and m2 sets, i.e. the first part of my arrays (the string with all the values except the label): this works because I know my dataset has no duplicates.
Lastly, note that the second issue I presented in the opening post, related to the intersection method in the attack_utils class, is a separate issue, although it is also partially related to this one.