
Comments (6)

amad-person commented on June 1, 2024

Hi @tr0p1x, thank you for opening this issue. I will look into the nonmember dataset initialization and get back to you.

I will also try to reproduce the issue with the intersection method. The tool was built using Python 3.6. Would it be possible for you to check whether this error still occurs after switching your Python version to 3.6?


luigitropiano commented on June 1, 2024

Hi @amad-person, thanks for your reply.

I confirm that I was able to reproduce the issue with Python 3.6.13, numpy 1.18.1, and tensorflow 2.1.2 as well:

Python 3.6.13 (default, Mar 10 2021, 10:46:47) 
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.version.version
'1.18.1'
>>> import tensorflow as tf
>>> tf.version.VERSION
'2.1.2'
>>> 
>>> A = tf.constant([1,2,3,4])
>>> B = tf.constant([0])
>>> C = tf.constant([0])
>>> 
>>> hash(bytes(np.array(B)))
3369305363185413356
>>> hash(bytes(np.array(C)))
3369305363185413356
>>> 
>>> hash(bytes(np.array((A, B))))
-160109343909451668
>>> hash(bytes(np.array((A, C))))
-8002582249467291080

The problem with the datasets not being mutually exclusive also still seems to be there.
To help with this, I am attaching the full dataset and the memberset I am using as input:


amad-person commented on June 1, 2024

@tr0p1x thanks for checking, I will get back to you asap.


luigitropiano commented on June 1, 2024

Hi @amad-person,
I'm back to this issue because I was able to confirm that the problem I've experienced is due to the same np.arrays hashing differently in different parts of the program. Indeed, it seems that the hashes calculated by the compute_hashes function are different from the hashes calculated by load_train for the same arrays. Below is the output of a test I ran that shows some of the hashes being computed differently:

['3.4,8.45,-0.47,-27.97,-24.89,-22.79,35.21,60.77,17.11,5.93,7.8,4.14,4.54,7.71,-0.03,6.84,11.49,4.16,-0.15,-1.47,0.02,34.0,32.0,73.0,-0.73,0.02,-0.36,39.0,24.0,58.0,46.75,132.06,17.31,0.51,-0.19,0.22,12.59,132.41,132.41,15.25,-40.98,-40.98,-39.37,87.7,87.7,91.29,9.36,9.36,9.55,132.36,132.36,16.69,132.74,132.74,17.99,-0.45,-0.45,-1.0,0.0,0.0,2.0,-0.35,-0.35,-0.1,22.0,22.0,18.0,17619.75,17619.75,323.74,1.0,-0.44,-0.44,188.14,3.4,8.45,-0.47,-27.97,-24.89,-22.79,35.21,60.77,17.11,5.93,7.8,4.14,4.54,7.71,-0.03,6.84,11.49,4.16,-0.15,-1.47,0.02,34.0,32.0,73.0,-0.73,0.02,-0.36,39.0,24.0,58.0,46.75,132.06,17.31,0.51,-0.19,0.22,12.59'
 '2']
compute_hashes -> 726796544415044867
['3.4,8.45,-0.47,-27.97,-24.89,-22.79,35.21,60.77,17.11,5.93,7.8,4.14,4.54,7.71,-0.03,6.84,11.49,4.16,-0.15,-1.47,0.02,34.0,32.0,73.0,-0.73,0.02,-0.36,39.0,24.0,58.0,46.75,132.06,17.31,0.51,-0.19,0.22,12.59,132.41,132.41,15.25,-40.98,-40.98,-39.37,87.7,87.7,91.29,9.36,9.36,9.55,132.36,132.36,16.69,132.74,132.74,17.99,-0.45,-0.45,-1.0,0.0,0.0,2.0,-0.35,-0.35,-0.1,22.0,22.0,18.0,17619.75,17619.75,323.74,1.0,-0.44,-0.44,188.14,3.4,8.45,-0.47,-27.97,-24.89,-22.79,35.21,60.77,17.11,5.93,7.8,4.14,4.54,7.71,-0.03,6.84,11.49,4.16,-0.15,-1.47,0.02,34.0,32.0,73.0,-0.73,0.02,-0.36,39.0,24.0,58.0,46.75,132.06,17.31,0.51,-0.19,0.22,12.59'
 '2']
load_train -> 8011021597571281556

['2.1,6.39,1.11,-17.02,-22.31,-13.07,11.49,65.01,6.7,3.39,8.06,2.59,0.98,1.83,0.6,3.99,10.28,2.82,1.12,-1.27,0.59,40.0,30.0,50.0,1.25,0.69,0.99,33.0,25.0,44.0,15.9,105.75,7.94,0.85,0.77,0.68,8.0,114.39,114.39,10.55,-10.99,-10.99,-12.63,5.46,5.46,7.77,2.34,2.34,2.79,114.49,114.49,10.26,114.41,114.41,10.91,-0.62,-0.62,-0.42,0.0,0.0,0.0,-0.16,-0.16,0.33,26.0,26.0,15.0,13090.7,13090.7,119.02,1.0,-0.16,-0.16,162.14,2.1,6.39,1.11,-17.02,-22.31,-13.07,11.49,65.01,6.7,3.39,8.06,2.59,0.98,1.83,0.6,3.99,10.28,2.82,1.12,-1.27,0.59,40.0,30.0,50.0,1.25,0.69,0.99,33.0,25.0,44.0,15.9,105.75,7.94,0.85,0.77,0.68,8.0'
 '3']
compute_hashes -> 2258577747523314709
['2.1,6.39,1.11,-17.02,-22.31,-13.07,11.49,65.01,6.7,3.39,8.06,2.59,0.98,1.83,0.6,3.99,10.28,2.82,1.12,-1.27,0.59,40.0,30.0,50.0,1.25,0.69,0.99,33.0,25.0,44.0,15.9,105.75,7.94,0.85,0.77,0.68,8.0,114.39,114.39,10.55,-10.99,-10.99,-12.63,5.46,5.46,7.77,2.34,2.34,2.79,114.49,114.49,10.26,114.41,114.41,10.91,-0.62,-0.62,-0.42,0.0,0.0,0.0,-0.16,-0.16,0.33,26.0,26.0,15.0,13090.7,13090.7,119.02,1.0,-0.16,-0.16,162.14,2.1,6.39,1.11,-17.02,-22.31,-13.07,11.49,65.01,6.7,3.39,8.06,2.59,0.98,1.83,0.6,3.99,10.28,2.82,1.12,-1.27,0.59,40.0,30.0,50.0,1.25,0.69,0.99,33.0,25.0,44.0,15.9,105.75,7.94,0.85,0.77,0.68,8.0'
 '3']
load_train -> 7696395958901921247

['1.08,9.57,2.32,-0.36,-0.19,-0.27,0.01,0.0,0.0,0.08,0.03,0.05,1.09,9.57,2.33,1.09,9.57,2.33,-0.48,1.27,-0.06,0.0,0.0,0.0,-0.38,0.47,-0.24,13.0,5.0,8.0,1.18,91.49,5.41,0.04,-0.06,-0.56,9.9,294.63,294.63,6.33,-1.26,-1.26,-1.18,0.07,0.07,0.07,0.26,0.26,0.26,294.61,294.61,6.32,294.63,294.63,6.34,0.35,0.35,-0.08,0.0,0.0,0.0,0.31,0.31,0.12,6.0,6.0,5.0,86805.89,86805.89,40.17,1.0,-0.28,-0.28,416.72,1.08,9.57,2.32,-0.36,-0.19,-0.27,0.01,0.0,0.0,0.08,0.03,0.05,1.09,9.57,2.33,1.09,9.57,2.33,-0.48,1.27,-0.06,0.0,0.0,0.0,-0.38,0.47,-0.24,13.0,5.0,8.0,1.18,91.49,5.41,0.04,-0.06,-0.56,9.9'
 '0']
compute_hashes -> 344646876975753846
['1.08,9.57,2.32,-0.36,-0.19,-0.27,0.01,0.0,0.0,0.08,0.03,0.05,1.09,9.57,2.33,1.09,9.57,2.33,-0.48,1.27,-0.06,0.0,0.0,0.0,-0.38,0.47,-0.24,13.0,5.0,8.0,1.18,91.49,5.41,0.04,-0.06,-0.56,9.9,294.63,294.63,6.33,-1.26,-1.26,-1.18,0.07,0.07,0.07,0.26,0.26,0.26,294.61,294.61,6.32,294.63,294.63,6.34,0.35,0.35,-0.08,0.0,0.0,0.0,0.31,0.31,0.12,6.0,6.0,5.0,86805.89,86805.89,40.17,1.0,-0.28,-0.28,416.72,1.08,9.57,2.32,-0.36,-0.19,-0.27,0.01,0.0,0.0,0.08,0.03,0.05,1.09,9.57,2.33,1.09,9.57,2.33,-0.48,1.27,-0.06,0.0,0.0,0.0,-0.38,0.47,-0.24,13.0,5.0,8.0,1.18,91.49,5.41,0.04,-0.06,-0.56,9.9'
 '0']
load_train -> -1813543256683344701

What I do not understand is why I cannot observe this behavior with every combination of full_dataset / memberset: given the same full_dataset, the outcome depends on the memberset. In the following examples I show two cases with two different membersets: with the first one I always get intersections, while with the other one I never get any. All the details are below:

The dataset has 5 output classes and samples with a shape of (111,).
The datahandler was configured with batch_size=100 and attack_percentage=70.

Case 1: with the following memberset I always get samples in the intersections of the computed datasets:

member_trainset samples: 7669
nonmember_trainset samples: 7669
member_testset samples: 3287
nonmember_testset samples: 3287

member_trainset ^ member_testset: 0
nonmember_trainset ^ nonmember_testset: 0
member_trainset ^ nonmember_trainset: ~2550
member_trainset ^ nonmember_testset: ~1080
member_testset ^ nonmember_testset: ~450
member_testset ^ nonmember_trainset: ~1080

Case 2: with the following memberset I never get any samples in the intersections of the computed datasets:

member_trainset samples: 7544
nonmember_trainset samples: 7544
member_testset samples: 3234
nonmember_testset samples: 3234

member_trainset ^ member_testset: 0
nonmember_trainset ^ nonmember_testset: 0
member_trainset ^ nonmember_trainset: 0
member_trainset ^ nonmember_testset: 0
member_testset ^ nonmember_testset: 0
member_testset ^ nonmember_trainset: 0

Both memberset.npy files are built in exactly the same way, and the only difference between them should be the samples they contain.


amad-person commented on June 1, 2024

it seems that the hashes calculated by the compute_hashes function are different from the hashes calculated by load_train for the same arrays.

Hi @tr0p1x, could you try setting the Python environment variable PYTHONHASHSEED to a fixed integer value?

You can set it in the main file like this:

import os
os.environ['PYTHONHASHSEED'] = '0'  # environment variable values must be strings

Documentation for this environment variable is here.
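
For reference, a minimal sketch (not part of ml_privacy_meter) of checking that a fixed PYTHONHASHSEED makes hash() of the same bytes reproducible across interpreter runs:

import os
import subprocess
import sys

# Run the same hash() twice in fresh interpreters with the seed fixed.
cmd = [sys.executable, '-c', "print(hash(b'abc'))"]
env = dict(os.environ, PYTHONHASHSEED='0')
out1 = subprocess.check_output(cmd, env=env)
out2 = subprocess.check_output(cmd, env=env)
print(out1 == out2)  # True: with the seed fixed, the hashes match across runs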

Could you also post your script (if any) to find the intersection between the datasets? Thanks.


luigitropiano commented on June 1, 2024

Hi @amad-person,
I was finally able to reproduce the issue.
It seems to be related to the ndarrays hashed by the compute_hashes and load_train methods having different dtypes.
Indeed, equal ndarrays with different dtypes hash differently:

>>> a = np.array(['1.0, 2.0, 3.0', '4'])
>>> a
array(['1.0, 2.0, 3.0', '4'], dtype='<U13')

>>> b = np.array(['1.0, 2.0, 3.0', '4'], dtype='<U14')
>>> b
array(['1.0, 2.0, 3.0', '4'], dtype='<U14')

>>> hash(bytes(a))
442301614551029001

>>> hash(bytes(b))
-8675837360597484159

The issue probably occurs because the dtype is automatically inferred by numpy most of the time, and it may end up being different between the dataset and the memberset. I think it is inferred from the longest element in the set (longest meaning the one with the most characters, since the samples are passed as strings); in my use case each sample may have a different number of characters, which is why the issue does not always show up.

That would also explain why I could not observe the issue with every combination of full_dataset / memberset, but only with some. If the full_dataset and the memberset happen to have the same dtype, then all the ndarrays hash consistently and the issue is not observable. Otherwise, if the full_dataset and the memberset have different dtypes, equal ndarrays hash differently and the final computed sets are no longer mutually exclusive.
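
A quick way to check this hypothesis (the file paths below are placeholders for the full dataset and the memberset mentioned above):

import numpy as np

# Placeholder paths; only the dtype comparison matters here.
full_dataset = np.load('dataset.npy')
memberset = np.load('memberset.npy')

# If the itemsizes differ (e.g. '<U13' vs '<U14' as in the example above),
# equal samples live in arrays with different dtypes, and hashing the raw
# buffer then gives different hashes for the same sample.
print(full_dataset.dtype, memberset.dtype)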

One easy way to fix the issue may be to stop computing the non-memberset within this tool and instead require it to be passed as input, already computed, together with the memberset.
Another fix may be to change how the hashes are computed and avoid hashing the ndarray directly, as sketched below.
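
For illustration, here is a minimal sketch of the second option, assuming the tool currently hashes the raw array buffer; content_hash below is a hypothetical helper, not part of the tool:

import numpy as np

# Two equal string arrays whose only difference is the inferred unicode itemsize,
# as in the REPL example above.
a = np.array(['1.0, 2.0, 3.0', '4'])                # dtype inferred as '<U13'
b = np.array(['1.0, 2.0, 3.0', '4'], dtype='<U14')  # same values, wider dtype

def content_hash(arr):
    # Hash the element contents, so padding from the fixed-width dtype is irrelevant.
    return hash('\x1f'.join(arr.tolist()))

print(hash(bytes(a)) == hash(bytes(b)))    # False: the raw buffers differ
print(content_hash(a) == content_hash(b))  # True: the contents are equal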

As for your question, the function I use to find the intersection between the datasets is the following:

    def count_intersect(self, first, second):
        # Unbatch both tf.data datasets so we iterate over individual examples.
        first = first.unbatch()
        second = second.unbatch()
        m1, m2 = set(), set()

        # Use the raw bytes of the feature tensor (example[0]) as the set key;
        # .tostring() is the older alias of .tobytes().
        for example in first:
            m1.add(np.array(example[0]).tostring())
            # alternatively: m1.add(np.array2string(np.array(example[0])))
        for example in second:
            m2.add(np.array(example[0]).tostring())
            # alternatively: m2.add(np.array2string(np.array(example[0])))
        return str(len(m1.intersection(m2)))

and finally I add the following right before the line where the main training procedure begins:

        print('mtrains ^ mtest: ' + self.count_intersect(mtrainset, mtestset))
        print('nmtrains ^ nmtest: ' + self.count_intersect(nmtrainset, nmtestset))
        
        print('mtrains ^ nmtrain: ' + self.count_intersect(mtrainset, nmtrainset))
        print('mtest ^ nmtest: ' + self.count_intersect(mtestset, nmtestset))
    
        print('mtrains ^ nmtest: ' + self.count_intersect(mtrainset, nmtestset))
        print('nmtrains ^ mtest: ' + self.count_intersect(nmtrainset, mtestset))

Note that in the count_intersect method I add to the m1 and m2 sets only example[0], which is the first part of my arrays (the string with all the values except the label): this works because I know my dataset has no duplicates.

Lastly, note that the second problem I presented in the opening post of this issue, related to the intersection method in the attack_utils class, is a separate issue, but it is also partially related to this one.

