ssoelvsten / mlightdp Goto Github PK

An extensive implementation of the LightDP language of Zhang and Kifer used to prove algorithms to be privacy-preserving

License: MIT License

Makefile 1.10% OCaml 71.57% Standard ML 27.22% Dafny 0.12%

differential-privacy language-based-security programming-language verification

mlightdp's Introduction

mlightdp's People

Contributors

Stargazers

Watchers

mlightdp's Issues

T-OModulo Rule Update

If e1(1) and e1(2) are two adjacent values such that e1(1) −e1(2) = c, then
e1(1) −e1(2) ≡ c mod e2.
Hence, one could think to allow e1 have distance d and let the result have distance d % e2. The function may not be injective in all the integers, but it is in the ring of integers modulo e2.

Return value in postcondition

We do currently not support to name the return value to make it available in the scope of the post condition. This is currently of no issue, since no postconditions are needed in our examples, but one could expect our tool to be used to simultaneously prove functional correctness together with its privacy guarantees.

Allow wider usage of ^ operator

Currently we only allow the ^ operator to be used directly on variables with a star-type. In most other cases it is not necessary, but one might as well allow a wider usage.

Allow ^ on non-star type (treat is as a reference to whatever distance it has)

let eta1 : float[3] = Lap(1/3);
let eta2 : float[^eta1] = Lap(1/3);

Allow ^ on expressions (infer the distance by the exp typing rules)

let eta3 : float[^(eta1 + eta2)] = Lap(1/3);

As before ^ should only be allowed within a distance, not within the expressions

Verify PartialSum

The PartialSum example partial_sum.mldp currently contains the preconditions and invariants of Zhang and Kifer [ZK17], but Dafny is not able to verify the output.

Distance inference by mutable dependency tracking

Dependant on #6 .

We already track the mutable dependencies of random variables η. Currently we only use this information to discern whether the distance provided by the programmer is in conflict with the mutable dependencies.

Zhang and Kifer [ZK17] differentiate between three cases. In the first two cases, η may have any distance d, while in the third η may only have a distance ∗. One may extend our dependency tracking to change the distance of η in the environment to ∗ when recognising the third case applies to η.

Delete end-to-end test artifacts

The artifacts for the end-to-end tests are persisted, while they only are supposed to be used to move data between jobs. We need a fourth and last cleanup step.

See:
https://developer.github.com/v3/actions/artifacts/#delete-an-artifact

Non-zero distance function definitions

Currently, everything returned must have distance 0, but after #26 and #27 one can allow non-zero distance for helper functions. The main algorithm still has to return values only of distance 0 to make it provably privacy-preserving, but one should be able to differentiate between them within the syntax.

Solve for optimal distance variables with νZ

Dependant on #7 #10

In #7 we'd leave uninferrable distance variables as ghost arguments to the transformed program. One can find the optimal solution in which the distance variables satisfy all the proof obligations generated. Bjørner and Phan have extended the SMT solver Z3 to also be a MaxSMT solver νZ, which Zhang and Kifer [ZK17] successfully used to infer the optimal distances of the random variables. After integration with Z3 in #10 we should try and extend the integration to use νZ

Allow multiple zero-distance function definitions

It seems possible a user would like to define multiple helper functions for their algorithms. To start simple, it's better to start with no mutual recursion and then considering the following design decisions at first.

Keep ensuring/assumption of zero-distance return value.
The epsilon privacy should be set as part of the function call. Would it not be confusing if epsilon always was added as a first argument?
Maybe an ordering on the arguments should be ensured (hat variables always after * distance variables, and epsilon argument always first?)

Then one can extend this to be mutually recursive functions.

Customizable privacy bound

Currently we only support epsilon-private algorithms, but many algorithms, such as SmartSum in the presentation by Zhang and Kifer [ZK17], are c · epsilon-private for some positive constant c = 1. We suggest to add a new annotation that describes the privacy guarantee, which is then used to generate the assertions before all return statements. The SmartSum of [ZK17] would then have the added annotation

privacy: 2 * epsilon

If no such annotation has been provided, then epsilon is the default used.

Unable to generate list size precondition for lists of higher dimensions

If one creates an argument of type list(list(float[*])) the tool will crash with a NotImplemented exception.

Verify SmartSum

The SmartSum example partial_sum.mldp currently contains the preconditions and invariants of Zhang and Kifer [ZK17], but Dafny is not able to verify the output.

This may need #14 first to be implemented.

Dafny integrated into toolchain

Our tool could immediately call Dafny if the program is successfully transformed. If one also makes a one-to-one mapping between the line numbers of the original program and the transformed program, then Dafny errors can be given with the original line numbers.

Output worst-case privacy-budget cost on Laplace

To prove GapSparseVector correct Wang et al. [Wan+19] removed the nonlinearity introduced in the absolute value due to the T-Laplace . Their approach was to replace the |d| with its numerical upper bound of 2. This upper bound is true due to the precondition that ∀i < |q| : −1 ≤ ˆq[i] ≤ 1.

In general, one could automate major parts of this step. If there is a bound on a star typed variable x such that l ≤ ˆx ≤ u, then one can add the bound |l|+|u| to v? rather than |d|. If l and u are provided by the programmer, then one may want to prove Φ =⇒ l ≤ d ≤ u as an assert statement in the transformed program.

One should notice that max(|l|,|u|) ≤ |l|+|u| ≤ 2max(|l|,|u|), which makes it a pretty tight over-approximation. To infer l and u one might need first to look into #11 .

Add comments to MLightDP programs

Maybe one should even keep them throughout all processing to still output them everywhere you want?

Function overloading / polymorphic types

The Size function only works on lists of floats rather than any list.

Refinement Step

Dependant on #6 and #40 .

Implement the Refinement Step of Zhang and Kifer [ZK17]

Mutable dependency tracking should be on distance not expressions

The mutable dependency tracking in refine.ml is supposed to track the dependencies in the distance in the type and not the mutable dependencies in the expression assigned.

One might want to wait to fix this until after #6 .

Nonlinear arithmetic

To prove GapSparseVector correct Wang et al. [Wan+19] removed the nonlinearity introduced in the absolute value due to the T-Laplace. Their approach was to replace the |d| with its numerical upper bound of 2. This upper bound is true due to the precondition that

∀i < |q| : −1 ≤ ˆq[i] ≤ 1 .

In general, one could automate major parts of this step. If there is a bound on a star typed variable x such that

l ≤ ˆx ≤ u ,

then one may provide a variant of the Lap function that instead adds the bound |l|+|u| to v_epsilon rather than |d|. If l and u are provided by the programmer, then one may want to prove

Φ =⇒ l ≤ d ≤ u

as an assert statement in the transformed program.

Discharge proof obligations with Z3

During the differential privacy type checking we currently generate proof obligations of
numerical constraints, which are output into the resulting Dafny file. We experienced that Dafny has trouble with verifying a program, when it is requested to prove too many trivial statements such as 0 = 0, ˆq[i] − ˆq[i] = 0 etc. As long as no variables are involved, such as in the first expression, then it can be trivially discharged by computing the result. In the second expression one could also discharge this by normalising the equation and make terms cancel out. The proof obligations may become more complicated than that, at which point a much more sophisticated tool like Z3 would have to be used anyway. Hence, we suggest to merely discharge all proof obligations with Z3; if it fails we then output them to Dafny as we currently do.

Mutable input arguments

We currently require all function arguments to be immutable. One goal with our language was to minimise the differential privacy analysis going in the way of the wishes of the programmer; allowing the programmer to specify an argument as mutable will bring our language further in line with other
languages.

Include dafny verification in End-to-End Test

The current End-to-End test in .github/workflows/end-to-end.yml only covers as the compilation of all our examples by the tool. Both because it would be cool, but also to catch more subtle errors in our program we may want to setup Dafny as part of the Github Action.

Output assertion for Laplace privacy-budget cost bound

In the Visual Studio Code version of Dafny one does not need to do any of this, but rather only replaces the equality in the invariant to be a less or equal. When running Dafny from the console, one then also needs to change the outputted Dafny file to add the following line

assert Abs(if q[i] + eta2 >= t'
           then (1 as real) - q_hat0[i]
           else (0 as real)) as real <= 2.0;

or add the precondition as an invariant. In general, one could autogenerate this assertion. If there is a bound on a star typed variable x such that l ≤ ˆx ≤ u, then an assertion of the following form can be output

assert l <= q[i] <= u

One may look into this in the following order

Provide the l and/or u manually in the syntax
Autogenerate the optimal l and u bounds. This might need #11 .

Merge distance and expression in AST

After having done #6 one can merge the dist and the exp type which will probably be necessary to better deal with distances of lists #22 , fix #30 , and will also in general result in removal of some duplicated logic.

Type and distance inference for function parameters

One should be able to infer the base type of an argument to further lessen the annotation burden on the programmer. One may be able to infer the distances of function parameters with the refinement rules of Zhang and Kifer [ZK17], though one should be cautious when applying them. It may be safest to start with assuming these distances to be zero if nothing else is specified.

Separate base type and distance

We currently have distances and types combined in a single value with the following type definition in ast.mli

type 'd ty = IntTy   of 'd
           | FloatTy of 'd
           | BoolTy
           | ListTy  of 'd ty
           | Error

Which in retrospect has been giving us more headaches than is worth it. We should completely separate the base type from the distance. We then place the distance in the DeclStmt as a separate field.

Mutable lists of variable distance

Since no example of ours include a mutable list of type list(float[∗]) , we do currently only support immutable lists of that type. One quickly realises though, that removing this restriction is not an easy endeavour when concatenation and the empty list comes into the picture.

Infer non-zero distance function calls

The distance of a function output may rely on the distance of the function input. In other words, functions can, in general, be dependently typed and should support inferring the output distance of a function based on the input.

This most likely completely changes how one deals with functions in dp.ml if not also elsewhere too.

Mutable dependency tracking for normal variables

Zhang and Kifer [ZK17] assume all free variables in a normal variable are immutable. Currently, we do no such check.

One wonders though why they do this assumption. Based on their description of how to handle random variables and their proof of soundness, it seems it all is connected to whether the distance of a variable x merely is the inferred value or one has to use a ghost variable ^x to contain the backup of the old value before mutation.

One should hence think to extend the distance tracking of #8 to also include normal variables.

T-OTimes Rule Update

Zhang and Kifer [ZK17] mention the possibility to improve the T-OTimes rule to be more fine-grained than the requirement of e1 ⊗ e2 has type num[0] given e1 and e2 do too. They do so by referring to the work of Reed and Pierce [RP10] and Gaboardi et al. [Gab+13]. We forward this reference for integrating into MLightDP this more fine-grained version of the rule. One may especially want to take a look at page 5 in [Gab+13].

Categorical values and the Exponential Mechanism

Zhang and Kifer [ZK17] mention briefly the ability to extend the notion of distance to boolean values. Based on the PrivBernoulli algorithm [ZK17, Figure 13] they first sample η from a Uniform distribution in [0,1] and then let b be true if η > t ∈ [0,1] and false otherwise. This idea is easily extensible to categorical values of more than two values, such as enums, by use of the more general Exponential Mechanism [ZK17].

They provide a differential privacy type-checking rule and transformation for sampling from a uniform distribution, which should be possible to directly incorporate into MLightDP. One may further abstract this from the user to allow one to sample a random boolean that is true with probability p, such that they only need to write

let b := RandBool(p)

which in the refinement step is translated into

let eta = Uniform(0,1)
let b = eta <= p ? true : false

and finally in the differential privacy type checking step is proven differentiably private according to the rule as provided by Zhang and Kifer [ZK17].

Output Rust function

With a pretty printer on the original AST one could output a Rust file of the algorithm.

Add custom assertions in MLightDP programs

In reflection on #33, we may want to allow assert statements to be part of the source program.

Variable distance integers

We do currently only support values of type float to have distance ∗, which has sufficed for our current examples. Our tool would be more complete if a variable x could have type int[∗] , though treating the value of ˆx as being integral may prove less than trivial.

Free the Laplace

Both the language of Zhang and Kifer [ZK17] and of our current implementation requires the Lap function to be used in isolated lines. Inspired by the proof obligations created during the differential privacy type checking, we propose to replace Lap functions in an expression with fresh variables η_i . The η_i variables are then declared right before the line they originated on.

Then t’ of the many Sparse Vector examples can be instantiated in the single line

let t’ := t + Lap(3 / epsilon)

which then in the refinement step automatically is converted into

let eta1 := Lap(3 / epsilon)
let t’ := t + eta1

at which point the normal refinement rules can infer the distances of the many newly generated η_i variables. One should notice, that this transformation is sound, as it purposely adheres to the most conservative assumptions of Zhang and Kifer [ZK17].