
rappor's Introduction

RAPPOR

RAPPOR is a novel privacy technology that allows inferring statistics about populations while preserving the privacy of individual users.

This repository contains simulation and analysis code in Python and R.

For a detailed description of the algorithms, see the paper and links below.

Feel free to send feedback to [email protected].

Running the Demo

Although the Python and R libraries should be portable to any platform, our end-to-end demo has only been tested on Linux.

If you don't have a Linux box handy, you can view the generated output.

To set up your environment, you need some packages and R dependencies. A setup script installs them:

$ ./setup.sh

Then build the native components:

$ ./build.sh

This compiles and tests the fastrand C extension module for Python, which speeds up the simulation.

Finally, run the demo:

$ ./demo.sh

The demo strings together the Python and R code. It:

  1. Generates simulated input data with different distributions
  2. Runs it through the RAPPOR privacy-preserving reporting mechanisms
  3. Analyzes and plots the aggregated reports against the true input

The output is written to _tmp/regtest/results.html, and can be opened with a browser.
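Step 2 can be pictured with a minimal Python sketch of the two randomization stages described in the RAPPOR paper: a permanent randomized response with parameter f, followed by an instantaneous one with parameters p and q. The function names are illustrative assumptions, not the interface of rappor.py, and the sketch redraws the permanent response on every call, whereas a real client would memoize it per value.

```python
import random


def encode_report(true_bits, f, p, q, rng):
    """Randomize one client's bit vector (hypothetical helper, not rappor.py)."""
    # Permanent randomized response: with probability f, replace the bit
    # with a fair coin flip; otherwise keep the true bit.
    prr = []
    for bit in true_bits:
        if rng.random() < f:
            prr.append(1 if rng.random() < 0.5 else 0)
        else:
            prr.append(bit)
    # Instantaneous randomized response: report 1 with probability q when
    # the PRR bit is 1, and with probability p when it is 0.
    return [1 if rng.random() < (q if bit else p) else 0 for bit in prr]


def sum_bits(reports):
    """Aggregate reports into per-bit counts, the input to the analysis."""
    counts = [0] * len(reports[0])
    for report in reports:
        for i, bit in enumerate(report):
            counts[i] += bit
    return counts
```

With f = 0, p = 0 and q = 1 the report equals the input, which is a convenient sanity check before trying noisier settings.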

Dependencies

R analysis (analysis/R):

Demo dependencies (demo.sh):

These are necessary if you want to test changes to the code.

Python client (client/python):

  • None. You should be able to just import the rappor.py file.

Platform:

  • R: tested on R 3.0.
  • Python: tested on Python 2.7.
  • OS: the shell script tests have been tested on Linux, but may work on Mac/Cygwin. The R and Python code should work on any OS.

Development

To run tests:

$ ./test.sh

This currently runs Python unit tests, lints Python source files, and runs R unit tests.

API

rappor.py is a tiny standalone Python file, and you can easily copy it into a Python program.

NOTE: Its interface is subject to change. We are in the demo stage now, but if there's demand, we will document and publish the interface.

The R interface is also subject to change.

The fastrand C module is optional. It's likely only useful for simulation of thousands of clients. It doesn't use cryptographically strong randomness, and thus should not be used in production.

Directory Structure

analysis/
  R/                 # R code for analysis
  cpp/               # Fast reimplementations of certain analysis
                     #   algorithms
apps/                # Web apps to help you use RAPPOR (using Shiny)
bin/                 # Command line tools for analysis.
client/              # Client libraries
  python/            # Python client library
    rappor.py
    ...
  cpp/               # C++ client library
    encoder.cc
    ...
doc/                 # Documentation
tests/               # Tools for regression tests
  compare_dist.R     # Test helper for single variable analysis
  gen_true_values.R  # Generate test input
  make_summary.py    # Generate an HTML report for the regtest
  rappor_sim.py      # RAPPOR client simulation
  regtest_spec.py    # Specification of test cases
  ...
build.sh             # Build scripts (docs, C extension, etc.)
demo.sh              # Quick demonstration
docs.sh              # Generate docs from the markdown in doc/
gh-pages/            # Where generated docs go. (A subtree of the branch gh-pages)
pipeline/            # Analysis pipeline code.
regtest.sh           # End-to-end regression tests, including client
                     #  libraries and analysis
setup.sh             # Install dependencies (for Linux)
test.sh              # Test runner

Documentation

Publications

Links

  • Google Blog Post about RAPPOR
  • RAPPOR implementation in Chrome
    • This is a production quality C++ implementation, but it's somewhat tied to Chrome, and doesn't support all privacy parameters (e.g. only a few values of p and q). On the other hand, the code in this repo is not yet production quality, but supports experimentation with different parameters and data sets. Of course, anyone is free to implement RAPPOR independently as well.
  • Mailing list: [email protected]

rappor's People

Contributors

ananthr, andychu, cpovirk, huahang, ilyamironov, lally, mdeshon-google, nlohmann, tkaitchuck


rappor's Issues

pipeline/task_spec_test.py is broken

pipeline/task_spec_test.py depends on the existence of the files:
_tmp/counts/2015-12-01/exp_counts.csv
_tmp/counts/2015-12-01/gauss_counts.csv
_tmp/counts/2015-12-02/exp_counts.csv
_tmp/counts/2015-12-02/gauss_counts.csv

Which appear to be generated by:
pipeline/regtest.sh create-counts
which depends on:

  1. Being run from within the pipeline directory
  2. The existence of csv files under ../_tmp/python/demo1/1/
  3. The existence of csv files under ../_tmp/python/demo2/1/
  4. The existence of csv files under ../_tmp/python/demo3/1/

Which presumably comes from having previously run:
./regtest.sh run-seq '^demo' 1 python
and not cleaning up or running any other demos (which would cause cleanup to occur) afterwards.
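One way to make this dependency explicit is to have the test check for its inputs and skip with a clear message when they are absent. The following is only a sketch: the helper name missing_inputs and the test class are hypothetical, and the paths are the four listed above, taken relative to the current directory.

```python
import os
import unittest

# Input files named in this issue, relative to the working directory.
REQUIRED_INPUTS = [
    "_tmp/counts/2015-12-01/exp_counts.csv",
    "_tmp/counts/2015-12-01/gauss_counts.csv",
    "_tmp/counts/2015-12-02/exp_counts.csv",
    "_tmp/counts/2015-12-02/gauss_counts.csv",
]


def missing_inputs(paths, root="."):
    """Return the subset of paths that do not exist under root."""
    return [p for p in paths if not os.path.exists(os.path.join(root, p))]


class TaskSpecInputsTest(unittest.TestCase):
    """Hypothetical guard: skip instead of failing obscurely."""

    def setUp(self):
        missing = missing_inputs(REQUIRED_INPUTS)
        if missing:
            self.skipTest(
                "run 'pipeline/regtest.sh create-counts' first; missing: "
                + ", ".join(missing))

    def testInputsPresent(self):
        self.assertEqual([], missing_inputs(REQUIRED_INPUTS))
```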

HELP!!! How do I make connection by localhost:6789

Whether I run ./run_app.sh [port] or plain ./run_app.sh, the connection cannot be made and a warning message is displayed in the terminal window.

I also checked the directories carefully, and all of them look correct. What am I still doing wrong?

HELP!!!

refactoring: EstimateBloomCounts

Boolean/Basic RAPPOR have no cohorts.

It would be cleaner if EstimateBloomCounts didn't deal with cohorts. And then we can write a wrapper that does apply() over cohorts for string RAPPOR.

Also, in basic RAPPOR, it would be better to call this Denoise/SubtractNoise or something, because there's no bloom filter.
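A rough Python sketch of the proposed split, using the count estimator from the RAPPOR paper. The function names are assumptions made for illustration, and the actual R code may be organized differently.

```python
def denoise_counts(counts, num_reports, f, p, q):
    """Estimate how many clients truly set each bit, with no cohort logic.

    A bit that is truly 0 is reported as 1 with probability
    p_star = p + f*q/2 - f*p/2, and a bit that is truly 1 with
    probability q_star = p_star + (1 - f) * (q - p).
    Solving E[count] = t*q_star + (N - t)*p_star for t gives the estimate.
    """
    p_star = p + 0.5 * f * q - 0.5 * f * p
    scale = (1.0 - f) * (q - p)
    return [(c - p_star * num_reports) / scale for c in counts]


def denoise_by_cohort(cohort_counts, cohort_sizes, f, p, q):
    """Wrapper for string RAPPOR: apply the denoising once per cohort."""
    return [denoise_counts(counts, size, f, p, q)
            for counts, size in zip(cohort_counts, cohort_sizes)]
```

Basic RAPPOR would call denoise_counts directly, while string RAPPOR would go through the per-cohort wrapper.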

Choose stable hash function

Criteria for choosing, in order of importance:

  1. Should be stable, and well defined
  2. Should be available in all languages (e.g. JavaScript could be an issue), or easy to implement in all languages.
  3. Should be reasonably fast to simulate

Candidates:

  • md5 (since it's stable, and available)
  • murmur hash

Ruled out:

  • sha1: we are using this for the demo, but it seems to imply you need a cryptographic hash function, which we should avoid.
  • city hash: not stable/versioned

This probably deserves a doc... i.e. we should go through the rationale for the choice, possibly with some performance tests.

Note: the simulation can be profiled. Random number generation seems to be a bottleneck much more than hashing.
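To make the md5 candidate concrete, here is a small sketch (Python 3) of deriving Bloom filter bit positions from a single md5 digest. The input layout (big-endian cohort followed by the UTF-8 value) and the one-byte-per-hash scheme are assumptions for illustration, not a decided format.

```python
import hashlib
import struct


def bloom_bit_indices(value, cohort, num_hashes, num_bits):
    """Map (cohort, value) to num_hashes bit positions in [0, num_bits)."""
    data = struct.pack(">L", cohort) + value.encode("utf-8")
    digest = hashlib.md5(data).digest()  # 16 bytes, same on every platform
    if num_hashes > len(digest):
        raise ValueError("md5 yields only %d bytes" % len(digest))
    # One digest byte per hash function. This limits num_bits to 256 and is
    # slightly biased unless num_bits divides 256.
    return [digest[i] % num_bits for i in range(num_hashes)]
```

Because the output depends only on the md5 definition and the byte layout, any language with an md5 implementation should be able to reproduce the same indices.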

no method for coercing this S4 class to a vector

Hi,

I get the following error when I try to run demo.sh. What's wrong here?

Thanks,
Huahang

_____ 1.330 Parsing _tmp/cpp/demo3/case_map.csv
Error in as.vector(data) : 
  no method for coercing this S4 class to a vector
Calls: main ... as.matrix -> as.matrix.default -> array -> as.vector

Publish Shiny apps

We have a couple apps that show off RAPPOR.

They should go in apps/ , and have build/run instructions there.

issue with install ggplot2 on R 3.0.2

Hello all,

Thank you again for your work.

After building R-3.0.2 from source on both a ubuntu-16.04 and a ubuntu-14.4 machine, I still got the error message that ggplot2 could not be installed on this version of R. How may I work around this issue?

Warmly,
Jinzhao

Issues with apps/Readme.md instructions

From mironov:

I am trying to follow instructions in readme.md. First, it seems that the permissions for script (.sh) files are not set to executable. Second, there seem to be some path problems - after I got to http://0.0.0.0:6789, which launches fine and click on "Choose file", the R script fails with "cannot open file 'analysis/R/util.R': No such file or directory"

map cache shouldn't read stale data

If you rewrite the same map file, you will still read the old cached .rda. Right now you have to take care to always use new filenames.

It only tests for the existence of the file, not the file's contents or timestamp or anything like that.
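A timestamp comparison is one possible fix. The sketch below is in Python for brevity, although the cache itself lives in the R code, and the function name is hypothetical.

```python
import os


def cache_is_fresh(source_path, cache_path):
    """Return True only if the cache exists and is at least as new as its source."""
    if not os.path.exists(cache_path):
        return False
    return os.path.getmtime(cache_path) >= os.path.getmtime(source_path)
```

Comparing a content hash of the map file instead of its modification time would also catch files that were rewritten with an older timestamp, at the cost of reading the file on every check.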

Add docs about randomness

Privacy depends on the unpredictability of random numbers, but the random numbers should be chosen by the app, since it's platform-dependent. We should have some docs about this.

It depends on the language -- e.g. Java and JavaScript have their own random APIs.
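In Python, for instance, the choice could be made by passing the random source into the encoder rather than hard-coding it. This is a sketch with a hypothetical helper name, not the rappor.py interface.

```python
import random


def biased_bits(num_bits, prob_one, rng):
    """Return num_bits bits, each 1 with probability prob_one, drawn from rng."""
    return [1 if rng.random() < prob_one else 0 for _ in range(num_bits)]


# Simulation: seeded Mersenne Twister. Reproducible, but predictable, so it
# is unsuitable for real reports.
sim_rng = random.Random(42)

# Real clients: SystemRandom draws from the operating system's entropy
# source (os.urandom) and cannot be seeded.
secure_rng = random.SystemRandom()
```

The same function then serves both cases: biased_bits(16, 0.5, sim_rng) in tests and biased_bits(16, 0.5, secure_rng) in an application.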

Demo App Engine app

We could do a real data collection using RAPPOR with an App Engine app, and open source the code for it.

Implement analysis tools in different languages

Right now the analysis is in a combination of Python and R.

For bigger volumes of data, or for settings where it may be awkward to run R, we can provide implementations of the analysis code in other languages.

Python client limited to k<=32

Attempting k=128 results in:
File "../tests/rappor_sim.py", line 238, in <module>
main(sys.argv)
File "../tests/rappor_sim.py", line 231, in main
params1, params2, irr_rand, opts.assoc_testdata, csv_in, csv_out)
File "../tests/rappor_sim.py", line 121, in GenAssocTestdata
irr1 = string_encoder.encode(v1)
File "/usr/local/google/home/tkaitchuck/rappor2/rappor/client/python/rappor.py", line 335, in encode
_, _, irr = self._internal_encode(word)
File "/usr/local/google/home/tkaitchuck/rappor2/rappor/client/python/rappor.py", line 311, in _internal_encode
prr, irr = self._internal_encode_bits(bloom)
File "/usr/local/google/home/tkaitchuck/rappor2/rappor/client/python/rappor.py", line 261, in _internal_encode_bits
self.secret, to_big_endian(bits), self.params.prob_f,
File "/usr/local/google/home/tkaitchuck/rappor2/rappor/client/python/rappor.py", line 162, in to_big_endian
return struct.pack('>L', i)

trying with k=64 results in:
Traceback (most recent call last):
File "../tests/rappor_sim.py", line 238, in <module>
main(sys.argv)
File "../tests/rappor_sim.py", line 231, in main
params1, params2, irr_rand, opts.assoc_testdata, csv_in, csv_out)
File "../tests/rappor_sim.py", line 121, in GenAssocTestdata
irr1 = string_encoder.encode(v1)
File "/usr/local/google/home/tkaitchuck/rappor2/rappor/client/python/rappor.py", line 335, in encode
_, _, irr = self._internal_encode(word)
File "/usr/local/google/home/tkaitchuck/rappor2/rappor/client/python/rappor.py", line 311, in _internal_encode
prr, irr = self._internal_encode_bits(bloom)
File "/usr/local/google/home/tkaitchuck/rappor2/rappor/client/python/rappor.py", line 262, in _internal_encode_bits
self.params.num_bloombits)
File "/usr/local/google/home/tkaitchuck/rappor2/rappor/client/python/rappor.py", line 201, in get_prr_masks
raise RuntimeError('%d bits is more than the max of %d', num_bits, len(d))
NameError: global name 'd' is not defined
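The first traceback ends in struct.pack('>L', i), and the L format holds only a 32-bit unsigned integer, which would explain a limit of k <= 32. A possible width-independent replacement is sketched below for Python 3 (int.to_bytes is not available in Python 2.7). The extra num_bits parameter and the check_num_bits helper are assumptions, not the current rappor.py signatures.

```python
def to_big_endian(value, num_bits):
    """Serialize a non-negative integer as big-endian bytes of any width."""
    if value < 0 or value >= (1 << num_bits):
        raise ValueError("%d does not fit in %d bits" % (value, num_bits))
    return value.to_bytes((num_bits + 7) // 8, "big")


def check_num_bits(num_bits, max_bits):
    """Raise a readable error; the limit is passed in rather than read
    from an undefined name, and the message is actually formatted."""
    if num_bits > max_bits:
        raise RuntimeError(
            "%d bits is more than the max of %d" % (num_bits, max_bits))
```

For 32 bits the output matches struct.pack('>L', i), so existing callers would see the same bytes.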

fast_em.cc optimization potential

Instead of serializing the m x n x N cond_prob matrix to C++, we can serialize the m and n dimension separately (without the outer product).

We could either do the outer product up front in C++, or we could do it "lazily" on every EM step. This is more computation, but could actually speed things up because we would save a lot in memory bandwidth.
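The "lazy" variant can be illustrated with a pure-Python EM step that receives the two per-report factors separately and multiplies them on the fly. This is a sketch of the idea only, not a port of fast_em.cc. It assumes that each report's m x n conditional probability slice is the outer product of a length-m vector and a length-n vector, and the names a, b and pij are chosen for illustration.

```python
def em_step(pij, a, b):
    """One EM update of the m x n joint distribution pij.

    a[i][j] is P(report i | first variable = j) and b[i][k] is
    P(report i | second variable = k). The N x m x n tensor is never stored.
    """
    m, n = len(pij), len(pij[0])
    total = [[0.0] * n for _ in range(m)]
    for a_i, b_i in zip(a, b):
        # Form this report's outer product only when it is needed.
        weights = [[pij[j][k] * a_i[j] * b_i[k] for k in range(n)]
                   for j in range(m)]
        z = sum(sum(row) for row in weights)  # assumed to be > 0
        for j in range(m):
            for k in range(n):
                total[j][k] += weights[j][k] / z
    num_reports = float(len(a))
    return [[v / num_reports for v in row] for row in total]
```

The inputs take N x (m + n) values instead of N x m x n, which is where the memory bandwidth saving would come from.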

Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x), : 'data' must be of a vector type, was 'NULL'

I am attempting to follow the "Analyzing Branches in Single-Cell Trajectories" tutorial using a Seurat 3 object imported into monocle 2. I made it through the entire tutorial successfully until I tried to analyze the branches in single-cell trajectories using the code below. When I run plot_genes_branched_heatmap I get the following error. How do I fix this?

Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x), :
'data' must be of a vector type, was 'NULL'

data <- as(as.matrix(res@assays$RNA@data), 'sparseMatrix')
pd <- new('AnnotatedDataFrame', data = res@meta.data)
fData <- data.frame(gene_short_name = row.names(data), row.names = row.names(data))
fd <- new('AnnotatedDataFrame', data = fData)
cds <- newCellDataSet(data, phenoData = pd, featureData = fd, lowerDetectionLimit = 0.5, expressionFamily = negbinomial.size())
cds <- estimateSizeFactors(cds)
cds <- estimateDispersions(cds)
Removing 410 outliers
There were 50 or more warnings (use warnings() to see the first 50)
cds <- detectGenes(cds, min_expr = 0.1)
print(head(fData(cds)))
gene_short_name use_for_ordering num_cells_expressed
Mrpl15 Mrpl15 FALSE 149
Lypla1 Lypla1 FALSE 41
Tcea1 Tcea1 FALSE 119
Atp6v1h Atp6v1h FALSE 84
Rb1cc1 Rb1cc1 FALSE 65
4732440D04Rik 4732440D04Rik FALSE 33
expressed_genes <- row.names(subset(fData(cds), num_cells_expressed >= 10))
disp_table <- dispersionTable(cds)
unsup_clustering_genes <- subset(disp_table, mean_expression >= 0.1)
cds <- setOrderingFilter(cds, unsup_clustering_genes$gene_id)
plot_ordering_genes(cds)
Warning messages:
1: Transformation introduced infinite values in continuous y-axis
2: Transformation introduced infinite values in continuous y-axis
cds <- reduceDimension(cds, max_components = 2, num_dim = 6, reduction_method = 'tSNE', verbose = T)
Remove noise by PCA ...
Reduce dimension by tSNE ...

marker_genes <- row.names(subset(fData(cds),
                                 gene_short_name %in% c("Cd63", "C1qa", "Ccr2",
                                                        "Apoe", "Sepp1", "Pf4",
                                                        "Napsa", "Clec12a", "Fos",
                                                        "Junb", "Dusp1")))

diff_test_res <- differentialGeneTest(cds[marker_genes,])
sig_genes <- subset(diff_test_res, qval < 0.1)
sig_genes[,c("gene_short_name", "pval", "qval")]
gene_short_name pval qval
C1qa C1qa 1.013719e-12 3.716969e-12
Pf4 Pf4 5.704000e-05 6.971555e-05
Clec12a Clec12a 9.993933e-12 2.748332e-11
Apoe Apoe 2.208108e-29 2.428918e-28
Napsa Napsa 1.403176e-06 2.204990e-06
Cd63 Cd63 2.004831e-04 2.004831e-04
Junb Junb 1.131497e-05 1.555808e-05
Ccr2 Ccr2 3.641001e-07 6.675169e-07
Fos Fos 4.840100e-15 2.662055e-14
Sepp1 Sepp1 4.030497e-09 8.867094e-09
Dusp1 Dusp1 7.190949e-05 7.910044e-05
MYOG_ID1 <- cds[row.names(subset(fData(cds), gene_short_name %in% c("C1qa", "Apoe"))),]
plot_genes_jitter(MYOG_ID1, grouping = "seurat_clusters", ncol= 2)
to_be_tested <- row.names(subset(fData(cds), gene_short_name %in% c("Apoe", "Fos", "C1qa")))
cds_subset <- cds[to_be_tested,]
diff_test_res <- differentialGeneTest(cds_subset)
diff_test_res[,c("gene_short_name", "pval", "qval")]
gene_short_name pval qval
C1qa C1qa 1.013719e-12 1.013719e-12
Apoe Apoe 2.208108e-29 6.624323e-29
Fos Fos 4.840100e-15 7.260150e-15
plot_genes_jitter(cds_subset,
                  grouping = "seurat_clusters",
                  color_by = "seurat_clusters",
                  nrow = 1,
                  ncol = NULL,
                  plot_trend = TRUE)

Warning messages:
1: Computation failed in stat_summary():
Hmisc package required for this function
2: Computation failed in stat_summary():
Hmisc package required for this function
3: Computation failed in stat_summary():
Hmisc package required for this function
4: Computation failed in stat_summary():
Hmisc package required for this function
5: Computation failed in stat_summary():
Hmisc package required for this function
6: Computation failed in stat_summary():
Hmisc package required for this function

to_be_tested <- row.names(subset(fData(cds), gene_short_name %in% c("Apoe", "C1qa", "Fos")))
cds_subset <- cds[to_be_tested,]
diff_test_res <- differentialGeneTest(cds_subset, fullModelFormulaStr = "~sm.ns(Pseudotime)")
diff_test_res[,c("gene_short_name", "pval", "qval")]
gene_short_name pval qval
C1qa C1qa 2.145791e-08 2.145791e-08
Apoe Apoe 5.811938e-23 1.743581e-22
Fos Fos 2.683403e-16 4.025104e-16
plot_genes_in_pseudotime(cds_subset, color_by = "seurat_clusters")
diff_test_res <- differentialGeneTest(cds[marker_genes,], fullModelFormulaStr = "~sm.ns(Pseudotime)")
sig_gene_names <- row.names(subset(diff_test_res, qval < 0.1))
plot_pseudotime_heatmap(cds[sig_gene_names,],
                        num_clusters = 3,
                        cores = 1,
                        show_rownames = T)

plot_cell_trajectory(cds, color_by = "Pseudotime")
plot_cell_trajectory(cds, color_by = "seurat_clusters")
BEAM_res <- BEAM(cds, branch_point = 2, cores = 1)
Warning messages:
1: In if (progenitor_method == "duplicate") { :
the condition has length > 1 and only the first element will be used
2: In if (progenitor_method == "sequential_split") { :
the condition has length > 1 and only the first element will be used
BEAM_res <- BEAM_res[order(BEAM_res$qval),]
BEAM_res <- BEAM_res[,c("gene_short_name", "pval", "qval")]
plot_genes_branched_heatmap(cds[row.names(subset(BEAM_res,
                                                 qval < 1e-4)),],
                            branch_point = 2,
                            num_clusters = 4,
                            cores = 1,
                            use_gene_short_name = T,
                            show_rownames = T)

Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x), :
'data' must be of a vector type, was 'NULL'

ConstrainedLinModel can't solve least squares

This is using R v 3.3.3

I'm attempting to run the demo (or any example test), but it consistently fails during ConstrainedLinModel. Here's an example using regtest.sh:

Output:
./regtest.sh run-seq unif-small-typical
removed '_tmp/cpp/results.html'
removed '_tmp/cpp/test-instances.txt'
removed '_tmp/cpp/r-unif-small-typical/case_true_map.csv'
removed '_tmp/cpp/r-unif-small-typical/spec.txt'
removed '_tmp/cpp/r-unif-small-typical/case_params.csv'
removed '_tmp/cpp/r-unif-small-typical/case_unique_values.txt'
removed '_tmp/cpp/r-unif-small-typical/case_candidates.txt'
removed '_tmp/cpp/r-unif-small-typical/1/case_counts.csv'
removed '_tmp/cpp/r-unif-small-typical/1/case_true_values.csv'
removed '_tmp/cpp/r-unif-small-typical/1/rappor_sim.log'
removed '_tmp/cpp/r-unif-small-typical/1/case_reports.csv'
removed directory '_tmp/cpp/r-unif-small-typical/1'
removed '_tmp/cpp/r-unif-small-typical/case_map.csv'
removed directory '_tmp/cpp/r-unif-small-typical/1_report'
removed directory '_tmp/cpp/r-unif-small-typical'
removed '_tmp/cpp/test-cases.txt'
removed '_tmp/cpp/rows.html'
removed directory '_tmp/cpp'
mkdir: created directory '_tmp/cpp'

----- Setting up parameters and candidate files for r-unif-small-typical

mkdir: created directory '_tmp/cpp/r-unif-small-typical'
Done generating parameters for all test cases
mkdir: created directory '_tmp/cpp/r-unif-small-typical/1'

----- Generating reports (gen_reports.R)

----- Running RAPPOR C++ client (see rappor_sim.log for errors)

real 0m7.389s
user 0m5.953s
sys 0m1.435s

----- Summing RAPPOR IRR bits to get 'counts'

mkdir: created directory '_tmp/cpp/r-unif-small-typical/1_report'
Using png
Warning message:
In library(Cairo, quietly = TRUE, logical.return = TRUE) :
there is no package called ‘Cairo’
Loading required package: foreach
Loaded glmnet 2.0-16

Attaching package: ‘limSolve’

The following object is masked from ‘package:ggplot2’:

resolution

_____ 0.851 Parsing _tmp/cpp/r-unif-small-typical/case_map.csv
Error in (function (A = NULL, B = NULL, E = NULL, F = NULL, G = NULL, :
cannot solve least squares problem - A and B not compatible
Calls: main ... FitDistribution -> ConstrainedLinModel -> do.call ->
Timing stopped at: 0.02 0.004 0.023
Execution halted
Running compare_dist.R took 0.981 seconds
Some test cases failed
Done running all test instances
Instances succeeded: 0 failed: 0 running: 0 total: 1
Wrote _tmp/cpp/results.html
URL: file:///usr/local/google/home/csharrison/rappor/_tmp/cpp/results.html

real 0m16.188s
user 0m14.454s
sys 0m1.750s

dashboard UI enhancements

  • union of top 5 per day, over all time
  • other: sum of rank 11+
  • total unallocated mass time series - DONE
  • error bars - DONE
  • can dygraphs sort labels by mass?

Add API for reporting values from a fixed set

Currently we are using bloom filters for reporting arbitrary strings. As mentioned in the paper, you can also pre-assign choices to individual bits.

Possible interface:

EnumReporter()
StringReporter()

etc.
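As a rough illustration of the fixed-set case, an EnumReporter could map each allowed choice to its own bit and skip the Bloom filter hashing entirely. Only the class name comes from the proposal above; the constructor and method below are assumptions.

```python
class EnumReporter(object):
    """Hypothetical reporter for values drawn from a fixed set of choices."""

    def __init__(self, choices):
        # Pre-assign each choice to one bit position, in the given order.
        self.index = dict((choice, i) for i, choice in enumerate(choices))

    def true_bits(self, value):
        """Return the un-randomized bit vector: exactly one bit set.

        Raises KeyError for a value outside the fixed set.
        """
        bits = [0] * len(self.index)
        bits[self.index[value]] = 1
        return bits
```

The resulting bit vector would then go through the same randomization steps as a Bloom filter, and a StringReporter would differ only in how the true bits are produced.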

Test/demo of longitudinal privacy

Comparing regular RAPPOR vs. one-time RAPPOR, we can show an attacker guessing the values for a particular user/client, and what fraction of time the guess is correct.
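Such a demo might look like the following single-bit simulation, offered as a sketch under simplifying assumptions: one bit per client, a uniform prior on the true bit, and an attacker who thresholds the mean of a client's reports at (p + q) / 2. All function names are hypothetical.

```python
import random


def randomize_prr(bit, f, rng):
    """Permanent randomized response, drawn once per client and value."""
    if rng.random() < f:
        return 1 if rng.random() < 0.5 else 0
    return bit


def randomize_irr(prr_bit, p, q, rng):
    """Instantaneous randomized response, redrawn for every report."""
    return 1 if rng.random() < (q if prr_bit else p) else 0


def attacker_accuracy(num_reports, f, p, q, num_clients, rng):
    """Regular RAPPOR: fraction of clients whose true bit is guessed right
    after the attacker has seen num_reports reports from each."""
    correct = 0
    for _ in range(num_clients):
        true_bit = rng.randrange(2)
        prr_bit = randomize_prr(true_bit, f, rng)  # memoized for this client
        ones = sum(randomize_irr(prr_bit, p, q, rng)
                   for _ in range(num_reports))
        guess = 1 if ones / float(num_reports) > (p + q) / 2.0 else 0
        correct += (guess == true_bit)
    return correct / float(num_clients)


def one_time_accuracy(f, num_clients, rng):
    """One-time RAPPOR: the PRR bit is reported directly, exactly once."""
    correct = 0
    for _ in range(num_clients):
        true_bit = rng.randrange(2)
        correct += (randomize_prr(true_bit, f, rng) == true_bit)
    return correct / float(num_clients)
```

Because the permanent response is memoized, extra reports only help the attacker recover the PRR bit, so under these assumptions the accuracy should rise with num_reports and level off near 1 - f/2, the same level one-time RAPPOR gives from a single report.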
