Git Product home page Git Product logo

alibaba-dp-workload's Introduction

Alibaba-DP Workload

This repository contains Alibaba-DP, a differential privacy (DP) workload based on the 2020 Alibaba GPU Trace. The original trace contains resource demands and allocation for tasks running on a production cluster. We use these metrics as proxies for task privacy resource demands, which do not exist in the original trace.

The Alibaba-DP workload can be used as a macrobenchmark to evaluate scheduling algorithms for differential privacy, such as DPF, from PrivateKube / Privacy Budget Scheduling (OSDI'21) or DPK, from Packing Privacy Budget Efficiently. Since there are no publicly available real-world traces of DP workloads, previous privacy scheduling algorithms were evaluated on purely synthetic microbenchmarks. The Alibaba-DP trace is at least partially backed by real-world workload patterns.

We hope this trace can help other DP researchers evaluate their own algorithms. Our work can be cited as:

@misc{tholoniat2022packing,
      title={Packing Privacy Budget Efficiently}, 
      author={Pierre Tholoniat and Kelly Kostopoulou and Mosharaf Chowdhury and Asaf Cidon and Roxana Geambasu and Mathias Lécuyer and Junfeng Yang},
      year={2022},
      eprint={2212.13228},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
}

Methodology

Our methodology is detailed in Packing Privacy Budget Efficiently:

We use machine type (CPU/GPU) as a proxy for DP mechanism type. We assume CPU-based tasks correspond to mechanisms used for statistics, analytics, or lightweight ML (e.g. XGBoost or decision trees), while GPU-based tasks correspond to deep learning mechanisms (DP-SGD or DP-FTRL). We map each CPU-based task to one of the {Laplace, Gaussian, Subsampled Laplace} curves and each GPU-based task to one of the {composition of Subsampled Gaussians, composition of Gaussians} curves, at random. We use memory usage as a proxy for privacy usage by setting traditional DP ε as an affine transformation of memory usage (in GB hours). We don’t claim that memory will be correlated with privacy in a realistic DP workload, but that the privacy budget might follow a similar distribution (e.g. a power law with many tasks having small requests and a long tail of tasks with large requests). We compute the number of blocks required by each task as an affine function of the bytes read through the network. Unlike the privacy budget proxy, we expect this proxy to have at least some degree of realism when data is stored remotely: tasks that don’t communicate much over the network are probably not using large portions of the dataset. Finally, all tasks request the most recent blocks that arrived in the system and are assigned a weight of 1. We truncate the workload by sampling one month of the total trace and cutting off tasks that request more than 100 blocks or whose smallest normalized RDP ε is not in [0.001, 1]. The resulting workload, called Alibaba-DP, is an objectively derived version of the Alibaba trace.

The resulting trace is available as a .csv file here: sample_output/privacy_tasks_30_days.csv. The DP budget is expressed using Rényi DP, with normalized demands added for convenience.

Generating your own trace

We invite researchers to produce their own trace by modifying our default constants or using different proxies for the privacy resource in the generation script.

  1. Clone this repository and cd into it:
git clone https://github.com/columbia/alibaba-dp-workload.git
cd alibaba-dp-workload
  1. Create a new virtual environment with the dependencies, e.g. with Poetry:
# Install Poetry if necessary
curl -sSL https://install.python-poetry.org | python3 -

# Activate the environment and install the dependencies
poetry shell 
poetry install
  1. Download the original GPU trace from Alibaba:
bash download_alibaba_data.sh
  1. Generate a workload with python alibaba_privacy_workload/generate.py.

It takes about 15-45 mins on 32 cores. The output is a DP workload stored in outputs/privacy_tasks_{cfg.n_days}_days.csv.

Example usage:

python alibaba_privacy_workload/generate.py max_number_of_tasks=1000

python alibaba_privacy_workload/generate.py profits.default=0.5 epsilon_min=0.1

python alibaba_privacy_workload/generate.py --help

alibaba-dp-workload's People

Contributors

tholop avatar

Stargazers

 avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

Forkers

xlcbingo1999

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.