
disco's Introduction

DISCO - DIStributed COllaborative Machine Learning

DISCO leverages federated 🌟 and decentralized ✨ learning to allow several data owners to collaboratively build machine learning models without sharing any original data.

The latest version is always running at the following link, directly in your browser, for web and mobile:

🕺 https://epfml.github.io/disco/ 🕺


🪄 DEVELOPERS: Have a look at our developer guide


❓ WHY DISCO?

  • To build deep learning models across private datasets without compromising data privacy, ownership, sovereignty, or model performance
  • To create an easy-to-use platform that allows non-specialists to participate in collaborative learning

βš™οΈ HOW DISCO WORKS

  • DISCO follows a public model – private data approach
  • Private and secure model updates – not data – are communicated to either:
    • a central server: federated learning ( 🌟 )
    • directly between users: decentralized learning ( ✨ ), i.e. no central coordination
  • Model updates are then securely aggregated into a trained model (see the sketch below)
  • See more HERE
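To make the aggregation step concrete, here is a minimal sketch in TypeScript with hypothetical names (an illustration, not DISCO's actual API): each participant trains locally, and only the resulting weight updates are averaged.

```ts
// Hypothetical sketch: federated averaging of weight updates (not DISCO's actual API).
type Weights = Float32Array[]; // one Float32Array per model layer

// Average the updates received from all participants, layer by layer.
function aggregate(updates: Weights[]): Weights {
  return updates[0].map((layer, i) => {
    const avg = new Float32Array(layer.length);
    for (const update of updates) {
      for (let j = 0; j < layer.length; j++) {
        avg[j] += update[i][j] / updates.length;
      }
    }
    return avg;
  });
}
```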

❓ DISCO TECHNOLOGY

  • DISCO supports arbitrary deep learning tasks and model architectures via TF.js
  • ✨ relies on peer-to-peer communication
  • Have a look at how DISCO ensures privacy and confidentiality HERE

🧪 RESEARCH-BASED DESIGN

DISCO aims to enable open-access, easy-to-use distributed training which is

  • 🌪️ efficient (R1, R2)
  • 🔒 privacy-preserving (R3, R4)
  • 🛠️ fault-tolerant and dynamic over time (R5)
  • 🥷 robust to malicious actors and data poisoning (R6, R7)
  • 🍎 🍌 interpretable in imperfectly interoperable data distributions (R8)
  • 🪞 personalizable (R9)
  • 🥕 fair in how it incentivizes participation

🏁 HOW TO USE DISCO

  • Start by exploring our example DISCOllaboratives on the Tasks page.
  • The example models are based on popular datasets such as Titanic, MNIST, or CIFAR-10.
  • It is also possible to create your own task without coding on the custom training page:
    • Upload the initial model
    • Choose from several existing data loaders
    • Choose between federated and decentralized as your DISCO training scheme... connect your data and... done! 📊
    • For more details on ML tasks and custom training, have a look at this guide

Note: currently only CSV and image data types are supported. Adding new data types, preprocessing code, or data loaders is possible in developer mode (see the developer guide).


JOIN US

  • You are welcome to join us on Slack

disco's People

Contributors

annie-light, aunell, batuhanfaik, dependabot[bot], dgengler6, eduarddurech, francesco98, giordano-lucas, giorgiosav, gozgun, grim-bot, ineiti, jiafanliu, julienvig, laurislopata, lippoldt, lucastrg, marceltorne, martinjaggi, mmilenkoski, morganridel, nacho114, paulmansat, peacefulotter, s314cy, saipraneet, tharvik, walidabn


disco's Issues

Bug: LUS-Covid task Model Storage

When I train a model on the LUS-Covid task (everything works properly, reaching a training accuracy of around 88% and a validation accuracy of around 75%), then leave the application and train the model again, I get a validation accuracy of 30% and a training accuracy of 50%.

I suspect there must be some problem with the creation of new models, since the way I managed to fix it was to delete the models from storage, refresh the page, and restart the application workflow from the task list.

Maybe the lus-covid task is not properly linked with the memory managers that were implemented afterwards.

Fix: Port numbers are hardcoded

At the moment, port numbers are hardcoded in the communication_manager file. Inside this file we should always use the portNbr attribute passed to the constructor, and each task should then pass its corresponding port number to the communication_manager.

build a decentralized age detector from webcam images

As an example use case, let's build a decentralized model for browser-based images.

We can start from a stand-alone version based on
https://github.com/justadudewhohacks/face-api.js/ (uses TF.js)
see for example
https://www.codeproject.com/Articles/5276827/AI-Age-Estimation-in-the-Browser-using-face-api-an

On top of it, we only need to extract gradients to incorporate it into collaborative DeAI training. This will also be a good use case to refine our task descriptions, image preprocessing pipelines, and UI.

Implement RelaySGD

We are currently using an all-reduce scheme for model averaging. We should add the option to use RelaySGD. I will start working on this.

allow local data upload (for .csv) on client

After the client has joined a task (see issue #27), allow the client to upload (via an HTML form) a local training dataset to the browser's local storage.

  • for .csv data (start e.g. with a standard public dataset, such as Titanic or Adult)
  • add checks that the uploaded data is compatible with the required feature format from the task description (see the sketch below)
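One possible shape for such a check, as a sketch (names are illustrative; `expectedColumns` would come from the task description):

```ts
// Sketch: validate an uploaded CSV header against the task's expected features.
function findMissingColumns(csvText: string, expectedColumns: string[]): string[] {
  const header = csvText.split('\n')[0].trim().split(',').map(c => c.trim());
  return expectedColumns.filter(col => !header.includes(col));
}

// Usage: reject the upload with a helpful message if any column is missing.
// const missing = findMissingColumns(fileContents, task.features);
// if (missing.length > 0) { /* show "Missing columns: ..." to the user */ }
```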

Delete unused components

With the modularisation changes, some components are never used. Once every PR is merged to the master branch, delete these unwanted files (i.e. ImageUploadFrame, GlobalTaskFrame, CSVUploadFrame, ...).

switch from localstorage to indexedDB

Switch from local storage to IndexedDB. This will allow a greater memory space for models.
Change the serialization of the model: TensorFlow.js uses a different file layout when a model is saved to local storage versus IndexedDB.
IndexedDB should be served by one separate script. A minimal sketch is shown below.
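TF.js already understands an indexeddb:// URL scheme in model.save and tf.loadLayersModel, so the storage switch itself can be as small as changing the URL prefix; a minimal sketch:

```ts
import * as tf from '@tensorflow/tfjs';

// Save to IndexedDB (larger quota) instead of 'localstorage://<name>'.
async function saveAndReload(model: tf.LayersModel, name: string): Promise<tf.LayersModel> {
  await model.save(`indexeddb://${name}`);
  return tf.loadLayersModel(`indexeddb://${name}`);
}
```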

generic data processing capabilities I

Generalize and improve the current data processing pipeline, with the goals of allowing easy use of data within projects, facilitating technical debugging, making the data science background and challenges easier to understand, and separating those challenges from one another.

In a minimalistic setting, this includes:

  1. Catching data-related errors as soon as possible in the MLOps pipeline
  2. Providing functions that connect the tasks with the interface to include/exclude data types from loading
  3. Checking the data loading success status

In addition, the following features would be nice to have (now or later):

  4. Ability to get a data-science-style diagnosis of the data in the interface (label distribution, before/after training evaluation)
  5. Capabilities to handle advanced bugs relating to data loading, i.e. handle corrupt/invalid data in an optimal way
  6. Minimize RAM/VRAM usage and possibly provide an overview in the UI

As a follow-up, this would require the current tasks/projects to be re-tested and verified, possibly with changes to the testing functions currently set up separately. Some tasks might be better connected to the UI, and UI changes would improve the overall usability (possibly in a separate issue).

If the data loaders work similarly to the Python version, a lot of these tasks can be handled naturally, but some subtasks would require follow-ups.

allow local data upload for image data

Like #28, but for images.

  • for MNIST or other image data (images with labels; maybe one folder per label for a simpler user experience?)
  • check whether local storage or standard files in HTML/JS work best for this
  • add checks that the uploaded data is compatible with the required feature format from the task description (see the sketch below)
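A sketch of how the browser side could turn an uploaded file into a training tensor, assuming images are resized and normalized to the task's input shape (names are illustrative):

```ts
import * as tf from '@tensorflow/tfjs';

// Sketch: convert an uploaded image file into a normalized tensor.
async function fileToTensor(file: File, size: [number, number]): Promise<tf.Tensor3D> {
  const bitmap = await createImageBitmap(file); // standard browser API
  return tf.tidy(() => {
    const pixels = tf.browser.fromPixels(bitmap);          // HxWx3, values in [0, 255]
    const resized = tf.image.resizeBilinear(pixels, size); // match the model's input shape
    return resized.div(255) as tf.Tensor3D;                // normalize to [0, 1]
  });
}
```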

evaluate WebRTC / javascript as backend network stack

As suggested by @Saipraneet, as an alternative to stand-alone application frameworks such as libp2p (see #3), we should evaluate browser-based communication frameworks. This could be advantageous in that it might be easier to get it to run on desktops and mobiles.

WebRTC is supported in modern browsers. On top of it there are several JavaScript libraries, e.g. peerJS, and in particular simple-peer looks very suitable (a connection sketch is shown below).

Could this be suitable for communicating gradients as in ML training, supporting desktop or mobile phone OSs (e.g. CoreML?), and for interfacing with e.g. PyTorch or similar schemes?

If so, we could try something like this to build a p2p communication prototype (#5)? What do people think?
Seems we need some help here...
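For reference, exchanging a payload over simple-peer looks roughly like this (a sketch; the signaling channel that carries the offer/answer is left out, and the weights payload is a placeholder):

```ts
import Peer from 'simple-peer';

// One side initiates; both sides exchange signaling data out of band,
// e.g. via a small helper server.
const peer = new Peer({ initiator: true, trickle: false });

peer.on('signal', data => {
  // send `data` to the remote peer over the signaling channel
});
// peer.signal(remoteData); // feed the remote peer's signaling data back in

peer.on('connect', () => {
  // placeholder payload standing in for serialized gradients/weights
  peer.send(JSON.stringify({ type: 'weights', payload: [0.1, 0.2] }));
});

peer.on('data', data => {
  const message = JSON.parse(data.toString()); // simple-peer delivers a Buffer
  // aggregate the received weights here
});
```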

simulation code - step3, decentralized SGD as a reference implementation, asynchronous / time-varying graphs

Depends on #1. Also related to #2.

Modify the reference code (simulated decentralized) for a given communication graph, on a standard/toy dataset.

Incorporate a basic asynchronous model, i.e. allow node and edge failures in SGD; in other words, a few variants of time-varying graphs. This can also be used to simulate some realistic notions of fault tolerance.

This will be used later to compare the p2p version against, and to test different algorithm variants before implementing them in the real p2p framework.

WebGL context lost error

I encounter an issue with WebGL when trying to train an image model (lus or mnist) on a large dataset.

Full error: (screenshot attached: lus-deai-error)

Host each task as a separate Google App Engine service

Now that we use Google App Engine (GAE) to host our app, we can no longer use different ports for different tasks, since we access the server through a domain. Instead, we can host each task as a separate service within our app. In this way, each task can have its own domain. The potential benefit is that we do not have to interrupt existing tasks to add a new task. I haven't tried this yet. Let me know what you think and whether you have any alternative ideas.

Ping: @martinjaggi, @tvogels

GAE Reference:
https://cloud.google.com/appengine/docs/standard/nodejs/an-overview-of-app-engine

allow local test sets

Given a task description (see #26), combined with a local dataset which was uploaded locally (see e.g. #28):

  • allow the user to define a local test set (button in the UI?);
    for example, just remove 20% from the local train set and mark it as the test set.

  • implement the performance metric (accuracy) of a model on that test set.

  • add the performance metric description to the task description.

  • display the current test accuracy in the UI locally, and update it once a model is received (a split/accuracy sketch is shown below).
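A plain TypeScript sketch of the split and the metric (names are illustrative):

```ts
// Sketch: hold out 20% of the local samples as a test set (Fisher-Yates shuffle).
function trainTestSplit<T>(samples: T[], testFraction = 0.2): { train: T[]; test: T[] } {
  const shuffled = [...samples];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const nTest = Math.floor(shuffled.length * testFraction);
  return { test: shuffled.slice(0, nTest), train: shuffled.slice(nTest) };
}

// Accuracy of predicted class indices against the held-out labels.
function accuracy(predicted: number[], labels: number[]): number {
  const correct = predicted.filter((p, i) => p === labels[i]).length;
  return correct / labels.length;
}
```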

Allow task owner to set a password for joining a task

To allow joint training within a smaller set of trusted participants only (in addition to public tasks):

  • Add an optional password hash to the task description.
  • If present, new users will be asked for the password before being able to join the task (like "joining this task is password restricted").
  • Add this to the documentation of how users can create a new task.

This functionality basically recovers federated learning as a special case. It does not give the same security standard as a full PKI, and it is not a replacement for a mechanism that selects helpful clients (as opposed to Byzantine ones), but it is a start (see the same discussions in the federated learning literature). A hashing sketch is shown below.
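A sketch of the check using the browser's built-in SubtleCrypto, so the task description only ever stores a digest (note: unsalted SHA-256 is a weak password hash, fine only for this kind of soft gating):

```ts
// Sketch: compare the entered password's SHA-256 digest to the stored hash.
async function sha256Hex(password: string): Promise<string> {
  const bytes = new TextEncoder().encode(password);
  const digest = await crypto.subtle.digest('SHA-256', bytes);
  return Array.from(new Uint8Array(digest))
    .map(b => b.toString(16).padStart(2, '0'))
    .join('');
}

async function mayJoin(entered: string, taskPasswordHash?: string): Promise<boolean> {
  if (taskPasswordHash === undefined) return true; // public task
  return (await sha256Hex(entered)) === taskPasswordHash;
}
```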

Byzantine robust decentralized algorithm variants

Support a subset of nodes which behave arbitrarily maliciously (sending arbitrary messages to neighbors instead of true gradients).

Postponed for now, until the honest-workers model is implemented and evaluated and the failure model for nodes is supported.

Model inference

Check the MNIST model inference, i.e. the predictions for test digits.

Test Procedure

  1. Train the MNIST model locally on 10 digits (as attached below): 0-9, reaching 100% training and 100% validation accuracy
  2. Save the model by clicking on the button
  3. Test the model with the digits used during training

The resulting CSV file output for all digits (uploaded in order from smallest to largest digit) is attached to this post.
(Screenshot attached: Screenshot 2021-06-21 at 15 40 33)

Digits used for training and testing (screenshot attached: Screenshot 2021-06-21 at 14 48 39), with the attached image files:
0-32, 1-32, 2-32, 3-32, 4-32, 5-32, 6-32, 7-32, 8-32, 9-filled-32

Setup: tested on a local machine, nvm v16.3.0, npm 7.15.1

Console log
[Log] Start: Processing Uploaded File
[Log] User File Validated. Start parsing.
[Log] Start Training
[Log] _________________________________________________________________
[Log] Layer (type) Output shape Param #
[Log] =================================================================
[Log] conv2d_Conv2D1 (Conv2D) [null,26,26,16] 448
[Log] _________________________________________________________________
[Log] max_pooling2d_MaxPooling2D1 [null,13,13,16] 0
[Log] _________________________________________________________________
[Log] conv2d_Conv2D2 (Conv2D) [null,11,11,32] 4640
[Log] _________________________________________________________________
[Log] max_pooling2d_MaxPooling2D2 [null,5,5,32] 0
[Log] _________________________________________________________________
[Log] conv2d_Conv2D3 (Conv2D) [null,3,3,32] 9248
[Log] _________________________________________________________________
[Log] flatten_Flatten1 (Flatten) [null,288] 0
[Log] _________________________________________________________________
[Log] dense_Dense1 (Dense) [null,64] 18496
[Log] _________________________________________________________________
[Log] dense_Dense2 (Dense) [null,10] 650
[Log] =================================================================
[Log] Total params: 33482
[Log] Trainable params: 33482
[Log] Non-trainable params: 0
[Log] _________________________________________________________________
[Log] Proxy
[Log] EPOCH (1): Train Accuracy: 100.00,
Val Accuracy: 100.00

[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (2): Train Accuracy: 100.00,
Val Accuracy: 100.00

[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (3): Train Accuracy: 100.00,
Val Accuracy: 100.00

[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (4): Train Accuracy: 100.00,
Val Accuracy: 100.00

[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (5): Train Accuracy: 100.00,
Val Accuracy: 100.00

[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (6): Train Accuracy: 100.00,
Val Accuracy: 100.00

[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (7): Train Accuracy: 100.00,
Val Accuracy: 100.00

[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (8): Train Accuracy: 100.00,
Val Accuracy: 100.00

[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (9): Train Accuracy: 100.00,
Val Accuracy: 100.00

[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (10): Train Accuracy: 100.00,
Val Accuracy: 100.00

[Log] loss 0.1054
[Log] mnist-model
[Log] Deactivated
[Log]

[Log] Loading model...
[Log] Model loaded.
[Log] Prediction Sucessful!
[Log] undefined
[Log] Object

simulation code - step2, decentralized SGD as a reference implementation, static graph

Depends on #6.

Get an easy-to-use reference code for training (simulated decentralized, so just running locally without any p2p backend) on any given communication graph, on a standard/toy dataset. This will be useful later to compare the p2p version against.

For simplicity we'll first assume all nodes perform one step of SGD (gradient and communication) per clock step, and that the underlying communication graph remains fixed (the code should allow giving an arbitrary graph as input). A simulation sketch is shown below.
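A minimal simulation sketch of one clock step (plain TypeScript rather than the eventual framework; W is a row-stochastic mixing matrix derived from the fixed communication graph, e.g. uniform weights over each node's neighbors):

```ts
// Sketch: one synchronous round of simulated decentralized SGD on a fixed graph.
// models[i] is node i's parameter vector; W[i][j] > 0 only if j is a neighbor of i
// (or j === i), with each row of W summing to 1.
function decentralizedStep(
  models: number[][],
  W: number[][],
  gradient: (params: number[], node: number) => number[], // node's local gradient
  lr: number
): number[][] {
  // local SGD step on each node's own data
  const stepped = models.map((params, i) => {
    const g = gradient(params, i);
    return params.map((v, k) => v - lr * g[k]);
  });
  // gossip averaging with the neighbors given by W
  return stepped.map((_, i) =>
    stepped[0].map((_, k) => stepped.reduce((acc, xj, j) => acc + W[i][j] * xj[k], 0))
  );
}
```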

offer task descriptions from server (and design task description format)

The auxiliary server needs to be expanded to host task descriptions. Each task description should contain:

  • title of the task
  • description string (maybe in md format)
  • features description (a list for tabular data, or e.g. an image format description) and label description; best to include an example datapoint
  • DL model architecture (TF.js model.save)
  • training hyperparameters (learning rate, dropout, etc.)
  • initial model weights (can also be used for sharing model snapshots later, or when new nodes join)

To share it with the client, we could also organize the task descriptions via JSON, for example, for the machine-readable variant (a possible shape is sketched below).

Once it works, convert the dummy CSV task (Adult dataset) and the MNIST (image) task to this format and make sure they still work.

Human-readable format: the task description should also be easy to visualize in human-readable form, in HTML (so it can be served by the server as well as locally on the client).
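As a starting point for the machine-readable variant, a hypothetical TypeScript shape that serializes directly to JSON (field names are illustrative, not a fixed spec):

```ts
// Hypothetical task description shape, serializable as JSON.
interface TaskDescription {
  title: string;
  description: string;          // human-readable, e.g. in md format
  features: string[] | string;  // column list for tabular data, or an image format description
  labels: string[];
  exampleDatapoint?: unknown;   // helps users check their data format
  modelArchitectureUrl: string; // model.json produced by TF.js model.save()
  hyperparameters: {
    learningRate: number;
    epochs: number;
    batchSize: number;
    dropout?: number;
  };
  initialWeightsUrl?: string;   // reusable for model snapshots and late joiners
}
```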

.csv column reassignment UI

Small simplification suggestion (need only id ← edit field).

Should look like this (or even on the same line):
(screenshot attached: Screen Shot 2021-06-22 at 00 23 03)

Current look:
(screenshot attached: Screen Shot 2021-06-22 at 00 20 00)

Add changes requested for modularisation

  • rename the model memory to "My Model Library"
  • change "titanic model information" to "titanic task information"
  • add the Node version to the readme
  • update the naming convention (snake case)
  • fix folder names (no spaces, respect the snake case convention)
  • remove the peers server from the readme (or add it later as optional, or test with gcloud; Martin M has set up credentials already)
  • update the testing component for images (take the MNIST testing component and modularise it)

simulation code - step1, data distribution

Provide a simulated decentralized code (not using any p2p backend, but instead just running locally), which holds a communication graph and distributes a standard/toy ML dataset among the nodes.

Data distribution should support both random and heterogeneous / non-iid splits (for example, different labels for each node).

We can use standard PyTorch code examples, e.g. MNIST or CIFAR.

Image data type in task file

As our repository currently loads JS images into tensors, we do not have strict requirements on explicit image data types.
Remove the requirements from the task file.
Possibly also check special image formats and test them (transparency in images, SVG).

Testing component for CSV files

We now have the training component for CSV files, but no testing component for them. Introduce a testing component for CSV files.
The idea is simple: the testing process can be viewed as a function that takes as input a standard CSV file for the task but without the label column. It then returns the same CSV file, but with the labels filled in.

cifar / imagenet

CIFAR-10 or CIFAR-100, or even go to ImageNet directly :)

In the task description we can maybe include a link for how people can download an (arbitrary) part of the official dataset from somewhere. It is a bit unclear what format would be best; it is probably too large to 'upload' as individual images in the UI.

ImageNet could work too if lots of people joined and everyone held only a very small part of the data.

Support (decentralized) Normalization for Tabular Datasets

For tabular datasets (popular examples: Adult income and Titanic), normalization is critical for neural network approaches.

The most typical, and very effective, way to normalize is to "subtract the mean and divide by the standard deviation". However, computing these statistics in a decentralized fashion is non-trivial. For DeAI to support this, additional functionality needs to be implemented.

Examples of how this can be addressed:

  • Provide means and standard deviations for all features based on some a priori knowledge. Each participant is then asked to normalize their data according to this standard before uploading.
  • Learn means and standard deviations as a pre-learning task, whose result is then automatically applied to each local dataset. This could be a full DeAI training cycle, or a simple weighted average which is communicated democratically (a sketch of the aggregation is shown below).
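For the second option, the statistics can be aggregated from per-client sums without sharing any raw data; a sketch for a single feature (the aggregate is a plain sum, so it also composes with secure aggregation):

```ts
// Sketch: each client reports only (sum, sum of squares, count) for a feature.
interface FeatureStats { sum: number; sumSq: number; count: number }

function globalMeanStd(perClient: FeatureStats[]): { mean: number; std: number } {
  const total = perClient.reduce(
    (acc, s) => ({ sum: acc.sum + s.sum, sumSq: acc.sumSq + s.sumSq, count: acc.count + s.count }),
    { sum: 0, sumSq: 0, count: 0 }
  );
  const mean = total.sum / total.count;
  const variance = total.sumSq / total.count - mean * mean;
  return { mean, std: Math.sqrt(Math.max(variance, 0)) }; // guard tiny negative values
}
```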

data upload: small improvements / some data sanity checks

We could add a few more small checks and error messages.
Currently the MNIST example throws a NaN error if not all image types are present, and the Titanic one sometimes complains about the last column missing even when it isn't (maybe it doesn't like a newline at the end of a csv file?). It works nicely with the provided Titanic data though.

Add a status message in the UI once data has been successfully loaded, and another one once training has successfully started.

task list server (and p2p participants list) always running for easier use

Let's have the helper server running on a small instance in the gcloud so the app can always be easily used (even with peer.js).

Not sure how we want to keep the participants list updated at the moment (one or two? maybe one on gcloud?).

  • And if p2p is not available, show a message that the training fell back to local training (training alone) instead.

dynamic networks / fault tolerant training algorithms

The training algorithm should support realistic changes of the communication graph, such as node failures or offline time. This issue only considers non-malicious nodes; Byzantine nodes will be discussed later in separate issues.

We can experiment with some candidate algorithms from the following papers, for example, and test them on the simulator.

Bug: Multiple peer communication (3 or more)

When training in a distributed manner with 3 peers or more, there is a deadlock if one of the peers is not sharing weights and you set a threshold of 2.

The threshold should only be used to wait until you receive all weights before averaging them, but we should also add a time threshold (for example, wait at most 10 seconds).

The file to be fixed is helpers.js in helpers/communication_script/helpers. One first problem I have seen is that the checkArrayLen function in line 120 is useless and creates an infinite loop: the arr argument never changes, yet we loop until its length increases. There is a similar problem with dataReceivedBreak. I think the way to fix this is to have a general object accessible to all of these functions, and then to add a limit on the number of tries in checkArrayLen (a timeout sketch is shown below).
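A sketch of the suggested fix: poll a shared, mutable buffer that the message handlers push into, and give up after a bounded time instead of looping forever (names are illustrative, not the actual helpers.js code):

```ts
// Sketch: wait for up to `threshold` weight updates, but stop after `timeoutMs`.
// `received` must be the same array instance the message handlers push into;
// otherwise the loop can never observe new arrivals (the bug described above).
async function collectWeights<T>(
  received: T[],
  threshold: number,
  timeoutMs = 10_000,
  pollMs = 100
): Promise<T[]> {
  const deadline = Date.now() + timeoutMs;
  while (received.length < threshold && Date.now() < deadline) {
    await new Promise(resolve => setTimeout(resolve, pollMs));
  }
  return received; // may hold fewer than `threshold` updates after a timeout
}
```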

Memory Handling

The main problem is the following:
TFJS has a special way of handling models (and all related TFJS objects) from a memory standpoint. Vue also has a special way of handling objects that are stored in its data hook.
This leads to errors when a TFJS model is saved in the data part of a Vue component (i.e. we are unable to process the model).
The solution so far has been to move the model (contained in the training manager) outside of the component. However, this leads to an error when doing dynamic routing.
To make it simple: all tasks share the same frame, called MainTrainingFrame. In this frame (or the related image or CSV training frame) the training manager is located outside the definition of the component. Hence, even when we change task and a new component is created for that task, the state of the training manager is kept for the new component. This leads to unstable behavior.
So we can't store the training manager (which contains the TFJS model) outside the definition of the component, but we can't move it inside either.

So the solution is to avoid keeping the model stored in an object. For instance, the training manager would not store a "model" variable; each time the actual TFJS model is required, it is loaded by the function that needs it, as sketched below.
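A sketch of that pattern: load the model from storage inside the function that needs it, use it, and dispose of it, so neither Vue's reactivity nor a stale training manager ever holds on to a TFJS model:

```ts
import * as tf from '@tensorflow/tfjs';

// Sketch: run `fn` with a freshly loaded model; no reference survives the call.
async function withModel<T>(
  name: string,
  fn: (model: tf.LayersModel) => Promise<T>
): Promise<T> {
  const model = await tf.loadLayersModel(`indexeddb://${name}`);
  try {
    return await fn(model);
  } finally {
    model.dispose(); // free the model's tensors once we are done
  }
}
```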

privacy by secure aggregation / MPC

To achieve stronger privacy in terms of input privacy (private data but public models), we would like to avoid information leaks from the individual gradients which are communicated. To do so, the following route seems viable:

Use simple additive secure aggregation (part of secure multi-party computation / MPC) over all individual gradients.
This scheme computes a public average/sum of all individual gradient vectors, while keeping each individual vector private; see e.g. https://arxiv.org/abs/2006.04747 for the federated case. A toy sketch of the additive idea is shown below.
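A toy sketch of additive sharing (floating-point masking for illustration only; real schemes work over a finite field with modular arithmetic):

```ts
// Toy sketch: split a gradient into n additive shares that sum back to it.
// Each peer sends one share to every other peer; only sums of shares are revealed.
function makeShares(gradient: Float32Array, n: number): Float32Array[] {
  const shares: Float32Array[] = [];
  const last = Float32Array.from(gradient);
  for (let s = 0; s < n - 1; s++) {
    const mask = Float32Array.from(gradient, () => Math.random() - 0.5);
    for (let j = 0; j < last.length; j++) last[j] -= mask[j];
    shares.push(mask);
  }
  shares.push(last); // gradient minus all masks, so the n shares sum to the gradient
  return shares;
}
```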

Standardisation of the uploading process between images and CSV files

  • Need to unify the uploading process between images and CSV files.
  • I think we can create:
    (1) a File Upload Manager that has all the functions required to update an internal file list,
    (2) a SingleFileFrame vuejs component (with only one uploading box),
    (3) a FileUploadFrame vuejs component.
    The two vuejs components take a File Manager as input. The FileUploadFrame would take an additional prop that defines how many uploading boxes should be created (one for each label).
    This architecture allows us to have the same uploading process for images and CSV files. We can also add a data_type parameter to the File Manager so that we only keep files that are either images or CSV files.
  • The File Manager needs to be able to handle multiple file uploads depending on the labels, meaning that for each document stored in the file list we may need to associate its label.

welcome screen

A landing screen to explain the basics of the app and the main privacy model (similar to the current readme).

At the bottom of it we can have a button to go to the task list ("show available tasks" or similar). We can also place a link here leading to the documentation on how to create a new task.
