
disco's Introduction

DISCO - DIStributed COllaborative Machine Learning

DISCO leverages federated 🌟 and decentralized ✨ learning to allow several data owners to collaboratively build machine learning models without sharing any original data.

The latest version is always running at the following link, directly in your browser, for web and mobile:

🕺 https://epfml.github.io/disco/ 🕺


🪄 DEVELOPERS: Have a look at our developer guide


❓ WHY DISCO?

  • To build deep learning models across private datasets without compromising data privacy, ownership, sovereignty, or model performance
  • To create an easy-to-use platform that allows non-specialists to participate in collaborative learning

βš™οΈ HOW DISCO WORKS

  • DISCO follows a public model – private data approach
  • Private and secure model updates – not data – are communicated to either:
    • a central server: federated learning ( 🌟 )
    • directly between users: decentralized learning ( ✨ ), i.e. no central coordination
  • Model updates are then securely aggregated into a trained model (see the sketch below)
  • See more HERE
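To make the aggregation step concrete, here is a minimal sketch in TypeScript with hypothetical names (an illustration, not DISCO's actual API): each participant trains locally, and only the resulting weight updates are averaged.

```ts
// Hypothetical sketch: federated averaging of weight updates (not DISCO's actual API).
type Weights = Float32Array[]; // one Float32Array per model layer

// Average the updates received from all participants, layer by layer.
function aggregate(updates: Weights[]): Weights {
  return updates[0].map((layer, i) => {
    const avg = new Float32Array(layer.length);
    for (const update of updates) {
      for (let j = 0; j < layer.length; j++) {
        avg[j] += update[i][j] / updates.length;
      }
    }
    return avg;
  });
}
```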

❓ DISCO TECHNOLOGY

  • DISCO supports arbitrary deep learning tasks and model architectures via TF.js
  • ✨ relies on peer-to-peer communication
  • Have a look at how DISCO ensures privacy and confidentiality HERE

🧪 RESEARCH-BASED DESIGN

DISCO aims to enable open-access, easy-to-use distributed training which is

  • 🌪️ efficient (R1, R2)
  • 🔒 privacy-preserving (R3, R4)
  • 🛠️ fault-tolerant and dynamic over time (R5)
  • 🥷 robust to malicious actors and data poisoning (R6, R7)
  • 🍎 🍌 interpretable in imperfectly interoperable data distributions (R8)
  • 🪞 personalizable (R9)
  • 🥕 fair in how it incentivizes participation

🏁 HOW TO USE DISCO

  • Start by exploring our example DISCOllaboratives on the Tasks page.
  • The example models are based on popular datasets such as Titanic, MNIST, or CIFAR-10.
  • It is also possible to create your own task without coding on the custom training page:
    • Upload the initial model
    • Choose from several existing data loaders
    • Choose between federated and decentralized as your DISCO training scheme... connect your data and... done! 📊
    • For more details on ML tasks and custom training, have a look at this guide

Note: currently only CSV and image data types are supported. Adding new data types, preprocessing code, or data loaders is possible in developer mode (see the developer guide).


JOIN US

  • You are welcome to join us on Slack

disco's People

Contributors

annie-light, aunell, batuhanfaik, dependabot[bot], dgengler6, eduarddurech, francesco98, giordano-lucas, giorgiosav, gozgun, grim-bot, ineiti, jiafanliu, julienvig, laurislopata, lippoldt, lucastrg, marceltorne, martinjaggi, mmilenkoski, morganridel, nacho114, paulmansat, peacefulotter, s314cy, saipraneet, tharvik, walidabn


disco's Issues

Bug: LUS-Covid task Model Storage

When I train a model on the LUS-Covid task (everything works properly, reaching a training accuracy of around 88% and a validation accuracy of around 75%), then leave the application and train the model again, I get a validation accuracy of 30% and a training accuracy of 50%.

I suspect there must be some problem with the creation of new models, since the way I managed to fix it was to delete the models from storage, refresh the page, and restart the application workflow from the task list.

Maybe the lus-covid task is not properly linked with the memory managers that were implemented afterwards.

Fix: Port numbers are hardcoded

At the moment, port numbers are hardcoded in the communication_manager file. Inside this file we should always use the portNbr attribute passed to the constructor, and each task should then pass its corresponding port number to the communication_manager.

build a decentralized age detector from webcam images

As an example use case, let's build a decentralized model for browser-based images.

We can start from a stand-alone version based on
https://github.com/justadudewhohacks/face-api.js/ (uses TF.js)
see for example
https://www.codeproject.com/Articles/5276827/AI-Age-Estimation-in-the-Browser-using-face-api-an

On top of it, we only need to extract gradients to incorporate it into collaborative DeAI training. This will also be a good use case to refine our task descriptions, image preprocessing pipelines, and UI.

Implement RelaySGD

We are currently using an all-reduce scheme for model averaging. We should add the option to use RelaySGD. I will start working on this.

allow local data upload (for .csv) on client

After the client has joined a task (see issue #27), allow the client to upload (via an HTML form) a local training dataset to the browser's local storage.

  • for .csv data (start e.g. with a standard public dataset, such as Titanic or Adult)
  • add checks that the uploaded data is compatible with the required feature format from the task description (see the sketch below)
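One possible shape for such a check, as a sketch (names are illustrative; `expectedColumns` would come from the task description):

```ts
// Sketch: validate an uploaded CSV header against the task's expected features.
function findMissingColumns(csvText: string, expectedColumns: string[]): string[] {
  const header = csvText.split('\n')[0].trim().split(',').map(c => c.trim());
  return expectedColumns.filter(col => !header.includes(col));
}

// Usage: reject the upload with a helpful message if any column is missing.
// const missing = findMissingColumns(fileContents, task.features);
// if (missing.length > 0) { /* show "Missing columns: ..." to the user */ }
```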

Delete unused components

With the modularisation changes, some components are never used. Once every PR is merged to the master branch, delete these unwanted files (i.e. ImageUploadFrame, GlobalTaskFrame, CSVUploadFrame, ...).

switch from localstorage to indexedDB

Switch from local storage to IndexedDB. This will allow a greater memory space for models.
Change the serialization of the model: TensorFlow.js uses a different file layout when a model is saved to local storage versus IndexedDB.
IndexedDB should be served by one separate script. A minimal sketch is shown below.
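TF.js already understands an indexeddb:// URL scheme in model.save and tf.loadLayersModel, so the storage switch itself can be as small as changing the URL prefix; a minimal sketch:

```ts
import * as tf from '@tensorflow/tfjs';

// Save to IndexedDB (larger quota) instead of 'localstorage://<name>'.
async function saveAndReload(model: tf.LayersModel, name: string): Promise<tf.LayersModel> {
  await model.save(`indexeddb://${name}`);
  return tf.loadLayersModel(`indexeddb://${name}`);
}
```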

generic data processing capabilities I

Generalize and improve the current data processing pipeline, with the goals of allowing easy use of data within projects, facilitating technical debugging, making the data science background and challenges easier to understand, and separating those challenges from one another.

In a minimalistic setting, this includes:

  1. Catching data-related errors as soon as possible in the MLOps pipeline
  2. Providing functions that connect the tasks with the interface to include/exclude data types from loading
  3. Checking the data loading success status

In addition, the following features would be nice to have (now or later):

  4. Ability to get a data-science-style diagnosis of the data in the interface (label distribution, before/after training evaluation)
  5. Capabilities to handle advanced bugs relating to data loading, i.e. handle corrupt/invalid data in an optimal way
  6. Minimize RAM/VRAM usage and possibly provide an overview in the UI

As a follow-up, this would require the current tasks/projects to be re-tested and verified, possibly with changes to the testing functions currently set up separately. Some tasks might be better connected to the UI, and UI changes would improve the overall usability (possibly in a separate issue).

If the data loaders work similarly to the Python version, a lot of these tasks can be handled naturally, but some subtasks would require follow-ups.

allow local data upload for image data

Like #28, but for images.

  • for MNIST or other image data (images with labels; maybe one folder per label for a simpler user experience?)
  • check whether local storage or standard files in HTML/JS work best for this
  • add checks that the uploaded data is compatible with the required feature format from the task description (see the sketch below)
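A sketch of how the browser side could turn an uploaded file into a training tensor, assuming images are resized and normalized to the task's input shape (names are illustrative):

```ts
import * as tf from '@tensorflow/tfjs';

// Sketch: convert an uploaded image file into a normalized tensor.
async function fileToTensor(file: File, size: [number, number]): Promise<tf.Tensor3D> {
  const bitmap = await createImageBitmap(file); // standard browser API
  return tf.tidy(() => {
    const pixels = tf.browser.fromPixels(bitmap);          // HxWx3, values in [0, 255]
    const resized = tf.image.resizeBilinear(pixels, size); // match the model's input shape
    return resized.div(255) as tf.Tensor3D;                // normalize to [0, 1]
  });
}
```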

evaluate WebRTC / javascript as backend network stack

As suggested by @Saipraneet, as an alternative to stand-alone application frameworks such as libp2p (see #3), we should evaluate browser-based communication frameworks. This could be advantageous in that it might be easier to get it to run on desktops and mobiles.

WebRTC is supported in modern browsers. On top of it there are several JavaScript libraries, e.g. peerJS, and in particular simple-peer looks very suitable (a connection sketch is shown below).

Could this be suitable for communicating gradients as in ML training, supporting desktop or mobile phone OSs (e.g. CoreML?), and for interfacing with e.g. PyTorch or similar schemes?

If so, we could try something like this to build a p2p communication prototype (#5)? What do people think?
Seems we need some help here...
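For reference, exchanging a payload over simple-peer looks roughly like this (a sketch; the signaling channel that carries the offer/answer is left out, and the weights payload is a placeholder):

```ts
import Peer from 'simple-peer';

// One side initiates; both sides exchange signaling data out of band,
// e.g. via a small helper server.
const peer = new Peer({ initiator: true, trickle: false });

peer.on('signal', data => {
  // send `data` to the remote peer over the signaling channel
});
// peer.signal(remoteData); // feed the remote peer's signaling data back in

peer.on('connect', () => {
  // placeholder payload standing in for serialized gradients/weights
  peer.send(JSON.stringify({ type: 'weights', payload: [0.1, 0.2] }));
});

peer.on('data', data => {
  const message = JSON.parse(data.toString()); // simple-peer delivers a Buffer
  // aggregate the received weights here
});
```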

simulation code - step3, decentralized SGD as a reference implementation, asynchronous / time-varying graphs

Depends on #1. Also related to #2.

Modify the reference code (simulated decentralized) for a given communication graph, on a standard/toy dataset.

Incorporate a basic asynchronous model, i.e. allow node and edge failures in SGD; in other words, a few variants of time-varying graphs. This can also be used to simulate some realistic notions of fault tolerance.

This will be used later to compare the p2p version against, and to test different algorithm variants before implementing them in the real p2p framework.

WebGL context lost error

I encounter an issue with WebGL when trying to train an image model (lus or mnist) on a large dataset.

Full error: (screenshot attached: lus-deai-error)

Host each task as a separate Google App Engine service

Now that we use Google App Engine (GAE) to host our app, we can no longer use different ports for different tasks, since we access the server through a domain. Instead, we can host each task as a separate service within our app. In this way, each task can have its own domain. The potential benefit is that we do not have to interrupt existing tasks to add a new task. I haven't tried this yet. Let me know what you think and whether you have any alternative ideas.

Ping: @martinjaggi, @tvogels

GAE Reference:
https://cloud.google.com/appengine/docs/standard/nodejs/an-overview-of-app-engine

allow local test sets

Given a task description (see #26), combined with a local dataset which was uploaded locally (see e.g. #28):

  • allow the user to define a local test set (button in the UI?);
    for example, just remove 20% from the local train set and mark it as the test set.

  • implement the performance metric (accuracy) of a model on that test set.

  • add the performance metric description to the task description.

  • display the current test accuracy in the UI locally, and update it once a model is received (a split/accuracy sketch is shown below).
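A plain TypeScript sketch of the split and the metric (names are illustrative):

```ts
// Sketch: hold out 20% of the local samples as a test set (Fisher-Yates shuffle).
function trainTestSplit<T>(samples: T[], testFraction = 0.2): { train: T[]; test: T[] } {
  const shuffled = [...samples];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const nTest = Math.floor(shuffled.length * testFraction);
  return { test: shuffled.slice(0, nTest), train: shuffled.slice(nTest) };
}

// Accuracy of predicted class indices against the held-out labels.
function accuracy(predicted: number[], labels: number[]): number {
  const correct = predicted.filter((p, i) => p === labels[i]).length;
  return correct / labels.length;
}
```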

Allow task owner to set a password for joining a task

To allow joint training within a smaller set of trusted participants only (in addition to public tasks):

  • Add an optional password hash to the task description.
  • If present, new users will be asked for the password before being able to join the task (like "joining this task is password restricted").
  • Add this to the documentation of how users can create a new task.

This functionality basically recovers federated learning as a special case. It does not give the same security standard as a full PKI, and it is not a replacement for a mechanism that selects helpful clients (as opposed to Byzantine ones), but it is a start (see the same discussions in the federated learning literature). A hashing sketch is shown below.
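A sketch of the check using the browser's built-in SubtleCrypto, so the task description only ever stores a digest (note: unsalted SHA-256 is a weak password hash, fine only for this kind of soft gating):

```ts
// Sketch: compare the entered password's SHA-256 digest to the stored hash.
async function sha256Hex(password: string): Promise<string> {
  const bytes = new TextEncoder().encode(password);
  const digest = await crypto.subtle.digest('SHA-256', bytes);
  return Array.from(new Uint8Array(digest))
    .map(b => b.toString(16).padStart(2, '0'))
    .join('');
}

async function mayJoin(entered: string, taskPasswordHash?: string): Promise<boolean> {
  if (taskPasswordHash === undefined) return true; // public task
  return (await sha256Hex(entered)) === taskPasswordHash;
}
```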

Byzantine robust decentralized algorithm variants

Support a subset of nodes which behave arbitrarily maliciously (sending arbitrary messages to neighbors instead of true gradients).

Postponed for now, until the honest-workers model is implemented and evaluated and the failure model for nodes is supported.

Model inference

Check the MNIST model inference, i.e. the predictions for test digits.

Test Procedure

  1. Train the MNIST model locally on 10 digits (as attached below): 0-9, reaching 100% training and 100% validation accuracy
  2. Save the model by clicking on the button
  3. Test the model with the digits used during training

The resulting CSV file output for all digits (uploaded in order from smallest to largest digit) is attached to this post.
(Screenshot attached: Screenshot 2021-06-21 at 15 40 33)

Digits used for training and testing (screenshot attached: Screenshot 2021-06-21 at 14 48 39), with the attached image files:
0-32, 1-32, 2-32, 3-32, 4-32, 5-32, 6-32, 7-32, 8-32, 9-filled-32

Setup: tested on a local machine, nvm v16.3.0, npm 7.15.1

Console log
[Log] Start: Processing Uploaded File
[Log] User File Validated. Start parsing.
[Log] Start Training
[Log] _________________________________________________________________
[Log] Layer (type) Output shape Param #
[Log] =================================================================
[Log] conv2d_Conv2D1 (Conv2D) [null,26,26,16] 448
[Log] _________________________________________________________________
[Log] max_pooling2d_MaxPooling2D1 [null,13,13,16] 0
[Log] _________________________________________________________________
[Log] conv2d_Conv2D2 (Conv2D) [null,11,11,32] 4640
[Log] _________________________________________________________________
[Log] max_pooling2d_MaxPooling2D2 [null,5,5,32] 0
[Log] _________________________________________________________________
[Log] conv2d_Conv2D3 (Conv2D) [null,3,3,32] 9248
[Log] _________________________________________________________________
[Log] flatten_Flatten1 (Flatten) [null,288] 0
[Log] _________________________________________________________________
[Log] dense_Dense1 (Dense) [null,64] 18496
[Log] _________________________________________________________________
[Log] dense_Dense2 (Dense) [null,10] 650
[Log] =================================================================
[Log] Total params: 33482
[Log] Trainable params: 33482
[Log] Non-trainable params: 0
[Log] _________________________________________________________________
[Log] Proxy
[Log] EPOCH (1): Train Accuracy: 100.00,
Val Accuracy: 100.00

[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (2): Train Accuracy: 100.00,
Val Accuracy: 100.00

[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (3): Train Accuracy: 100.00,
Val Accuracy: 100.00

[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (4): Train Accuracy: 100.00,
Val Accuracy: 100.00

[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (5): Train Accuracy: 100.00,
Val Accuracy: 100.00

[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (6): Train Accuracy: 100.00,
Val Accuracy: 100.00

[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (7): Train Accuracy: 100.00,
Val Accuracy: 100.00

[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (8): Train Accuracy: 100.00,
Val Accuracy: 100.00

[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (9): Train Accuracy: 100.00,
Val Accuracy: 100.00

[Log] loss 0.1054
[Log] Proxy
[Log] EPOCH (10): Train Accuracy: 100.00,
Val Accuracy: 100.00

[Log] loss 0.1054
[Log] mnist-model
[Log] Deactivated
[Log]

[Log] Loading model...
[Log] Model loaded.
[Log] Prediction Sucessful!
[Log] undefined
[Log] Object

simulation code - step2, decentralized SGD as a reference implementation, static graph

Depends on #6.

Get an easy-to-use reference code for training (simulated decentralized, so just running locally without any p2p backend) on any given communication graph, on a standard/toy dataset. This will be useful later to compare the p2p version against.

For simplicity we'll first assume all nodes perform one step of SGD (gradient and communication) per clock step, and that the underlying communication graph remains fixed (the code should allow giving an arbitrary graph as input). A simulation sketch is shown below.
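A minimal simulation sketch of one clock step (plain TypeScript rather than the eventual framework; W is a row-stochastic mixing matrix derived from the fixed communication graph, e.g. uniform weights over each node's neighbors):

```ts
// Sketch: one synchronous round of simulated decentralized SGD on a fixed graph.
// models[i] is node i's parameter vector; W[i][j] > 0 only if j is a neighbor of i
// (or j === i), with each row of W summing to 1.
function decentralizedStep(
  models: number[][],
  W: number[][],
  gradient: (params: number[], node: number) => number[], // node's local gradient
  lr: number
): number[][] {
  // local SGD step on each node's own data
  const stepped = models.map((params, i) => {
    const g = gradient(params, i);
    return params.map((v, k) => v - lr * g[k]);
  });
  // gossip averaging with the neighbors given by W
  return stepped.map((_, i) =>
    stepped[0].map((_, k) => stepped.reduce((acc, xj, j) => acc + W[i][j] * xj[k], 0))
  );
}
```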

offer task descriptions from server (and design task description format)

The auxiliary server needs to be expanded to host task descriptions. Each task description should contain:

  • title of the task
  • description string (maybe in md format)
  • features description (a list for tabular data, or e.g. an image format description) and label description; best to include an example datapoint
  • DL model architecture (TF.js model.save)
  • training hyperparameters (learning rate, dropout, etc.)
  • initial model weights (can also be used for sharing model snapshots later, or when new nodes join)

To share it with the client, we could also organize the task descriptions via JSON, for example, for the machine-readable variant (a possible shape is sketched below).

Once it works, convert the dummy CSV task (Adult dataset) and the MNIST (image) task to this format and make sure they still work.

Human-readable format: the task description should also be easy to visualize in human-readable form, in HTML (so it can be served by the server as well as locally on the client).
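As a starting point for the machine-readable variant, a hypothetical TypeScript shape that serializes directly to JSON (field names are illustrative, not a fixed spec):

```ts
// Hypothetical task description shape, serializable as JSON.
interface TaskDescription {
  title: string;
  description: string;          // human-readable, e.g. in md format
  features: string[] | string;  // column list for tabular data, or an image format description
  labels: string[];
  exampleDatapoint?: unknown;   // helps users check their data format
  modelArchitectureUrl: string; // model.json produced by TF.js model.save()
  hyperparameters: {
    learningRate: number;
    epochs: number;
    batchSize: number;
    dropout?: number;
  };
  initialWeightsUrl?: string;   // reusable for model snapshots and late joiners
}
```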

.csv column reassignment UI

Small simplification suggestion (need only id ← edit field).

Should look like this (or even on the same line):
(screenshot attached: Screen Shot 2021-06-22 at 00 23 03)

Current look:
(screenshot attached: Screen Shot 2021-06-22 at 00 20 00)

Add changes requested for modularisation

  • rename the model memory to "My Model Library"
  • change "titanic model information" to "titanic task information"
  • add the Node version to the readme
  • update the naming convention (snake case)
  • fix folder names (no spaces, respect the snake case convention)
  • remove the peers server from the readme (or add it later as optional, or test with gcloud; Martin M has set up credentials already)
  • update the testing component for images (take the MNIST testing component and modularise it)

simulation code - step1, data distribution

Provide a simulated decentralized code (not using any p2p backend, but instead just running locally), which holds a communication graph and distributes a standard/toy ML dataset among the nodes.

Data distribution should support both random and heterogeneous / non-iid splits (for example, different labels for each node).

We can use standard PyTorch code examples, e.g. MNIST or CIFAR.

Image data type in task file

As our repository currently loads JS images into tensors, we do not have strict requirements on explicit image data types.
Remove the requirements from the task file.
Possibly also check special image formats and test them (transparency in images, SVG).

Testing component for CSV files

We now have the training component for CSV files, but no testing component for them. Introduce a testing component for CSV files.
The idea is simple: the testing process can be viewed as a function that takes as input a standard CSV file for the task but without the label column. It then returns the same CSV file, but with the labels filled in.

cifar / imagenet

CIFAR-10 or CIFAR-100, or even go to ImageNet directly :)

In the task description we can maybe include a link for how people can download an (arbitrary) part of the official dataset from somewhere. It is a bit unclear what format would be best; it is probably too large to 'upload' as individual images in the UI.

ImageNet could work too if lots of people joined and everyone held only a very small part of the data.

Support (decentralized) Normalization for Tabular Datasets

For tabular datasets (popular examples: Adult income and Titanic), normalization is critical for neural network approaches.

The most typical, and very effective, way to normalize is to "subtract the mean and divide by the standard deviation". However, computing these statistics in a decentralized fashion is non-trivial. For DeAI to support this, additional functionality needs to be implemented.

Examples of how this can be addressed:

  • Provide means and standard deviations for all features based on some a priori knowledge. Each participant is then asked to normalize their data according to this standard before uploading.
  • Learn means and standard deviations as a pre-learning task, whose result is then automatically applied to each local dataset. This could be a full DeAI training cycle, or a simple weighted average which is communicated democratically (a sketch of the aggregation is shown below).
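For the second option, the statistics can be aggregated from per-client sums without sharing any raw data; a sketch for a single feature (the aggregate is a plain sum, so it also composes with secure aggregation):

```ts
// Sketch: each client reports only (sum, sum of squares, count) for a feature.
interface FeatureStats { sum: number; sumSq: number; count: number }

function globalMeanStd(perClient: FeatureStats[]): { mean: number; std: number } {
  const total = perClient.reduce(
    (acc, s) => ({ sum: acc.sum + s.sum, sumSq: acc.sumSq + s.sumSq, count: acc.count + s.count }),
    { sum: 0, sumSq: 0, count: 0 }
  );
  const mean = total.sum / total.count;
  const variance = total.sumSq / total.count - mean * mean;
  return { mean, std: Math.sqrt(Math.max(variance, 0)) }; // guard tiny negative values
}
```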

data upload: small improvements / some data sanity checks

We could add a few more small checks and error messages.
Currently the MNIST example throws a NaN error if not all image types are present, and the Titanic one sometimes complains about the last column missing even when it isn't (maybe it doesn't like a newline at the end of a csv file?). It works nicely with the provided Titanic data though.

Add a status message in the UI once data has been successfully loaded, and another one once training has successfully started.

task list server (and p2p participants list) always running for easier use

Let's have the helper server running on a small instance in the gcloud so the app can always be easily used (even with peer.js).

Not sure how we want to keep the participants list updated at the moment (one or two? maybe one on gcloud?).

  • And if p2p is not available, show a message that the training fell back to local training (training alone) instead.

dynamic networks / fault tolerant training algorithms

The training algorithm should support realistic changes of the communication graph, such as node failures or offline time. This issue only considers non-malicious nodes; Byzantine nodes will be discussed later in separate issues.

We can experiment with some candidate algorithms from the following papers, for example, and test them on the simulator.

Bug: Multiple peer communication (3 or more)

When training in a distributed manner with 3 peers or more, there is a deadlock if one of the peers is not sharing weights and you set a threshold of 2.

The threshold should only be used to wait until you receive all weights before averaging them, but we should also add a time threshold (for example, wait at most 10 seconds).

The file to be fixed is helpers.js in helpers/communication_script/helpers. One first problem I have seen is that the checkArrayLen function in line 120 is useless and creates an infinite loop: the arr argument never changes, yet we loop until its length increases. There is a similar problem with dataReceivedBreak. I think the way to fix this is to have a general object accessible to all of these functions, and then to add a limit on the number of tries in checkArrayLen (a timeout sketch is shown below).
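A sketch of the suggested fix: poll a shared, mutable buffer that the message handlers push into, and give up after a bounded time instead of looping forever (names are illustrative, not the actual helpers.js code):

```ts
// Sketch: wait for up to `threshold` weight updates, but stop after `timeoutMs`.
// `received` must be the same array instance the message handlers push into;
// otherwise the loop can never observe new arrivals (the bug described above).
async function collectWeights<T>(
  received: T[],
  threshold: number,
  timeoutMs = 10_000,
  pollMs = 100
): Promise<T[]> {
  const deadline = Date.now() + timeoutMs;
  while (received.length < threshold && Date.now() < deadline) {
    await new Promise(resolve => setTimeout(resolve, pollMs));
  }
  return received; // may hold fewer than `threshold` updates after a timeout
}
```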

Memory Handling

The main problem is the following:
TFJS has a special way of handling models (and all related TFJS objects) from a memory standpoint. Vue also has a special way of handling objects that are stored in its data hook.
This leads to errors when a TFJS model is saved in the data part of a Vue component (i.e. we are unable to process the model).
The solution so far has been to move the model (contained in the training manager) outside of the component. However, this leads to an error when doing dynamic routing.
To make it simple: all tasks share the same frame, called MainTrainingFrame. In this frame (or the related image or CSV training frame) the training manager is located outside the definition of the component. Hence, even when we change task and a new component is created for that task, the state of the training manager is kept for the new component. This leads to unstable behavior.
So we can't store the training manager (which contains the TFJS model) outside the definition of the component, but we can't move it inside either.

So the solution is to avoid keeping the model stored in an object. For instance, the training manager would not store a "model" variable; each time the actual TFJS model is required, it is loaded by the function that needs it, as sketched below.
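A sketch of that pattern: load the model from storage inside the function that needs it, use it, and dispose of it, so neither Vue's reactivity nor a stale training manager ever holds on to a TFJS model:

```ts
import * as tf from '@tensorflow/tfjs';

// Sketch: run `fn` with a freshly loaded model; no reference survives the call.
async function withModel<T>(
  name: string,
  fn: (model: tf.LayersModel) => Promise<T>
): Promise<T> {
  const model = await tf.loadLayersModel(`indexeddb://${name}`);
  try {
    return await fn(model);
  } finally {
    model.dispose(); // free the model's tensors once we are done
  }
}
```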

privacy by secure aggregation / MPC

To achieve stronger privacy in terms of input privacy (private data but public models), we would like to avoid information leaks from the individual gradients which are communicated. To do so, the following route seems viable:

Use simple additive secure aggregation (part of secure multi-party computation / MPC) over all individual gradients.
This scheme computes a public average/sum of all individual gradient vectors, while keeping each individual vector private; see e.g. https://arxiv.org/abs/2006.04747 for the federated case. A toy sketch of the additive idea is shown below.
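A toy sketch of additive sharing (floating-point masking for illustration only; real schemes work over a finite field with modular arithmetic):

```ts
// Toy sketch: split a gradient into n additive shares that sum back to it.
// Each peer sends one share to every other peer; only sums of shares are revealed.
function makeShares(gradient: Float32Array, n: number): Float32Array[] {
  const shares: Float32Array[] = [];
  const last = Float32Array.from(gradient);
  for (let s = 0; s < n - 1; s++) {
    const mask = Float32Array.from(gradient, () => Math.random() - 0.5);
    for (let j = 0; j < last.length; j++) last[j] -= mask[j];
    shares.push(mask);
  }
  shares.push(last); // gradient minus all masks, so the n shares sum to the gradient
  return shares;
}
```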

Standardisation of the uploading process between images and CSV files

  • Need to unify the uploading process between images and CSV files.
  • I think we can create:
    (1) a File Upload Manager that has all the functions required to update an internal file list,
    (2) a SingleFileFrame vuejs component (with only one uploading box),
    (3) a FileUploadFrame vuejs component.
    The two vuejs components take a File Manager as input. The FileUploadFrame would take an additional prop that defines how many uploading boxes should be created (one for each label).
    This architecture allows us to have the same uploading process for images and CSV files. We can also add a data_type parameter to the File Manager so that we only keep files that are either images or CSV files.
  • The File Manager needs to be able to handle multiple file uploads depending on the labels, meaning that for each document stored in the file list we may need to associate its label.

welcome screen

A landing screen to explain the basics of the app and the main privacy model (similar to the current readme).

At the bottom of it we can have a button to go to the task list ("show available tasks" or similar). We can also place a link here leading to the documentation on how to create a new task.
