
Fed-BioMed

Introduction

Fed-BioMed is an open source project focused on empowering biomedical research using non-centralized approaches for statistical analysis and machine learning.

The project is currently based on Python, PyTorch and Scikit-learn, and enables the development and deployment of collaborative learning analyses in real-world machine learning applications, including federated learning and federated analytics.

The code is regularly released and available on the master branch of this repository. The documentation of the releases can be found at https://fedbiomed.org

Curious users may also be interested in the current developments in the develop branch (https://github.com/fedbiomed/fedbiomed/tree/develop). According to our coding rules, the develop branch is usable and its tests and tutorials will run, but the documentation may be incomplete or out of sync with the code. We only provide support for the latest release, i.e. the master branch.

Install and run in development environment

Fed-BioMed is developed under Linux Fedora and Ubuntu, and should be easily portable to other Linux distributions. It also runs smoothly on macOS and in Windows WSL2.

This README.md file provides a quick start/installation guide for Linux.

Full installation instructions are also available at: https://fedbiomed.org/latest/tutorials/installation/0-basic-software-installation/

An installation guide is also provided for Windows 11, which relies on WSL2: https://fedbiomed.org/latest/user-guide/installation/windows-installation/

Prerequisites

To ensure Fed-BioMed works properly, first install:

  • conda
  • git

Docker is no longer needed for the development environment, unless you use secure aggregation with MP-SPDZ for multi-party computation and plan to rebuild the Shamir protocol binary.

clone repo

To run the software, clone the Fed-BioMed repository:

git clone -b master https://github.com/fedbiomed/fedbiomed.git

Fed-BioMed developers clone the repository via SSH:

git clone git@github.com:fedbiomed/fedbiomed.git

setup conda environments

  • to create or update the environments, you can use the configure_conda script:
$ ./scripts/configure_conda
  • this script will create/update the conda environments

  • there is one specific environment for each component:

    • fedbiomed-node.yaml : environment for the node part
    • fedbiomed-researcher.yaml : environment for the researcher part
    • fedbiomed-gui.yaml : environment for the data management gui on the node

Remark:

  • this script can also be used to update only some of the environments
  • for some components, we provide different versions of the yaml files depending on the operating system of your host
  • in case of (conda or python) errors, we advise removing all environments and starting fresh (use the -c flag of configure_conda)
  • general usage for this script is:
Usage: configure_conda [-n] [-c] [-t] [ENV ENV ..]

Install/update conda environments for fedbiomed. If several ENV
are provided, only these components will be updated. If no ENV is
provided, all components will be updated.

ENV can be node, researcher, gui (or a combination of them)

 -h, --help            this help
 -n, --dry-run         do nothing, just print what the script would do
 -c, --clean           remove environment before reinstalling it
 -t, --test            test the environment at the end of installation
                       (this only tests the researcher environment for now)

activate the environments

In a terminal, you can configure your shell to work interactively on a specific component, with the right conda environment and the right PYTHONPATH.

WARNING: this script only works for bash, ksh and zsh. It is not compliant with C variants of the shell (csh/tcsh/...).

source ./scripts/fedbiomed_environment ENV

where ENV is chosen from:

  • node
  • researcher
  • gui

run the software

run the node part

  • in a new terminal:
$ ./scripts/fedbiomed_run node start
  • this will launch a new node

  • you may also upload new data on this node with:

$ ./scripts/fedbiomed_run node dataset add
  • you may also specify a different config file for the node (useful when running multiple test nodes on the same host)
$ ./scripts/fedbiomed_run node --config another_config.ini start
  • if you want to change the default IP address used to join the fedbiomed researcher component (localhost), you can provide it at launch time:
$ RESEARCHER_SERVER_HOST=192.168.0.100 ./scripts/fedbiomed_run node start
$ RESEARCHER_SERVER_HOST=192.168.0.100 ./scripts/fedbiomed_run researcher start

(adjust the 192.168.0.100 IP address to your configuration)

If this option is given at the first launch or after a clean, it is saved in the configuration file and becomes the default for subsequent launches. If this option is given at a subsequent launch, it only affects this launch.

run a researcher notebook

  • in a new terminal:
$ ./scripts/fedbiomed_run researcher start
  • this will launch a new Jupyter notebook server working in the notebooks directory. Some notebooks are available:

    • 101_getting-started.ipynb : training a simplenet + federated average on MNIST data
    • pytorch-local-training.ipynb : comparing the simplenet + federated average on MNIST data with its local training equivalent

run a researcher script

  1. in a new terminal:
$ source ./scripts/fedbiomed_environment researcher
  2. convert the notebook to a python script:
jupyter nbconvert --output=101_getting-started --to script ./notebooks/101_getting-started.ipynb
  3. then you can run any researcher script:
$ python ./notebooks/101_getting-started.py

change IP address for researcher in the current bash

By default, fedbiomed-node contacts fedbiomed-researcher on localhost. To configure your current shell to use another IP address for joining fedbiomed-researcher (e.g. 192.168.0.100):

source ./scripts/fedbiomed_environment node 192.168.0.100
source ./scripts/fedbiomed_environment researcher 192.168.0.100

Then launch the components with the usual commands while you are in the current shell.

Warning: this option does not modify the existing configuration file (.ini file).

clean state (restore environments back to new)

De-configure environments, and remove all configuration files and caches:

source ./scripts/fedbiomed_environment clean

Install and run in vpn+development environment

Prerequisites

To use the docker + VPN mode you need to install:

  • docker
  • docker compose v2 (aka docker compose plugin)

Files

The envs/vpn directory contains all material for VPN support. A full technical description is provided in envs/vpn/README.md

The ./scripts/fedbiomed_vpn script is provided to ease the deployment of a set of docker container(s) with VPN support. The provided containers are:

  • fedbiomed/vpn-vpnserver: WireGuard server
  • fedbiomed/vpn-researcher: a researcher Jupyter notebook
  • fedbiomed/vpn-node: a node component
  • fedbiomed/vpn-gui: a GUI for managing node component data

All these containers communicate through the WireGuard VPN.

Setup and run all the docker containers

To setup all these components, you should:

  • clean all containers and files
./scripts/fedbiomed_vpn clean
  • build all the docker containers
./scripts/fedbiomed_vpn build
  • configure the wireguard encryption keys of all containers
./scripts/fedbiomed_vpn configure
  • start the containers
./scripts/fedbiomed_vpn start
  • check the containers status (presence and Wireguard configuration)
./scripts/fedbiomed_vpn status
  • run a fedbiomed_run command inside the node component, e.g.:
./scripts/fedbiomed_vpn node dataset add --mnist /data
./scripts/fedbiomed_vpn node list
./scripts/fedbiomed_vpn node start
  • connect to the researcher jupyter at http://127.0.0.1:8888 (Remark: the researcher docker container automatically starts a jupyter notebook inside the container)

  • manage data inside the node with the node GUI at http://127.0.0.1:8484

  • stop the containers:

./scripts/fedbiomed_vpn stop

managing individual containers

You can manage the containers individually during the build/stop/start phases by passing the name of the container(s) on the command line.

For example, to build only the node, you can use:

./scripts/fedbiomed_vpn build node

You can build/configure/stop/start/check more than one component at a time. Example:

./scripts/fedbiomed_vpn build gui node

This will build the gui and node containers.

The list of the container names is:

  • vpnserver
  • researcher
  • node
  • gui

Remarks:

  • the configuration files are kept when rebuilding individual containers
  • to remove the old config files, you should do a clean
  • restarting only vpnserver while the others are running may lead to unpredictable behavior. In this case, it is advised to restart from scratch (clean/build/configure/start)

Misc developer tools to help debugging

scripts/lqueue

List the content of a message queue (as used in fedbiomed.node and fedbiomed.researcher).

usage: lqueue directory or lqueue dir1 dir2 dir3 ...

scripts/run_end_to_end_test

Run a full (end to end) test by launching:

  • a researcher (running a python script or a notebook script)
  • several nodes, providing data

Useful for continuous integration tests and notebook debugging. Full documentation is in the tests/README.md file.

Documentation

The required python modules must be installed to build or serve the documentation page. These packages can be installed using a conda environment (recommended):

conda env update -f envs/build/conda/fedbiomed-doc.yaml
conda activate fedbiomed-doc

They can also be installed using pip (python 3.11 required), as in the real build process (if you know what you're doing).

  • Warning: if you are not using a conda or pip virtual environment, your global Python installation will be modified.
pip install -r envs/development/docs-requirements.txt

Please use the following command to serve the documentation page. This allows you to test/verify changes in the docs and also in doc-strings.

cd ${FEDBIOMED_DIR}
./scripts/docs/fedbiomed_doc.sh serve

Please see usage for additional options.

cd ${FEDBIOMED_DIR}
./scripts/docs/fedbiomed_doc.sh --help

Using Tensorboard

To enable tensorboard during the training routine and see loss values, set the tensorboard parameter to True when initializing the Experiment class.

exp = Experiment(tags=tags,
                 #nodes=None,
                 model_args=model_args,
                 training_plan_class=MyTrainingPlan,
                 training_args=training_args,
                 round_limit=round_limit,
                 aggregator=FedAverage(),
                 node_selection_strategy=None,
                 tensorboard=True
                )

Or after initialization:

exp.set_tensorboard(True)

During training, the scalar values (loss) will be written in the runs directory. You can start tensorboard either from a Jupyter notebook or from a terminal.

Start tensorboard from notebook

First, import TENSORBOARD_RESULTS_DIR from the researcher environment in a notebook cell:

from fedbiomed.researcher.environ import environ
tensorboard_dir = environ['TENSORBOARD_RESULTS_DIR']

Load the tensorboard extension in a different code block:

%load_ext tensorboard

Run the following command to start tensorboard:

%tensorboard --logdir "$tensorboard_dir"

Start tensorboard from terminal command line

Open a new terminal and change directory to the Fed-BioMed base directory (${FEDBIOMED_DIR}).

Make sure the fedbiomed researcher conda environment is activated:

source ./scripts/fedbiomed_environment researcher

Launch tensorboard with the following command:

tensorboard --logdir "$tensorboard_dir"

Model Hashing and Enabling Model Approval

Fed-BioMed offers an optional training plan approval feature to approve the training plans requested by the researcher. This approval process is based on a hashing/checksum operation performed by the ModelManager of the node instance. When the TRAINING_PLAN_APPROVAL mode is enabled, the node must register/approve training plan files before performing the training. For testing and easy development, Fed-BioMed already provides default training plans for the tutorials in the notebooks directory. However, each node can enable or disable the mode that allows these default training plans to be used for training.

Config file for security parameters

Enabling training plan approval mode, allowing default Fed-BioMed training plans, and the hashing algorithm used for the checksum operation can all be configured from the config file of the node. The following snippet shows an example [security] section of the config file with default values.

[default]
# ....

[security]
hashing_algorithm = SHA256
allow_default_training_plans = True
training_plan_approval = False

By default, when a node is launched for the first time without additional security parameters, training_plan_approval mode is disabled. If training_plan_approval is disabled, the value of allow_default_training_plans has no effect. To enable training plan approval, set training_plan_approval to True; if desired, allow_default_training_plans can be set to False to reject the training plans of the default Fed-BioMed examples.
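The interaction between these two flags can be summarized with a minimal sketch; the function and argument names below are illustrative, not Fed-BioMed's actual API:

# Illustrative sketch (not Fed-BioMed's actual API): how the two
# security flags combine when a node decides whether a plan may run.
def is_plan_allowed(plan_hash: str,
                    approved_hashes: set,
                    default_hashes: set,
                    training_plan_approval: bool,
                    allow_default_training_plans: bool) -> bool:
    if not training_plan_approval:
        # Approval mode disabled: every plan is accepted, so
        # allow_default_training_plans has no effect.
        return True
    if allow_default_training_plans and plan_hash in default_hashes:
        return True
    # Otherwise the plan must have been explicitly registered/approved.
    return plan_hash in approved_hashes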

The default hashing algorithm is SHA256, and it can be changed to any other hashing algorithm provided by Fed-BioMed. The list of hashing algorithms is given in the following section.

Hashing Algorithms

ModelManager provides different hashing algorithms, and the algorithm can be changed through the config file of the node. Algorithm names must be typed in capital letters. After changing the hashing algorithm, the node must be restarted, because it checks/updates the hashes of the registered/default training plans during startup.

The provided hashing algorithms are SHA256, SHA384, SHA512, SHA3_256, SHA3_384, SHA3_512, BLAKE2B and BLAKE2S. These are the algorithms guaranteed by Python's hashlib library.
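As a minimal illustration of the checksum operation (assuming nothing about ModelManager's internals), a training plan file can be hashed with Python's hashlib, which accepts all of the algorithm names above once lowercased:

# Minimal sketch: checksum a training plan file with hashlib.
# Lowercasing the config value (e.g. SHA3_256 -> sha3_256) is an
# assumption about how the names map to hashlib algorithm names.
import hashlib

def hash_training_plan(path: str, algorithm: str = "SHA256") -> str:
    h = hashlib.new(algorithm.lower())
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()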

Starting nodes with different modes

To start a node with training plan approval enabled and default training plans allowed, use the following command.

ENABLE_TRAINING_PLAN_APPROVAL=True ALLOW_DEFAULT_TRAINING_PLANS=True ./scripts/fedbiomed_run node --config config-n1.ini start

This command starts the node with training plan approval activated even if the config file sets training_plan_approval = False; it does not change the config file. If there is no config file named config-n1.ini, it creates one for the node with training plan approval enabled.

[security]
hashing_algorithm = SHA256
allow_default_training_plans = True
training_plan_approval = True


For starting a node with training plan approval and default training plans disabled:

ENABLE_TRAINING_PLAN_APPROVAL=False ALLOW_DEFAULT_TRAINING_PLANS=False ./scripts/fedbiomed_run node --config config-n1.ini start

Default TrainingPlans

Default training plans are located in the envs/common/default_training_plans/ directory as txt files. Each time the node starts with training_plan_approval = True and allow_default_training_plans = True, the hashes of the training plan files are checked to detect whether a file has been modified, the hashing algorithm has changed, or a new training plan file has been added. If training plan files have been modified, ModelManager updates their hashes in the database. If the hashing algorithm of a training plan differs from the active hashing algorithm, its hash is also updated. This process only occurs when both the training_plan_approval and allow_default_training_plans modes are activated. To add a new default training plan for the examples or for testing, save the training plan file as txt and copy it into the envs/common/default_training_plans directory, then restart the node.
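A rough sketch of that startup refresh (the db dict stands in for the node database, and hash_training_plan() is the helper sketched earlier; none of this is the actual implementation):

# Rough sketch of the startup refresh described above; the db dict
# and its fields are illustrative, not the actual node DB schema.
import glob, os

def refresh_default_plans(db: dict, algorithm: str):
    for path in glob.glob("envs/common/default_training_plans/*.txt"):
        name = os.path.basename(path)
        digest = hash_training_plan(path, algorithm)
        entry = db.get(name)
        if entry is None or entry["hash"] != digest or entry["algorithm"] != algorithm:
            # New file, modified file, or hash produced by an older algorithm.
            db[name] = {"hash": digest, "algorithm": algorithm}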

Registering New TrainingPlans

New training plans can be registered using the fedbiomed_run script with the training-plan register option.

./scripts/fedbiomed_run node --config config-n1.ini training-plan register

The CLI asks for the name of the training plan, a description, and the path where the training plan file is stored. Training plan files must be saved as txt in the file system for registration; they are used only for hashing purposes, not for loading modules.

Deleting Registered TrainingPlans

The following command is used for deleting registered training plans.

./scripts/fedbiomed_run node --config config-n1.ini training-plan delete

The output of this command lists the registered training plans with their names and ids, and asks you to select the training plan file you would like to remove. For example, in the following listing, typing 1 will remove MyModel from the registered/approved list of training plans.

Select the training plan to delete:
1) MyModel	 Model ID training_plan_98a1e68d-7938-4889-bc46-357e4ce8b6b5
Select:

Default training plans cannot be removed using the Fed-BioMed CLI. They should be removed from the envs/common/default_training_plans directory. After restarting the node, deleted training plan files will also be removed from the TrainingPlans table of the node DB.

Updating Registered TrainingPlans

The following command is used for updating registered training plans. It updates the chosen training plan with the provided new training plan file. You can also provide the same training plan file to update its content.

./scripts/fedbiomed_run node --config config-n1.ini training-plan update

Fed-BioMed Node GUI

The Node GUI provides an interface for the node to manage datasets and deploy new ones. The GUI consists of two components: a server and a UI. The server is built on the Flask framework, and the UI is built with ReactJS. Flask provides the API services that use Fed-BioMed's DataManager for deploying and managing datasets. All the source files for the GUI are located in the ${FEDBIOMED_DIR}/gui directory.

Starting GUI

The Node GUI can be started using the Fed-BioMed CLI.

${FEDBIOMED_DIR}/scripts/fedbiomed_run node [--config [CONFIG_FILE_NAME]] gui --data-folder '<path-for-data-folder>' start

Arguments:

  • --data-folder: the folder path where datasets are stored. It can be an absolute or relative path; if relative, the Fed-BioMed base directory is used as reference. If --data-folder is not provided, the script looks for a data folder in the Fed-BioMed root directory and raises an error if it does not exist (a small sketch of this resolution follows the list).
  • --config: the name of the configuration file to be used for the GUI. If not provided, the default is config_node.ini.
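A small sketch of the data-folder resolution described above (base_dir stands in for the Fed-BioMed base directory; this is an illustration, not the script's actual code):

# Sketch of the described path resolution; names are illustrative.
import os

def resolve_data_folder(data_folder: str, base_dir: str) -> str:
    path = data_folder if os.path.isabs(data_folder) \
        else os.path.join(base_dir, data_folder)
    if not os.path.isdir(path):
        raise FileNotFoundError(f"data folder not found: {path}")
    return path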

It is also possible to start the GUI on a specific host and port. By default, it starts with localhost as host and 8484 as port. To change them, modify the command as shown below.

The GUI uses HTTPS and, by default, generates a self-signed certificate for you. But you can also start the GUI specifying the certificate and private key names you want to use for HTTPS support. Please note that they must be in the ${FEDBIOMED_DIR}/etc folder.

${FEDBIOMED_DIR}/scripts/fedbiomed_run node --config '<name-of-the-config-file>' gui --data-folder '<path-for-data-folder>' cert '<name-of-certificate>' key '<name-of-private-key>' start

IMPORTANT: Please always consider providing the --data-folder argument when starting the GUI.

${FEDBIOMED_DIR}/scripts/fedbiomed_run node --config config-n1.ini gui --data-folder ../data --port 80 --host 0.0.0.0 start

Details of Start Process

When the Node GUI is started, it installs the npm modules and builds the ReactJS application in ${FEDBIOMED_DIR}/var/gui-build. If the GUI is already built (meaning that the gui/ui/node_modules and var/gui-build folders exist), it does not reinstall and rebuild ReactJS. If you want to reinstall and rebuild, add the --recreate flag to the command, as below:

${FEDBIOMED_DIR}/scripts/fedbiomed_run node gui --data-folder ../data --recreate start

Launching Multiple Node GUI

It is possible to start multiple Node GUIs for different nodes as long as the HTTP ports are different. The commands below start three Node GUIs for the nodes config-n1.ini, config-n2.ini and config-n3.ini on ports 8181, 8282 and 8383 respectively.

${FEDBIOMED_DIR}/scripts/fedbiomed_run node --config config-n1.ini gui --data-folder ../data --port 8181 start
${FEDBIOMED_DIR}/scripts/fedbiomed_run node --config config-n2.ini gui --data-folder ../data --port 8282 start
${FEDBIOMED_DIR}/scripts/fedbiomed_run node --config config-n3.ini gui --data-folder ../data --port 8383 start

Development/Debugging for GUI

If you want to customize or work on the user interface for debugging purposes, it is better to use ReactJS in development mode; otherwise, rebuilding the GUI after every update will take a lot of time. To launch the user interface in development mode, first start the Flask server. This can easily be done with the previous start command; currently, the Flask server always starts in development mode. To enable debug mode, add the --debug flag to the start command.

${FEDBIOMED_DIR}/scripts/fedbiomed_run node --config config-n1.ini gui --data-folder ../data --debug start

Important: Please do not change the Flask port and host when starting it for development purposes, because React (the UI) will call the localhost:8484/api endpoint in development mode.

The command above serves the var/gui-build directory as well as the API services, which means the user interface is available at localhost:8484. This user interface is not updated automatically because it is already built. To get dynamic updates of the user interface, start React with npm start:

source ${FEDBIOMED_DIR}/scripts/fedbiomed_environment gui
cd ${FEDBIOMED_DIR}/gui/ui
npm start

After that, if you go to localhost:3000, you will see the same user interface up and running for development. When you change the source code in ${FEDBIOMED_DIR}/gui/ui/src, it is dynamically updated on localhost:3000.

Since Flask is already started in debug mode, you can develop/update/change the server side (Flask) in ${FEDBIOMED_DIR}/gui/server. The React part (UI) in development mode calls the API endpoint on localhost:8484, which is why you should start the Flask server first.

After development/debugging is done, update the built GUI by starting the GUI with the --recreate flag. Afterward, you will see your changes at the localhost:8484 URL, which serves the built UI files.

${FEDBIOMED_DIR}/scripts/fedbiomed_run node gui --data-folder ../data --recreate start

Secure Aggregation Setup: Dev

Fed-BioMed uses MP-SPDZ to provide secure aggregation of the model parameters. Running secure aggregation in Fed-BioMed is optional, which makes the MP-SPDZ installation/configuration optional as well. Fed-BioMed can run FL experiments without MP-SPDZ as long as secure aggregation is not activated on the node and researcher components.

Configuring MP-SPDZ

Configuration or installation can be done with the following command by specifying the Fed-BioMed component. If the node and the researcher will be started from the same clone of Fed-BioMed, running the following command once (for either node or researcher) is enough. For macOS, the operating system (Darwin) should be newer than High Sierra (10.13).

${FEDBIOMED_DIR}/scripts/fedbiomed_configure_secagg (node|researcher)

Running MP-SPDZ protocols

The MP-SPDZ protocols for secure aggregation and multi-party computation are executed internally by the Fed-BioMed node and researcher components. The script for executing the protocols is located in ${FEDBIOMED_DIR}/scripts/fedbiomed_mpc. Run the following commands to see instructions and usage.

${FEDBIOMED_DIR}/scripts/fedbiomed_mpc (node | researcher) --help
${FEDBIOMED_DIR}/scripts/fedbiomed_mpc (node | researcher) WORKDIR compile --help
${FEDBIOMED_DIR}/scripts/fedbiomed_mpc (node | researcher) WORKDIR exec --help
${FEDBIOMED_DIR}/scripts/fedbiomed_mpc (node | researcher) WORKDIR shamir-server-key --help

fedbiomed's People

Contributors

angeacn, ayedsamy, clebreto, erwandemairy, ibalelli, j-l-s, lchambon, lena-le-quintrec, mvesin, pandrey-fr, rtaiello, sharkovsky, srcansiz, sssilvar, tkloczko, ybouilla


fedbiomed's Issues

SP7-item5 : "phase 3" - Synchronous Training Experiment Resuming

In GitLab by @sssilvar on Jun 22, 2021, 10:33

Experiment should be able to resume from the last round (checkpoint) where it was successful.

Usage

# Example of failure due to client timeout
In []: experiment.run()
Out []: RuntimeError "Client not responding (Timeout Error)"

# Resume training (executing again)
In []: experiment.run()
Out []: warning: "Resuming experiment from round X..."
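A hypothetical sketch of the requested behavior (class, file and key names are illustrative): persist the last completed round so that a second run() continues from the checkpoint instead of restarting:

# Hypothetical sketch of the requested resume logic; the state file
# and attribute names are illustrative, not an actual API.
import json, os

class ResumableExperiment:
    def __init__(self, rounds: int, state_file: str = "experiment_state.json"):
        self.rounds = rounds
        self.state_file = state_file

    def run(self):
        start = 0
        if os.path.exists(self.state_file):
            with open(self.state_file) as f:
                start = json.load(f)["last_completed_round"] + 1
            print(f"Resuming experiment from round {start}...")
        for round_i in range(start, self.rounds):
            self._run_round(round_i)  # may raise on client timeout
            with open(self.state_file, "w") as f:
                json.dump({"last_completed_round": round_i}, f)

    def _run_round(self, round_i: int):
        ...  # trigger node training, aggregate, update parameters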

SP5-SP6_item3 : MNIST federated training convergence problem

In GitLab by @mvesin on Jun 16, 2021, 12:33

Hints :

[...]
Launching node...
     - Starting communication channel with network...
[...]
# For round 1 we use dry_run so no real training occurs
Train Epoch: 1 [0/60000 (0%)]    Loss: 2.291022
Uploading model parameters to fc8303d2-aad2-46e9-8c63-2f58abdff401.pt
[...]
# For round 2 and 3 we do not use dry_run
[INFO] Training on dataset: /data/mvesin/data
Train Epoch: 1 [0/60000 (0%)]    Loss: 2.278209
Train Epoch: 1 [480/60000 (1%)]    Loss: 1.013210
Train Epoch: 1 [960/60000 (2%)]    Loss: 0.590519
Train Epoch: 1 [1440/60000 (2%)]    Loss: 0.542856
Train Epoch: 1 [1920/60000 (3%)]    Loss: 0.491376
....
Train Epoch: 1 [59040/60000 (98%)]    Loss: 0.068433
Train Epoch: 1 [59520/60000 (99%)]    Loss: 0.100257
Uploading model parameters to 2b55e6f1-c861-4a45-a166-c35a58397d12.pt
# Looks more or less OK for round 2: loss is decreasing until a certain point
[...]
# But for round 3, loss restarts from initial value : is this normal ???
[INFO] Training on dataset: /data/mvesin/data
Train Epoch: 1 [0/60000 (0%)]    Loss: 2.338098
Train Epoch: 1 [480/60000 (1%)]    Loss: 1.019898
Train Epoch: 1 [960/60000 (2%)]    Loss: 0.952290
Train Epoch: 1 [1440/60000 (2%)]    Loss: 0.761609
Train Epoch: 1 [1920/60000 (3%)]    Loss: 0.493840
...
Train Epoch: 1 [58560/60000 (98%)]    Loss: 0.050281
Train Epoch: 1 [59040/60000 (98%)]    Loss: 0.140764
Train Epoch: 1 [59520/60000 (99%)]    Loss: 0.095856
# Not better after round 3: not enough data to converge, or a bug?

SP7-item6 : measure execution time

In GitLab by @mvesin on Jul 5, 2021, 09:26

  • local execution time of a training function on a node (real time, process time)
  • total execution time for a training request on a dataset on a node (a minimal measurement sketch follows)
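A minimal sketch of capturing both measurements around a training call, using only the standard library (train() is a placeholder callable, not Fed-BioMed code):

# Minimal sketch: measure real (wall-clock) and process time around
# a training function; train() is a placeholder callable.
import time

def timed_training(train, *args, **kwargs):
    t_real, t_proc = time.perf_counter(), time.process_time()
    result = train(*args, **kwargs)
    timings = {
        "real_time": time.perf_counter() - t_real,      # wall-clock seconds
        "process_time": time.process_time() - t_proc,   # CPU seconds
    }
    return result, timings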

setup CI first version - [merged]

In GitLab by @mvesin on Jun 30, 2021, 12:36

Merges feature/test_ci -> develop

  • configure/clean environment for CI
  • first payload : integration test running the simplenet/fedavg/MNIST test with 1 client on a few batches; does not check the accuracy of the resulting model

SP10-item1 : add containers and vpn

In GitLab by @mvesin on May 12, 2021, 18:22

Build containers for

  • fedbiomed-node
  • fedbiomed-researcher
  • mqtt
  • http

with WireGuard VPN support, based on Fed-BioMed v2 VPN'ization

SP5_SP6-item6 : local training

In GitLab by @mvesin on May 12, 2021, 18:26

  • implement local training
  • compare with fed training (accuracy)
  • [ ] compare federated training with local training in CI test case

SP5_SP6-item6 : file repo organization

In GitLab by @mvesin on May 12, 2021, 18:31

  • add structure to django file repo : (per researcher), per client, per job
  • clean old files
  • move results of job from file repo to researcher side

SP5-item1 : rewrite imports

In GitLab by @massal on May 6, 2021, 12:56

Try to follow PEP-8, including (a short example follows this list):

  • remove relative paths
  • imports at the top of the file
  • ordered: standard / 3rd-party libs / application

Other :

  • avoid code execution in __init__.py
  • keep part of the import path (e.g. from xxx import yyy or import xxx.yyy rather than from xxx.yyy import func1, func2)
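A short example of the intended style (module names are illustrative):

# Preferred: imports at the top of the file, ordered as
# standard library / 3rd-party libs / application, absolute paths,
# keeping part of the import path.
import os
import sys

import torch

from fedbiomed.common import logger

# Avoided: relative paths and flat function imports, e.g.
#   from ..common.logger import log_info, log_error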

SP5_SP6-item1 : clean execution in dev env

In GitLab by @massal on May 6, 2021, 13:08

On fedbiomed-researcher

  • separate code from config file
  • separate code from db, queue (var directory ?)
  • documentation and/or command for cleaning environment (config files, db, queue, cache torchhub)

On fedbiomed-node :

  • separate code from config file
  • separate code from db, queue (var directory ?)
  • documentation and/or command for cleaning environment (config files, db, queue, cache torchhub)

notes centralization

In GitLab by @jsaray on Jun 21, 2021, 17:15

It would be good to centralize all notes taken in meetings on a website that everyone can review (Inria Box for example, or any other network file system).

DefaultStrategy class

In GitLab by @jsaray on Jun 22, 2021, 16:29

Code the class DefaultStrategy as shown in "Implement Experiment Class". This class is the simplest case: there is no sampling (all clients are chosen), and it should abort if any of the clients don't return. This task is a specialization of #38 after developing the Experiment pseudo-code (a minimal sketch follows).
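A minimal sketch of the described behavior (method names follow the pseudo-code in "Implement Experiment Class"; reply keys and error type are illustrative):

# Minimal sketch of DefaultStrategy: no sampling (all clients chosen),
# abort if any selected client does not return a training reply.
class DefaultStrategy:
    def sample_clients(self, data: dict) -> list:
        # No sampling: every client holding matching data is selected.
        return list(data.keys())

    def refine(self, training_replies: dict, selected: list):
        missing = [c for c in selected if c not in training_replies]
        if missing:
            raise RuntimeError(f"Aborting: no reply from clients {missing}")
        # Reply dict keys below are illustrative.
        params = [r["params"] for r in training_replies.values()]
        weights = [r["n_samples"] for r in training_replies.values()]
        return params, weights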

SP5_SP6-item1 : code architecture

In GitLab by @massal on May 6, 2021, 12:57

  • create classes in modules
  • avoid code outside of classes in modules
  • group code in classes (eg: build message contents only in messaging/node classes, not in repository/json)
  • rename modules (eg: mqtt to message or messaging)
  • path structure for modules (eg rename fedbiomed_cli to fedbiomed/node, etc.)
  • [ ] private methods/variables for modules
  • others ?

SP5_SP6-item0 : gitlab repositories

In GitLab by @mvesin on May 12, 2021, 17:11

(Re)define and implement gitlab repositories :

  • merge or rebalance code between fedbiomed-node and fedbiomed-researcher ?
  • add private repo for Inria experiments (configs, datasets, results) ?

SP5_SP6-item0 : environments

In GitLab by @mvesin on May 12, 2021, 16:52

Describe and implement environments matching the life cycle of the application :

  • development 1 : localhost, conda
  • [ ] development 2 : docker, vpn, localhost
  • [ ] preprod : docker, vpn, server, test clients, test data
  • [ ] prod : docker, vpn, server, real clients, real data

Implement Experiment Class

In GitLab by @sssilvar on Jun 22, 2021, 11:32

An experiment class used by the researcher to train a model using Federated Learning:

Usage

# Include it at the moment of defining the experiment
class Experiment:
    def __init__(self,
                 tags: list,
                 model_class: fedbiomed.common.Torchnn.Module,
                 model_args: Dict,  # {'layers': 4, ...}
                 training_args: Dict, # {'epochs': 15, 'lr': 1e-3 ...}
                 rounds: int,
                 aggregator: fedbiomed.researcher.aggregators.Aggregator,
                 client_selection_strategy: fedbiomed.researcher.strategy.Strategy = None # default: None
                 ):

        # TODO: FederatedDataset class
        self.data = search_data(tags)  # {'client_id1': [{data1_id: ,...},{data1_2d: ,...}], ...}

        # Create job
        # TODO: refactor Job should not retrieve data/clients
        self.job = Job( model=model_class,
                        model_args=model_args,
                        training_args=training_args)
        
        # Define aggregator and client selection strategy
        self.aggregator = aggregator

        if client_selection_strategy is None:
            # Wait for all to share training results
            # Default behavior: Raise error with any failure
            self.client_selection_strategy = DefaultClientSelectionStrategy()
        else:
            self.client_selection_strategy = client_selection_strategy

        self.rounds = rounds
        self.last_updated_params_url = None

    def run(self, sync=True):
        if not sync:
            raise NotImplementedError("One day....")
        
        # Run experiment
        for round_i in range(self.rounds):
            # Sample clients using strategy (if given)
            self.job.clients = self.client_selection_strategy.sample_clients(self.data) #self.job.clients in self.data

            # Trigger training round on sampled clients
            self.job.start_clients_training_round(round=round_i)

            # Assert/refine strategy for the current round
            model_params, weights = self.client_selection_strategy.refine(self.job.training_replies[round_i])
        
            # Aggregate
            aggregated_params = self.aggregator.aggregate(model_params, weights)

            # Make it available for clients
            self.last_updated_params_url = self.job.update_parameters(aggregated_params)

Attributes (components)

  • logger (could be a separated ExperimentLogger class): stores the output of the clients during the experiment (issue to be opened)
  • strategy: For client sampling and weighting during experiment #38
  • federator: For model aggregation/combination

Method

  • run(sync=True): executes the experiment, by default synchronously (async is not yet considered a priority)

Error handling

intermediate CI script - [merged]

In GitLab by @mvesin on Jun 28, 2021, 16:11

Merges feature/test_ci -> develop

  • CI script configures/cleans environment, docker, conda on slave

  • ... but no payload yet

  • plus minor corrections (typo in README.md, no docker container stopped in fedbiomed_environment clean)

Strategy Client selection

In GitLab by @sssilvar on Jun 22, 2021, 10:10

One of the problems of fairly learning a model in Federated Learning is how to pick the right clients to learn a model. Sometimes, aggregating all the models coming from the clients can lead to biased results.
NOTE: Synchronous training is enough for now

Therefore, it is necessary to have a ClientSelectionStrategy class. This class would be in charge of how to dynamically trigger client training (and aggregating their local models) at each round.

  • This Strategy class should be defined at the beginning of an experiment/Job
  • When not defined for an experiment/job, the default behavior is to aggregate all clients
  • Researchers should be able to propose their own strategy by extending the class
  • Fedbiomed should provide some default strategies:
    • UniformSamplingClientStrategy
    • MultiNomialSamplingClientStrategy

Usage

At the definition of the experiment/job:

# Define a Strategy for client selection before starting an experiment
strategy = UniformSamplingClientStrategy()

# Define aggregator
fedavg = FedAverage()

# Include it at the moment of defining the experiment
experiment = Experiment(model_class=Net,
                 training_function=train,
                 model_args=model_args,
                 training_args=training_args,
                 rounds=10,  # Future functionality: Add rounds inside job definition
                 on_data=data,
                 aggregator=fedavg,  # Future functionality: include aggregator
                 client_selection_strategy=strategy # default: None
)

# Run Experiment (default synchronous)
# currently experiment fails if researcher fails/timeouts or infra (file repo/mqtt) fails
experiment.run()

Functionalities

  • Strategies need to keep track of the clients that were selected at each round (a minimal sketch follows)
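A minimal sketch of such a strategy, assuming the interface from the Experiment pseudo-code (nothing here is Fed-BioMed's actual implementation):

# Minimal sketch: uniform client sampling that records which clients
# were selected at each round, as required above.
import random

class UniformSamplingClientStrategy:
    def __init__(self, fraction: float = 0.5, seed: int = None):
        self.fraction = fraction
        self.rng = random.Random(seed)
        self.history = {}  # round number -> selected client ids

    def sample_clients(self, data: dict, round_i: int) -> list:
        clients = list(data.keys())
        k = max(1, int(len(clients) * self.fraction))
        selected = self.rng.sample(clients, k)
        self.history[round_i] = selected
        return selected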

SP7-item8 : security clearance for phase 2

In GitLab by @mvesin on Jun 21, 2021, 16:19

work with DPO and security officer for clearance for phase 2 clinical experiment :

  • partner CAL (and 2nd partner to be defined ?)
  • real pseudonymized medical data

SP5_SP6-item1 : re-organize fedbiomed-network

In GitLab by @mvesin on May 12, 2021, 16:48

  • move application tests (vs unit test) from node and researcher to network
  • move default (public) application configurations from node and researcher to network
  • [ ] re-write mqtt container configuration
  • [ ] re-write django container configuration
  • other ?
