
fedbiomed's Issues

SP10-item1 : add containers and vpn

In GitLab by @mvesin on May 12, 2021, 18:22

Build containers for

  • fedbiomed-node
  • fedbiomed-researcher
  • mqtt
  • http

with wireguard VPN support, based on the Fed-BioMed v2 VPN'ization

setup CI first version - [merged]

In GitLab by @mvesin on Jun 30, 2021, 12:36

Merges feature/test_ci -> develop

  • configure/clean environment for CI
  • first payload : integration test running the simplenet/fedavg/MNIST test with 1 client on a few batches; does not check the accuracy of the resulting model

SP5_SP6-item0 : environments

In GitLab by @mvesin on May 12, 2021, 16:52

Describe and implement environments matching the life cycle of the application :

  • development 1 : localhost, conda
  • [ ] development 2 : docker, vpn, localhost
  • [ ] preprod : docker, vpn, server, test clients, test data
  • [ ] prod : docker, vpn, server, real clients, real data

SP7-item8 : security clearance for phase 2

In GitLab by @mvesin on Jun 21, 2021, 16:19

Work with the DPO and security officer to obtain clearance for the phase 2 clinical experiment :

  • partner CAL (and 2nd partner to be defined ?)
  • real pseudonymized medical data

SP7-item6 : measure execution time

In GitLab by @mvesin on Jul 5, 2021, 09:26

  • local execution time of a training function on a node (real time, process time)
  • total execution time for a training request on a dataset on a node
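Both measurements can be taken with the standard library; a minimal sketch (the `timed_training` wrapper is hypothetical, not part of the current code base):

```python
import time

def timed_training(train_fn, *args, **kwargs):
    """Run a training callable and measure both requested clocks."""
    real_start = time.perf_counter()    # wall-clock ("real") time
    proc_start = time.process_time()    # CPU ("process") time
    result = train_fn(*args, **kwargs)
    timings = {
        'real_time': time.perf_counter() - real_start,
        'process_time': time.process_time() - proc_start,
    }
    return result, timings
```

The same wrapper could time a single training function (first bullet) or a whole training request on a dataset (second bullet), depending on where it is applied.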

SP5-item1 : rewrite imports

In GitLab by @massal on May 6, 2021, 12:56

Try to follow PEP 8, including :

  • remove relative imports
  • imports at top of file
  • ordered : standard library / 3rd party libs / application

Other :

  • avoid code execution in __init__.py
  • keep part of import path (eg from xxx import yyy or import xxx.yyy rather than from xxx.yyy import func1, func2)
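As an illustration of the last point, with a stdlib module standing in for the fedbiomed ones:

```python
# Preferred: keep part of the import path so the origin of a name
# stays visible at the call site (os.path stands in for e.g. fedbiomed.common.messaging)
import os.path

joined = os.path.join("a", "b")   # clearly comes from os.path

# Discouraged: importing bare names hides their origin
# from os.path import join, split
```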

DefaultStrategy class

In GitLab by @jsaray on Jun 22, 2021, 16:29

Code the class DefaultStrategy as shown in "Implement Experiment Class". This class is the simplest case: there is no sampling (all clients are chosen), and it should abort if any of the clients don't return. This task is a specialization of #38, after developing the Experiment pseudo-code.
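A minimal sketch of what such a class could look like (the `Strategy` base class is pending #38, and the training-reply format used here is an assumption):

```python
class Strategy:
    """Stand-in for the base class to be defined in #38."""

class DefaultStrategy(Strategy):
    """Simplest strategy: no sampling (all clients are chosen),
    abort if any selected client does not return."""

    def sample_clients(self, data):
        # data: {'client_id1': [{dataset descriptions}, ...], ...}
        self.selected = list(data.keys())
        return self.selected

    def refine(self, training_replies):
        # training_replies: {client_id: {'params': ..., 'n_samples': ...}}
        # (this reply format is an assumption)
        missing = set(self.selected) - set(training_replies)
        if missing:
            raise RuntimeError(f"Clients did not return: {sorted(missing)}")
        params = [r['params'] for r in training_replies.values()]
        weights = [r['n_samples'] for r in training_replies.values()]
        return params, weights
```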

SP5_SP6-item1 : code architecture

In GitLab by @massal on May 6, 2021, 12:57

  • create classes in modules
  • avoid code outside of classes in modules
  • group code in classes (eg: build message contents only in messaging/node classes, not in repository/json)
  • rename modules (eg: mqtt to message or messaging)
  • path structure for modules (eg rename fedbiomed_cli to fedbiomed/node, etc.)
  • [ ] private methods/variables for modules
  • others ?

Implement Experiment Class

In GitLab by @sssilvar on Jun 22, 2021, 11:32

An experiment class used by the researcher to train a model using Federated Learning:

Usage

# Include it at the moment of defining the experiment
class Experiment:
    def __init__(self,
                 tags: list,
                 model_class: fedbiomed.common.Torchnn.Module,
                 model_args: Dict,  # {'layers': 4, ...}
                 training_args: Dict, # {'epochs': 15, 'lr': 1e-3 ...}
                 rounds: int,
                 aggregator: fedbiomed.researcher.aggregators.Aggregator,
                 client_selection_strategy: fedbiomed.researcher.strategy.Strategy = None # default: None
                 ):

        # TODO: FederatedDataset class
        self.data = search_data(tags)  # {'client_id1': [{data1_id: ,...},{data1_2d: ,...}], ...}

        # Create job
        # TODO: refactor Job should not retrieve data/clients
        self.job = Job(model=model_class,
                       model_args=model_args,
                       training_args=training_args)
        
        # Define aggregator and client selection strategy
        self.aggregator = aggregator

        if client_selection_strategy is None:
            # Wait for all to share training results
            # Default behavior: Raise error with any failure
            self.client_selection_strategy = DefaultClientSelectionStrategy()
        else:
            self.client_selection_strategy = client_selection_strategy

        self.rounds = rounds
        self.last_updated_params_url = None

    def run(self, sync=True):
        if not sync:
            raise NotImplementedError("One day....")
        
        # Run experiment
        for round_i in range(self.rounds):
            # Sample clients using strategy (if given)
            self.job.clients = self.client_selection_strategy.sample_clients(self.data) #self.job.clients in self.data

            # Trigger training round on sampled clients
            self.job.start_clients_training_round(round=round_i)

            # Assert/refine strategy for the current round
            model_params, weights = self.client_selection_strategy.refine(self.job.training_replies[round_i])
        
            # Aggregate
            aggregated_params = self.aggregator.aggregate(model_params, weights)

            # Make it available for clients
            self.last_updated_params_url = self.job.update_parameters(aggregated_params)

Attributes (components)

  • logger (could be a separated ExperimentLogger class): stores the output of the clients during the experiment (issue to be opened)
  • strategy: For client sampling and weighting during experiment #38
  • federator: For model aggregation/combination

Method

  • run(sync=True): executes the experiment, synchronously by default (async is not yet considered a priority)

Error handling

SP5-SP6_item3 : MNIST federated training convergence problem

In GitLab by @mvesin on Jun 16, 2021, 12:33

Hints :

[...]
Launching node...
     - Starting communication channel with network...
[...]
# For round 1 we use dry_run so no real training occurs
Train Epoch: 1 [0/60000 (0%)]    Loss: 2.291022
Uploading model parameters to fc8303d2-aad2-46e9-8c63-2f58abdff401.pt
[...]
# For round 2 and 3 we do not use dry_run
[INFO] Training on dataset: /data/mvesin/data
Train Epoch: 1 [0/60000 (0%)]    Loss: 2.278209
Train Epoch: 1 [480/60000 (1%)]    Loss: 1.013210
Train Epoch: 1 [960/60000 (2%)]    Loss: 0.590519
Train Epoch: 1 [1440/60000 (2%)]    Loss: 0.542856
Train Epoch: 1 [1920/60000 (3%)]    Loss: 0.491376
....
Train Epoch: 1 [59040/60000 (98%)]    Loss: 0.068433
Train Epoch: 1 [59520/60000 (99%)]    Loss: 0.100257
Uploading model parameters to 2b55e6f1-c861-4a45-a166-c35a58397d12.pt
# Looks more or less OK for round 2, loss is decreasing until a certain point
[...]
# But for round 3, loss restarts from initial value : is this normal ???
[INFO] Training on dataset: /data/mvesin/data
Train Epoch: 1 [0/60000 (0%)]    Loss: 2.338098
Train Epoch: 1 [480/60000 (1%)]    Loss: 1.019898
Train Epoch: 1 [960/60000 (2%)]    Loss: 0.952290
Train Epoch: 1 [1440/60000 (2%)]    Loss: 0.761609
Train Epoch: 1 [1920/60000 (3%)]    Loss: 0.493840
...
Train Epoch: 1 [58560/60000 (98%)]    Loss: 0.050281
Train Epoch: 1 [59040/60000 (98%)]    Loss: 0.140764
Train Epoch: 1 [59520/60000 (99%)]    Loss: 0.095856
# No improvement after round 3 : not enough data to converge, or a bug ?
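One hypothesis worth checking: the loss resetting at round 3 looks like the node training from the initial weights instead of the aggregated ones. Schematically, the node must overwrite its local parameters before each round (plain dicts stand in for model state here; the helper name is hypothetical):

```python
def start_round(local_params, aggregated_params):
    """Replace local model parameters with the freshly aggregated ones
    before training. If this step is skipped (or loads the wrong file),
    every round restarts from the initial weights, and the loss curve
    resets exactly as observed in the logs above."""
    local_params.clear()
    local_params.update(aggregated_params)
    return local_params
```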

SP5_SP6-item1 : re-organize fedbiomed-network

In GitLab by @mvesin on May 12, 2021, 16:48

  • move application tests (vs unit test) from node and researcher to network
  • move default (public) application configurations from node and researcher to network
  • [ ] re-write mqtt container configuration
  • [ ] re-write django container configuration
  • other ?

SP7-item5 : "phase 3" - Synchronous Training Experiment Resuming

In GitLab by @sssilvar on Jun 22, 2021, 10:33

Experiment should be able to resume from the last round (checkpoint) where it was successful.

Usage

# Example of failure due to client timeout
In []: experiment.run()
Out []: RuntimeError "Client not responding (Timeout Error)"

# Resume training (executing again)
In []: experiment.run()
Out []: warning: "Resuming experiment from round X..."
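A sketch of the checkpointing logic behind this behavior (class and attribute names are hypothetical; in practice `completed_rounds` would be persisted to disk, not kept in memory):

```python
class ResumableExperiment:
    """Sketch: re-running after a failure resumes at the last good round."""

    def __init__(self, rounds):
        self.rounds = rounds
        self.completed_rounds = 0      # checkpoint; persisted in practice

    def run(self):
        if self.completed_rounds > 0:
            print(f"warning: Resuming experiment from round {self.completed_rounds}...")
        for round_i in range(self.completed_rounds, self.rounds):
            self._run_round(round_i)   # may raise on client timeout
            self.completed_rounds = round_i + 1

    def _run_round(self, round_i):
        pass                           # placeholder for a real training round
```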

SP5_SP6-item6 : local training

In GitLab by @mvesin on May 12, 2021, 18:26

  • implement local training
  • compare with fed training (accuracy)
  • [ ] compare federated training with local training in CI test case

SP5_SP6-item6 : file repo organization

In GitLab by @mvesin on May 12, 2021, 18:31

  • add structure to django file repo : (per researcher), per client, per job
  • clean old files
  • move results of job from file repo to researcher side

Strategy Client selection

In GitLab by @sssilvar on Jun 22, 2021, 10:10

One of the problems of fairly learning a model in Federated Learning is how to pick the right clients to learn a model. Sometimes, aggregating all the models coming from the clients can lead to biased results.
NOTE: Synchronous training is enough for now

Therefore, it is necessary to have a ClientSelectionStrategy class. This class would be in charge of how to dynamically trigger client training (and aggregating their local models) at each round.

  • This Strategy class should be defined at the beginning of an experiment/Job
  • When none is defined for an experiment/job, the default behavior is to aggregate all clients
  • Researchers should be able to propose their own strategy by extending the class
  • Fedbiomed should provide some default strategies:
    • UniformSamplingClientStrategy
    • MultiNomialSamplingClientStrategy

Usage

At the definition of the experiment/job:

# Define a Strategy for client selection before starting an experiment
strategy = UniformSamplingClientStrategy()

# Define aggregator
fedavg = FedAverage()

# Include it at the moment of defining the experiment
experiment = Experiment(model_class=Net,
                 training_function=train,
                 model_args=model_args,
                 training_args=training_args,
                 rounds=10,  # Future functionality: Add rounds inside job definition
                 on_data=data,
                 aggregator=fedavg,  # Future functionality: include aggregator
                 client_selection_strategy=strategy # default: None
)

# Run Experiment (default synchronous)
# currently experiment fails if researcher fails/timeouts or infra (file repo/mqtt) fails
experiment.run()

Functionalities

  • Strategies need to keep track of the clients that were selected at each round
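A sketch of how one of the proposed defaults could keep that per-round record (the `fraction` parameter and the history format are assumptions):

```python
import random

class UniformSamplingClientStrategy:
    """Sample a fraction of clients uniformly at random each round,
    keeping a record of who was selected at each round."""

    def __init__(self, fraction=0.5, seed=None):
        self.fraction = fraction
        self.rng = random.Random(seed)
        self.history = []               # one list of selected clients per round

    def sample_clients(self, data):
        # data: {client_id: [dataset descriptions], ...}
        clients = sorted(data)
        k = max(1, round(len(clients) * self.fraction))
        selected = self.rng.sample(clients, k)
        self.history.append(selected)
        return selected
```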

notes centralization

In GitLab by @jsaray on Jun 21, 2021, 17:15

It would be good to centralize all notes taken in meetings in a place that everyone can review
(Inria box for example, or any other network file system).

SP5_SP6-item0 : gitlab repositories

In GitLab by @mvesin on May 12, 2021, 17:11

(Re)define and implement gitlab repositories :

  • merge or rebalance code between fedbiomed-node and fedbiomed-researcher ?
  • add private repo for Inria experiments (configs, datasets, results) ?

SP5_SP6-item1 : clean execution in dev env

In GitLab by @massal on May 6, 2021, 13:08

On fedbiomed-researcher

  • separate code from config file
  • separate code from db, queue (var directory ?)
  • documentation and/or command for cleaning environment (config files, db, queue, cache torchhub)

On fedbiomed-node :

  • separate code from config file
  • separate code from db, queue (var directory ?)
  • documentation and/or command for cleaning environment (config files, db, queue, cache torchhub)

intermediate CI script - [merged]

In GitLab by @mvesin on Jun 28, 2021, 16:11

Merges feature/test_ci -> develop

  • CI script configures/cleans environment, docker, conda on slave

  • ... but no payload yet

  • plus minor corrections (typo in README.md, no docker container stopped in fedbiomed_environment clean)
