
fedbiomed's Issues

SP10-item1 : add containers and vpn

In GitLab by @mvesin on May 12, 2021, 18:22

Build containers for

  • fedbiomed-node
  • fedbiomed-researcher
  • mqtt
  • http

with wireguard VPN support, based on the Fed-BioMed v2 VPN'ization

setup CI first version - [merged]

In GitLab by @mvesin on Jun 30, 2021, 12:36

Merges feature/test_ci -> develop

  • configure/clean environment for CI
  • first payload : integration test running the simplenet/fedavg/MNIST test with 1 client on a few batches; does not check the accuracy of the resulting model

SP5_SP6-item0 : environments

In GitLab by @mvesin on May 12, 2021, 16:52

Describe and implement environments matching the life cycle of the application :

  • development 1 : localhost, conda
  • [ ] development 2 : docker, vpn, localhost
  • [ ] preprod : docker, vpn, server, test clients, test data
  • [ ] prod : docker, vpn, server, real clients, real data

SP7-item8 : security clearance for phase 2

In GitLab by @mvesin on Jun 21, 2021, 16:19

Work with the DPO and security officer to obtain clearance for the phase 2 clinical experiment :

  • partner CAL (and 2nd partner to be defined ?)
  • real pseudonymized medical data

SP7-item6 : measure execution time

In GitLab by @mvesin on Jul 5, 2021, 09:26

  • local execution time of a training function on a node (real time, process time)
  • total execution time for a training request on a dataset on a node
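Both measurements can be taken with the standard library; a minimal sketch (the `timed_training` wrapper is hypothetical, not part of the current code base):

```python
import time

def timed_training(train_fn, *args, **kwargs):
    """Run a training callable and measure both requested clocks."""
    real_start = time.perf_counter()    # wall-clock ("real") time
    proc_start = time.process_time()    # CPU ("process") time
    result = train_fn(*args, **kwargs)
    timings = {
        'real_time': time.perf_counter() - real_start,
        'process_time': time.process_time() - proc_start,
    }
    return result, timings
```

The same wrapper could time a single training function (first bullet) or a whole training request on a dataset (second bullet), depending on where it is applied.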

SP5-item1 : rewrite imports

In GitLab by @massal on May 6, 2021, 12:56

Try to follow PEP 8, including :

  • remove relative imports
  • imports at top of file
  • ordered : standard library / 3rd party libs / application

Other :

  • avoid code execution in __init__.py
  • keep part of import path (eg from xxx import yyy or import xxx.yyy rather than from xxx.yyy import func1, func2)
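As an illustration of the last point, with a stdlib module standing in for the fedbiomed ones:

```python
# Preferred: keep part of the import path so the origin of a name
# stays visible at the call site (os.path stands in for e.g. fedbiomed.common.messaging)
import os.path

joined = os.path.join("a", "b")   # clearly comes from os.path

# Discouraged: importing bare names hides their origin
# from os.path import join, split
```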

DefaultStrategy class

In GitLab by @jsaray on Jun 22, 2021, 16:29

Code the class DefaultStrategy as shown in "Implement Experiment Class". This class is the simplest case: there is no sampling (all clients are chosen), and it should abort if any of the clients don't return. This task is a specialization of #38, after developing the Experiment pseudo-code.
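A minimal sketch of what such a class could look like (the `Strategy` base class is pending #38, and the training-reply format used here is an assumption):

```python
class Strategy:
    """Stand-in for the base class to be defined in #38."""

class DefaultStrategy(Strategy):
    """Simplest strategy: no sampling (all clients are chosen),
    abort if any selected client does not return."""

    def sample_clients(self, data):
        # data: {'client_id1': [{dataset descriptions}, ...], ...}
        self.selected = list(data.keys())
        return self.selected

    def refine(self, training_replies):
        # training_replies: {client_id: {'params': ..., 'n_samples': ...}}
        # (this reply format is an assumption)
        missing = set(self.selected) - set(training_replies)
        if missing:
            raise RuntimeError(f"Clients did not return: {sorted(missing)}")
        params = [r['params'] for r in training_replies.values()]
        weights = [r['n_samples'] for r in training_replies.values()]
        return params, weights
```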

SP5_SP6-item1 : code architecture

In GitLab by @massal on May 6, 2021, 12:57

  • create classes in modules
  • avoid code outside of classes in modules
  • group code in classes (eg: build message contents only in messaging/node classes, not in repository/json)
  • rename modules (eg: mqtt to message or messaging)
  • path structure for modules (eg rename fedbiomed_cli to fedbiomed/node, etc.)
  • [ ] private methods/variables for modules
  • others ?

Implement Experiment Class

In GitLab by @sssilvar on Jun 22, 2021, 11:32

An experiment class used by the researcher to train a model using Federated Learning:

Usage

# Include it at the moment of defining the experiment
class Experiment:
    def __init__(self,
                 tags: list,
                 model_class: fedbiomed.common.Torchnn.Module,
                 model_args: Dict,  # {'layers': 4, ...}
                 training_args: Dict, # {'epochs': 15, 'lr': 1e-3 ...}
                 rounds: int,
                 aggregator: fedbiomed.researcher.aggregators.Aggregator,
                 client_selection_strategy: fedbiomed.researcher.strategy.Strategy = None # default: None
                 ):

        # TODO: FederatedDataset class
        self.data = search_data(tags)  # {'client_id1': [{data1_id: ,...},{data1_2d: ,...}], ...}

        # Create job
        # TODO: refactor Job should not retrieve data/clients
        self.job = Job(model=model_class,
                       model_args=model_args,
                       training_args=training_args)
        
        # Define aggregator and client selection strategy
        self.aggregator = aggregator

        if client_selection_strategy is None:
            # Wait for all to share training results
            # Default behavior: Raise error with any failure
            self.client_selection_strategy = DefaultClientSelectionStrategy()
        else:
            self.client_selection_strategy = client_selection_strategy

        self.rounds = rounds
        self.last_updated_params_url = None

    def run(self, sync=True):
        if not sync:
            raise NotImplementedError("One day....")
        
        # Run experiment
        for round_i in range(self.rounds):
            # Sample clients using strategy (if given)
            self.job.clients = self.client_selection_strategy.sample_clients(self.data) #self.job.clients in self.data

            # Trigger training round on sampled clients
            self.job.start_clients_training_round(round=round_i)

            # Assert/refine strategy for the current round
            model_params, weights = self.client_selection_strategy.refine(self.job.training_replies[round_i])
        
            # Aggregate
            aggregated_params = self.aggregator.aggregate(model_params, weights)

            # Make it available for clients
            self.last_updated_params_url = self.job.update_parameters(aggregated_params)

Attributes (components)

  • logger (could be a separated ExperimentLogger class): stores the output of the clients during the experiment (issue to be opened)
  • strategy: For client sampling and weighting during experiment #38
  • federator: For model aggregation/combination

Method

  • run(sync=True): executes the experiment, synchronously by default (async is not yet considered a priority)

Error handling

SP5-SP6_item3 : MNIST federated training convergence problem

In GitLab by @mvesin on Jun 16, 2021, 12:33

Hints :

[...]
Launching node...
     - Starting communication channel with network...
[...]
# For round 1 we use dry_run so no real training occurs
Train Epoch: 1 [0/60000 (0%)]    Loss: 2.291022
Uploading model parameters to fc8303d2-aad2-46e9-8c63-2f58abdff401.pt
[...]
# For round 2 and 3 we do not use dry_run
[INFO] Training on dataset: /data/mvesin/data
Train Epoch: 1 [0/60000 (0%)]    Loss: 2.278209
Train Epoch: 1 [480/60000 (1%)]    Loss: 1.013210
Train Epoch: 1 [960/60000 (2%)]    Loss: 0.590519
Train Epoch: 1 [1440/60000 (2%)]    Loss: 0.542856
Train Epoch: 1 [1920/60000 (3%)]    Loss: 0.491376
....
Train Epoch: 1 [59040/60000 (98%)]    Loss: 0.068433
Train Epoch: 1 [59520/60000 (99%)]    Loss: 0.100257
Uploading model parameters to 2b55e6f1-c861-4a45-a166-c35a58397d12.pt
# Looks more or less OK for round 2, loss is decreasing until a certain point
[...]
# But for round 3, loss restarts from initial value : is this normal ???
[INFO] Training on dataset: /data/mvesin/data
Train Epoch: 1 [0/60000 (0%)]    Loss: 2.338098
Train Epoch: 1 [480/60000 (1%)]    Loss: 1.019898
Train Epoch: 1 [960/60000 (2%)]    Loss: 0.952290
Train Epoch: 1 [1440/60000 (2%)]    Loss: 0.761609
Train Epoch: 1 [1920/60000 (3%)]    Loss: 0.493840
...
Train Epoch: 1 [58560/60000 (98%)]    Loss: 0.050281
Train Epoch: 1 [59040/60000 (98%)]    Loss: 0.140764
Train Epoch: 1 [59520/60000 (99%)]    Loss: 0.095856
# No improvement after round 3 : not enough data to converge, or a bug ?
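One hypothesis worth checking: the loss resetting at round 3 looks like the node training from the initial weights instead of the aggregated ones. Schematically, the node must overwrite its local parameters before each round (plain dicts stand in for model state here; the helper name is hypothetical):

```python
def start_round(local_params, aggregated_params):
    """Replace local model parameters with the freshly aggregated ones
    before training. If this step is skipped (or loads the wrong file),
    every round restarts from the initial weights, and the loss curve
    resets exactly as observed in the logs above."""
    local_params.clear()
    local_params.update(aggregated_params)
    return local_params
```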

SP5_SP6-item1 : re-organize fedbiomed-network

In GitLab by @mvesin on May 12, 2021, 16:48

  • move application tests (vs unit test) from node and researcher to network
  • move default (public) application configurations from node and researcher to network
  • [ ] re-write mqtt container configuration
  • [ ] re-write django container configuration
  • other ?

SP7-item5 : "phase 3" - Synchronous Training Experiment Resuming

In GitLab by @sssilvar on Jun 22, 2021, 10:33

Experiment should be able to resume from the last round (checkpoint) where it was successful.

Usage

# Example of failure due to client timeout
In []: experiment.run()
Out []: RuntimeError "Client not responding (Timeout Error)"

# Resume training (executing again)
In []: experiment.run()
Out []: warning: "Resuming experiment from round X..."
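A sketch of the checkpointing logic behind this behavior (class and attribute names are hypothetical; in practice `completed_rounds` would be persisted to disk, not kept in memory):

```python
class ResumableExperiment:
    """Sketch: re-running after a failure resumes at the last good round."""

    def __init__(self, rounds):
        self.rounds = rounds
        self.completed_rounds = 0      # checkpoint; persisted in practice

    def run(self):
        if self.completed_rounds > 0:
            print(f"warning: Resuming experiment from round {self.completed_rounds}...")
        for round_i in range(self.completed_rounds, self.rounds):
            self._run_round(round_i)   # may raise on client timeout
            self.completed_rounds = round_i + 1

    def _run_round(self, round_i):
        pass                           # placeholder for a real training round
```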

SP5_SP6-item6 : local training

In GitLab by @mvesin on May 12, 2021, 18:26

  • implement local training
  • compare with fed training (accuracy)
  • [ ] compare federated training with local training in CI test case

SP5_SP6-item6 : file repo organization

In GitLab by @mvesin on May 12, 2021, 18:31

  • add structure to django file repo : (per researcher), per client, per job
  • clean old files
  • move results of job from file repo to researcher side

Strategy Client selection

In GitLab by @sssilvar on Jun 22, 2021, 10:10

One of the problems of fairly learning a model in Federated Learning is how to pick the right clients to learn a model. Sometimes, aggregating all the models coming from the clients can lead to biased results.
NOTE: Synchronous training is enough for now

Therefore, it is necessary to have a ClientSelectionStrategy class. This class would be in charge of how to dynamically trigger client training (and aggregating their local models) at each round.

  • This Strategy class should be defined at the beginning of an experiment/Job
  • When none is defined for an experiment/job, the default behavior is to aggregate all clients
  • Researchers should be able to propose their own strategy by extending the class
  • Fedbiomed should provide some default strategies:
    • UniformSamplingClientStrategy
    • MultiNomialSamplingClientStrategy

Usage

At the definition of the experiment/job:

# Define a Strategy for client selection before starting an experiment
strategy = UniformSamplingClientStrategy()

# Define aggregator
fedavg = FedAverage()

# Include it at the moment of defining the experiment
experiment = Experiment(model_class=Net,
                 training_function=train,
                 model_args=model_args,
                 training_args=training_args,
                 rounds=10,  # Future functionality: Add rounds inside job definition
                 on_data=data,
                 aggregator=fedavg,  # Future functionality: include aggregator
                 client_selection_strategy=strategy # default: None
)

# Run Experiment (default synchronous)
# currently experiment fails if researcher fails/timeouts or infra (file repo/mqtt) fails
experiment.run()

Functionalities

  • Strategies need to keep track of the clients that were selected at each round
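A sketch of how one of the proposed defaults could keep that per-round record (the `fraction` parameter and the history format are assumptions):

```python
import random

class UniformSamplingClientStrategy:
    """Sample a fraction of clients uniformly at random each round,
    keeping a record of who was selected at each round."""

    def __init__(self, fraction=0.5, seed=None):
        self.fraction = fraction
        self.rng = random.Random(seed)
        self.history = []               # one list of selected clients per round

    def sample_clients(self, data):
        # data: {client_id: [dataset descriptions], ...}
        clients = sorted(data)
        k = max(1, round(len(clients) * self.fraction))
        selected = self.rng.sample(clients, k)
        self.history.append(selected)
        return selected
```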

notes centralization

In GitLab by @jsaray on Jun 21, 2021, 17:15

It would be good to centralize all notes taken in meetings in a place that everyone can review
(Inria box for example, or any other network file system).

SP5_SP6-item0 : gitlab repositories

In GitLab by @mvesin on May 12, 2021, 17:11

(Re)define and implement gitlab repositories :

  • merge or rebalance code between fedbiomed-node and fedbiomed-researcher ?
  • add private repo for Inria experiments (configs, datasets, results) ?

SP5_SP6-item1 : clean execution in dev env

In GitLab by @massal on May 6, 2021, 13:08

On fedbiomed-researcher

  • separate code from config file
  • separate code from db, queue (var directory ?)
  • documentation and/or command for cleaning environment (config files, db, queue, cache torchhub)

On fedbiomed-node :

  • separate code from config file
  • separate code from db, queue (var directory ?)
  • documentation and/or command for cleaning environment (config files, db, queue, cache torchhub)

intermediate CI script - [merged]

In GitLab by @mvesin on Jun 28, 2021, 16:11

Merges feature/test_ci -> develop

  • CI script configures/cleans environment, docker, conda on slave

  • ... but no payload yet

  • plus minor corrections (typo in README.md, no docker container stopped in fedbiomed_environment clean)
