fedbiomed / fedbiomed
A collaborative learning framework for empowering biomedical research
Home Page: https://fedbiomed.org
License: Other
In GitLab by @mvesin on May 12, 2021, 18:11
In GitLab by @mvesin on May 12, 2021, 18:19
In GitLab by @mvesin on May 12, 2021, 17:49
Change from threads to processes for node
In GitLab by @mvesin on May 12, 2021, 16:45
Switch from PoC v3 repos to v3 repo
In GitLab by @mvesin on May 12, 2021, 18:22
Build containers for:
fedbiomed-node
fedbiomed-researcher
mqtt
http
based on Fed-BioMed v2 VPN'ization
In GitLab by @massal on May 6, 2021, 13:01
job.start_clients_training_round()
fails because django upload gives varying types.
In GitLab by @mvesin on Jun 30, 2021, 12:36
Merges feature/test_ci -> develop
In GitLab by @massal on May 6, 2021, 13:04
Create a development environment for executing Fed-BioMed:
In GitLab by @mvesin on Jun 7, 2021, 11:52
Merges feature/repath -> develop
In GitLab by @mvesin on May 12, 2021, 18:29
In GitLab by @mvesin on May 12, 2021, 16:52
Describe and implement environments matching the life cycle of the application:
In GitLab by @mvesin on Jun 21, 2021, 16:19
Work with the DPO and security officer for clearance for the phase 2 clinical experiment:
In GitLab by @mvesin on Jun 16, 2021, 12:35
In GitLab by @mvesin on Jul 5, 2021, 09:26
In GitLab by @mvesin on May 12, 2021, 18:12
In GitLab by @massal on May 6, 2021, 12:56
Try to follow PEP-8, including:
Other:
__init__.py
prefer from xxx import yyy
or import xxx.yyy
rather than from xxx.yyy import func1, func2
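As an illustration of this import guideline (a sketch with standard-library modules, not project code):

```python
# Preferred: import the module (or submodule) so the call site
# keeps the namespace visible.
import json
import os.path

serialized = json.dumps({"a": 1})           # clear where dumps() comes from
joined = os.path.join("fedbiomed", "node")  # same for path helpers

# Discouraged per the guideline: `from json import dumps` or
# `from os.path import join, dirname`, which hide the origin
# of func1, func2 at the call site.
```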
In GitLab by @jsaray on Jun 22, 2021, 16:29
Code the class DefaultStrategy as shown in "Implement Experiment Class". This class is the simplest case: there is no sampling (all clients are chosen), and it should abort if any of the clients don't return. This task is a specialization of #38 after developing the Experiment pseudo-code.
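A minimal sketch of what DefaultStrategy could look like. The sample_clients/refine method names follow the Experiment pseudo-code; the reply format (a list of dicts with 'client_id' and 'params') is an assumption:

```python
class DefaultStrategy:
    """Sketch: no sampling, all clients are chosen; abort on any failure."""

    def sample_clients(self, data):
        # data: {'client_id': [datasets...], ...} -- take everyone
        return list(data)

    def refine(self, replies):
        # replies: list of {'client_id': ..., 'params': ...} (assumed format)
        missing = [r['client_id'] for r in replies if r.get('params') is None]
        if missing:
            # Simplest case: abort if any client did not return
            raise RuntimeError(f"Clients did not return: {missing}")
        params = [r['params'] for r in replies]
        weights = [1.0 / len(replies)] * len(replies)  # equal weighting
        return params, weights
```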
In GitLab by @mvesin on May 12, 2021, 18:28
May be due to node, mqtt, django or researcher (queue...) failure.
In GitLab by @massal on May 6, 2021, 12:57
mqtt to message or messaging, fedbiomed_cli to fedbiomed/node, etc.
In GitLab by @sssilvar on Jun 22, 2021, 11:32
An experiment class used by the researcher to train a model using Federated Learning:

# Include it at the moment of defining the experiment
class Experiment:
    def __init__(self,
                 tags: list,
                 model_class: fedbiomed.common.Torchnn.Module,
                 model_args: Dict,      # {'layers': 4, ...}
                 training_args: Dict,   # {'epochs': 15, 'lr': 1e-3, ...}
                 rounds: int,
                 aggregator: fedbiomed.researcher.aggregators.Aggregator,
                 client_selection_strategy: fedbiomed.researcher.strategy.Strategy = None  # default: None
                 ):
        # TODO: FederatedDataset class
        self.data = search_data(tags)  # {'client_id1': [{data1_id: ...}, {data1_2d: ...}], ...}
        # Create job
        # TODO: refactor, Job should not retrieve data/clients
        self.job = Job(model=model_class,
                       model_args=model_args,
                       training_args=training_args)
        # Define aggregator and client selection strategy
        self.aggregator = aggregator
        if client_selection_strategy is None:
            # Wait for all to share training results
            # Default behavior: raise an error on any failure
            self.client_selection_strategy = DefaultClientSelectionStrategy()
        else:
            self.client_selection_strategy = client_selection_strategy
        self.rounds = rounds
        self.last_updated_params_url = None

    def run(self, sync=True):
        if not sync:
            raise NotImplementedError("One day....")
        # Run experiment
        for round_i in range(self.rounds):
            # Sample clients using strategy (if given)
            self.job.clients = self.client_selection_strategy.sample_clients(self.data)  # self.job.clients in self.data
            # Trigger training round on sampled clients
            self.job.start_clients_training_round(round=round_i)
            # Assert/refine strategy for the current round
            model_params, weights = self.client_selection_strategy.refine(self.job.training_replies[round_i])
            # Aggregate
            aggregated_params = self.aggregator.aggregate(model_params, weights)
            # Make it available for clients
            self.last_updated_params_url = self.job.update_parameters(aggregated_params)
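The aggregate(model_params, weights) step above can be sketched as a plain weighted average (FedAvg-style); the flat-list parameter format is an assumption for illustration:

```python
def fed_average(model_params, weights):
    """Weighted average of per-client parameters (FedAvg-style sketch).

    model_params: one flat list of floats per client (same length for all)
    weights: one weight per client, summing to 1 (e.g. dataset-size shares)
    """
    n = len(model_params[0])
    # For each parameter index, sum the client values scaled by their weight
    return [sum(w * client[i] for client, w in zip(model_params, weights))
            for i in range(n)]
```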
logger (could be a separate ExperimentLogger class): stores the output of the clients during the experiment (issue to be opened)
strategy: for client sampling and weighting during the experiment #38
federator: for model aggregation/combination
run(sync=True): executes the experiment, by default synchronously (async is not yet considered a priority)
In GitLab by @mvesin on Jun 16, 2021, 11:02
Merges feature/model_send -> develop
add import for whole module instead of just class
In GitLab by @jsaray on Jun 22, 2021, 16:33
The new design would imply to refactor these classes (and even more),
In GitLab by @massal on May 6, 2021, 12:58
In GitLab by @mvesin on May 12, 2021, 17:32
(flake8, pylint, pychecker or other)
In GitLab by @mvesin on Jun 16, 2021, 12:33
Hints :
[...]
Launching node...
- Starting communication channel with network...
[...]
# For round 1 we use dry_run so no real training occurs
Train Epoch: 1 [0/60000 (0%)] Loss: 2.291022
Uploading model parameters to fc8303d2-aad2-46e9-8c63-2f58abdff401.pt
[...]
# For round 2 and 3 we do not use dry_run
[INFO] Training on dataset: /data/mvesin/data
Train Epoch: 1 [0/60000 (0%)] Loss: 2.278209
Train Epoch: 1 [480/60000 (1%)] Loss: 1.013210
Train Epoch: 1 [960/60000 (2%)] Loss: 0.590519
Train Epoch: 1 [1440/60000 (2%)] Loss: 0.542856
Train Epoch: 1 [1920/60000 (3%)] Loss: 0.491376
....
Train Epoch: 1 [59040/60000 (98%)] Loss: 0.068433
Train Epoch: 1 [59520/60000 (99%)] Loss: 0.100257
Uploading model parameters to 2b55e6f1-c861-4a45-a166-c35a58397d12.pt
# Looks more or less OK for round 2: loss is decreasing until a certain point
[...]
# But for round 3, loss restarts from the initial value: is this normal???
[INFO] Training on dataset: /data/mvesin/data
Train Epoch: 1 [0/60000 (0%)] Loss: 2.338098
Train Epoch: 1 [480/60000 (1%)] Loss: 1.019898
Train Epoch: 1 [960/60000 (2%)] Loss: 0.952290
Train Epoch: 1 [1440/60000 (2%)] Loss: 0.761609
Train Epoch: 1 [1920/60000 (3%)] Loss: 0.493840
...
Train Epoch: 1 [58560/60000 (98%)] Loss: 0.050281
Train Epoch: 1 [59040/60000 (98%)] Loss: 0.140764
Train Epoch: 1 [59520/60000 (99%)] Loss: 0.095856
# We are not better after round 3: not enough data to converge, or a bug?
In GitLab by @mvesin on Jun 16, 2021, 12:23
In GitLab by @mvesin on May 12, 2021, 18:35
In GitLab by @mvesin on May 12, 2021, 16:48
In GitLab by @mvesin on May 12, 2021, 17:38
switch to icons with proper license
In GitLab by @massal on May 6, 2021, 13:07
Create a different config.ini
per node/researcher id to handle multiple instances running from the same development env
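A possible sketch of this with configparser; the file naming scheme and section/key names are assumptions, not the project's actual layout:

```python
import configparser
import os

def write_instance_config(instance_id, base_dir):
    """Write a per-instance config file so several nodes/researchers
    can run from the same development environment (hypothetical helper)."""
    cfg = configparser.ConfigParser()
    cfg['default'] = {'id': instance_id}  # assumed section/key names
    os.makedirs(base_dir, exist_ok=True)
    # One file per id avoids instances clobbering each other's config.ini
    path = os.path.join(base_dir, f'config_{instance_id}.ini')
    with open(path, 'w') as f:
        cfg.write(f)
    return path
```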
In GitLab by @mvesin on May 12, 2021, 18:18
Use an alternate file repo (e.g. existing django) for model and training function, to remove the github.com dependency
In GitLab by @mvesin on May 12, 2021, 17:21
In GitLab by @sssilvar on Jun 22, 2021, 10:33
Experiment should be able to resume from the last round (checkpoint) where it was successful.
# Example of failure due to client timeout
In []: experiment.run()
Out []: RuntimeError "Client not responding (Timeout Error)"
# Resume training (executing again)
In []: experiment.run()
Out []: warning: "Resuming experiment from round X..."
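One way to sketch this resume behavior, checkpointing only the round counter; _run_round stands in for the real sampling/training/aggregation logic and is a placeholder name:

```python
class ResumableExperiment:
    """Sketch: re-running run() continues from the last completed round."""

    def __init__(self, rounds):
        self.rounds = rounds
        self._completed = 0  # checkpoint: number of rounds finished

    def _run_round(self, round_i):
        pass  # placeholder for sampling / training / aggregation

    def run(self):
        if self._completed:
            print(f"Resuming experiment from round {self._completed}...")
        for round_i in range(self._completed, self.rounds):
            self._run_round(round_i)       # may raise, e.g. client timeout
            self._completed = round_i + 1  # record progress only on success
```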
In GitLab by @mvesin on May 12, 2021, 18:26
In GitLab by @massal on May 6, 2021, 13:05
In GitLab by @mvesin on May 12, 2021, 18:31
In GitLab by @massal on May 6, 2021, 13:03
Test behaviour under failure scenarios during training: mqtt failure, django failure, node failure, researcher failure
In GitLab by @mvesin on May 12, 2021, 17:47
Re-write node user interface (fedbiomed_cli.utils.cli:main)
In GitLab by @sssilvar on Jun 22, 2021, 10:10
One of the problems of fairly learning a model in Federated Learning is how to pick the right clients to learn a model. Sometimes, aggregating all the models coming from the clients can lead to biased results.
NOTE: Synchronous training is enough for now
Therefore, it is necessary to have a ClientSelectionStrategy
class. This class would be in charge of dynamically triggering client training (and aggregating the local models) at each round.
UniformSamplingClientStrategy
MultiNomialSamplingClientStrategy
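UniformSamplingClientStrategy could be sketched as below; the fraction parameter and the sample_clients signature are assumptions for illustration:

```python
import random

class UniformSamplingClientStrategy:
    """Sketch: pick a uniform random subset of clients each round."""

    def __init__(self, fraction=0.5, seed=None):
        self.fraction = fraction
        self._rng = random.Random(seed)  # seedable for reproducible runs

    def sample_clients(self, data):
        # data: {'client_id': [datasets...], ...}
        clients = sorted(data)
        k = max(1, round(self.fraction * len(clients)))  # at least one client
        return self._rng.sample(clients, k)
```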
At the definition of the experiment/job:
# Define a Strategy for client selection before starting an experiment
strategy = UniformSamplingClientStrategy()
# Define aggregator
fedavg = FedAverage()
# Include it at the moment of defining the experiment
experiment = Experiment(model_class=Net,
                        training_function=train,
                        model_args=model_args,
                        training_args=training_args,
                        rounds=10,  # Future functionality: add rounds inside job definition
                        on_data=data,
                        aggregator=fedavg,  # Future functionality: include aggregator
                        client_selection_strategy=strategy  # default: None
                        )
# Run Experiment (default synchronous)
# currently the experiment fails if the researcher fails/times out or the infra (file repo/mqtt) fails
experiment.run()
In GitLab by @mvesin on May 12, 2021, 18:00
Evaluate alternatives to MQTT:
In GitLab by @jsaray on Jun 21, 2021, 17:15
It would be good to centralize all the notes taken in meetings on a website that everyone can review
(Inria box for example, or any other network file system).
In GitLab by @mvesin on May 12, 2021, 17:19
Create fedbiomed-{node,network,researcher}
, configure access, hooks, etc.
In GitLab by @mvesin on May 12, 2021, 17:11
(Re)define and implement gitlab repositories:
In GitLab by @massal on May 6, 2021, 13:08
On fedbiomed-researcher (var directory?)
On fedbiomed-node: (var directory?)
In GitLab by @mvesin on Jun 30, 2021, 17:16
Merges feature/misc_ci -> develop
In GitLab by @mvesin on May 12, 2021, 18:34
In GitLab by @mvesin on Jun 28, 2021, 16:11
Merges feature/test_ci -> develop
CI script configures/cleans environment, docker, conda on slave
... but no payload yet
plus minor corrections (typo in README.md, no docker container stopped in fedbiomed_environment clean)
In GitLab by @mvesin on May 12, 2021, 17:21
Configure CI (with a dummy payload) for new repos.
In GitLab by @mvesin on Jun 22, 2021, 11:19
In GitLab by @mvesin on Jun 22, 2021, 11:20