Comments (13)
Here is a little draft of how the Researcher and Nodes interact with respect to state_id.
from fedbiomed.
First set of specifications for implementing the initial version:
A node state is a set of data about the status of an experiment *after* a payload (~training/validation for now) execution on a node
- key:
- use state_id (not round number), state_id unique for a *payload* (eg: replaying a round after a breakpoint is a different payload)
- state_id generated by node and returned to researcher (one per node when executing a payload)
- None is used for the first round, or for a new node
- content:
- purpose: a node state is for saving the information needed by the "next round" (next payload ?) to execute (or a payload requested from breakpoint load ?)
- other content: aside from this needed content, a node state may contain optional information (eg: also received from researcher). Goal: double check, informational, provision for future extensions ?
- type of data: can contain information like: auxiliary variables (probably needed) and full optimizer state, secagg ids (probably optional), dataloader state (#418), torch persistent buffers (#529)
- for now, just implement auxiliary variables (also save job_id)
- security: can be used only by payloads with correct job_id
- versioning for node state
- implement breakpoint extension for node state
- choice for DB techno:
- for now, continue with tinyDB
- use distinct table in same tinyDB file
- pay attention to using the same set of fields for each "similar" entry (provision for a later migration to a relational DB ?)
- save model parameters & bigger data as files, not in DB
- implementation
- API: consider adapting `fedbiomed.common.secagg_manager.*SecaggManager`
- implement in `fedbiomed.node`
- pass argument as hierarchical/extendable, dict of `state > optimizer > optimodules > auxiliary variables`
- tinyDB: no key, make sure the state_id is unique in the DB
- node implementation recommendations for handling of node state:
- a node state may be missing or incomplete on the node => the node has to handle it gracefully
- the researcher may send some other value for a field of the state => the researcher value overrides the node state
- job_id is the reference to list all the info (state DB, secagg DB) about one experiment (eg: for listing or cleaning)
- no cleaning implemented in first version
- should be easy to later add cleaning function by job_id from researcher
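Since tinyDB has no key or uniqueness constraint, the spec above asks the implementation to guarantee state_id uniqueness itself, keep entries homogeneous, and store large blobs as files rather than in the DB. A minimal in-memory sketch of that contract (all names are hypothetical, and a plain dict stands in for the tinyDB table):

```python
import uuid
from typing import Any, Dict, Optional

# In-memory stand-in for the tinyDB table (the real code would use a
# distinct table in the same tinyDB file).
_STATE_TABLE: Dict[str, Dict[str, Any]] = {}

def save_node_state(job_id: str, state: Dict[str, Any]) -> str:
    """Insert a state entry with a DB-wide unique state_id.

    Every entry carries the same fields (provision for a later migration to
    a relational DB); model parameters and other big data are saved as
    files, and only their paths go into the `state` content.
    """
    state_id = str(uuid.uuid4())          # unique per payload execution
    while state_id in _STATE_TABLE:       # tinyDB has no key: enforce uniqueness here
        state_id = str(uuid.uuid4())
    _STATE_TABLE[state_id] = {
        "version_node_state": "1.0.0",    # versioning of the node state format
        "state_id": state_id,
        "job_id": job_id,
        # hierarchical, extendable content: state > optimizer > optimodules > aux vars
        "state": state,
    }
    return state_id

def load_node_state(job_id: str, state_id: Optional[str]) -> Optional[Dict[str, Any]]:
    """Return the state only for payloads carrying the correct job_id."""
    if state_id is None:                  # first round / new node
        return None
    entry = _STATE_TABLE.get(state_id)
    if entry is None or entry["job_id"] != job_id:
        return None                       # missing or foreign state: node must handle it
    return entry["state"]
```

The job_id check implements the security rule above: a state can only be used by payloads carrying the correct job_id.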
from fedbiomed.
we discussed the possibility of having an API for researchers who wish to trigger saving some object as part of the node state between rounds. Here are some possible use cases where this would be useful:
- imagine Scaffold was not implemented in Fed-BioMed, and a researcher wants to implement it. The natural thing would be to follow the implementation given in the original paper, where they rely on saving aux variables as part of the node state between rounds
- researcher wants to test a new optimizer with a different state (e.g. some additional momentum-like variables)
- there are several persistent buffers in pytorch that are currently not saved between rounds. A prime example is the batch normalization parameters. Indeed, we have issue #529 open to remind us that we should document this. I imagine that some of these persistent buffers will be automatically saved by our implementation, but I don't think we can foresee all possible situations of interest to researchers
For the use cases above, I would argue that an API allowing the researcher to interact with the node state is desirable. In my head, it would include functions like TrainingPlan.save_to_node_state(id: str, obj: Serializable) -> None and TrainingPlan.load_from_node_state(id: str, round: Optional[int] = None) -> Serializable.
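To make the proposed API concrete, here is a runnable stand-in: TrainingPlanStub, its dict-backed store, and the Scaffold-style variable are all hypothetical, but the save_to_node_state / load_from_node_state signatures follow the proposal above:

```python
from typing import Any, Dict, Optional

class TrainingPlanStub:
    """Minimal stand-in for a TrainingPlan exposing the proposed node-state API.

    A plain dict keyed by round number stands in for the node-side state DB.
    """
    def __init__(self):
        self._node_state: Dict[int, Dict[str, Any]] = {}
        self._round = 0

    def save_to_node_state(self, id: str, obj: Any) -> None:
        # save `obj` under `id` as part of the current round's node state
        self._node_state.setdefault(self._round, {})[id] = obj

    def load_from_node_state(self, id: str, round: Optional[int] = None) -> Any:
        # by default, load the object saved during the previous round
        rnd = self._round - 1 if round is None else round
        return self._node_state.get(rnd, {}).get(id)

    def training_routine(self):
        # e.g. a Scaffold-style correction term kept between rounds
        c_local = self.load_from_node_state("scaffold_c") or 0.0
        c_local += 1.0                     # stand-in for the real update rule
        self.save_to_node_state("scaffold_c", c_local)
        self._round += 1
```

With such an API, the Scaffold use case above could be implemented entirely inside a researcher-written training plan, without touching Fed-BioMed internals.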
from fedbiomed.
I propose NodeStateManager for the name of the object that will deal with Node states:
class NodeStateManager:
    def __init__(self):
        # create or load the table in the database

    def _load_state(self, job_id, state_id):
        # load the state from the database:
        # query by state_id and check that job_id matches

    def _save_state(self, state) -> str:
        # save the state in the DB (at the moment only optimizer auxiliary variables,
        # but in the near future the full optimizer state - check how to do it for
        # native pytorch -, persistent buffers for pytorch, and the dataloader state)
        # For sklearn I guess we will have to save the entire model through pickling

    def get(self, job_id, state_id) -> Dict:
        # run this method before the model training on each node:
        # return the state if it has been found
        self._check_version()
        if state_id is None:
            return None
        res = self._load_state(job_id, state_id)
        if not res:
            # raise a warning
            return None
        return res

    def add(self, job_id, state, state_version) -> str:
        # run this method at the end of training
        new_state_id = self._generate_state_id()
        state_entry = {
            'version': state_version,
            'job_id': job_id,
            'state_id': new_state_id,
        }
        state_entry.update(state)
        self._save_state(state_entry)
        return new_state_id

    def _generate_state_id(self) -> str:
        # generate a new state_id

    def _check_version(self) -> bool:
        # check that the version of the saved state is compatible with the version
        # in `constants.py`: return True if the versions are compatible, otherwise
        # either return False or trigger an error

    def remove(self, job_id=None, state_id=None) -> bool:
        raise NotImplementedError

    def list_states(self, job_id) -> List[Dict]:
        # return all the states saved for one Job
        return self._query(job_id)
For this US, we will have to extend load_state and save_state for non-declearn optimizers.
from fedbiomed.
Changes introduced in Researcher: in Job, or in a new class called NodeStateAgent, add the following methods (mirroring what will be implemented on the Node side).
# def extract_last_node_state(self, round_number) -> Dict[str, str]:
#     return self._state_collection[max(round_number, 0)]
# def collect_node_state(self, round_number: int, responses: Responses):
#     # extract (node_id, state_id) pairs from Responses
#     # self._state_collection = {round_number: {node_id: state_id}}
def get_last_state(self) -> Dict[str, str]:
    # return the last state_id for each node that has lately responded;
    # if a node has not participated, send None
    last_state = {}
    for round_number in sorted(self._state_collection):
        for node_id, state_id in self._state_collection[round_number].items():
            last_state[node_id] = state_id
    return last_state

def initiate_state_collection(self):
    self._state_collection: Dict[int, Dict[str, str]] = {}
    self._state_collection[0] = {}
    for node_id in self._data:
        self._state_collection[0][node_id] = None

def save_state_collection_to_bkpt(self):
    # save state_collection into the breakpoints

def load_state_collection_from_bkpt(self):
    # load state_collection from a breakpoint
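A runnable mini version of the collection logic sketched above (the class name and the in-memory storage are hypothetical; breakpoint handling is omitted):

```python
from typing import Dict, Iterable, Optional

class NodeStateAgentSketch:
    """Researcher-side agent keeping track of the last state_id per node."""
    def __init__(self, node_ids: Iterable[str]):
        # round 0: every known node starts with state_id = None (new node)
        self._state_collection: Dict[int, Dict[str, Optional[str]]] = {
            0: {node_id: None for node_id in node_ids}
        }

    def collect_node_state(self, round_number: int,
                           replies: Dict[str, str]) -> None:
        # replies: node_id -> state_id, extracted from that round's Responses
        self._state_collection[round_number] = dict(replies)

    def get_last_state(self) -> Dict[str, Optional[str]]:
        # walk rounds in ascending order so later rounds overwrite earlier
        # ones; a node absent from recent rounds keeps its older state_id
        # (whether to reset it to None is discussed later in this thread)
        last_state: Dict[str, Optional[str]] = {}
        for round_number in sorted(self._state_collection):
            last_state.update(self._state_collection[round_number])
        return last_state
```

Note the ascending iteration order: iterating rounds from newest to oldest and overwriting on each step would let an old state_id clobber a newer one.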
from fedbiomed.
Thanks @ybouilla for the nice sketch of node/researcher interactions !
One point: for node2 in round3, I would strongly suggest using state_id=None instead of state_id=1224.
Reason: a state makes sense in the context of a specific payload. In this case, state_id=1224 was created in the context of the execution of round1; we have no guarantee it makes sense for round2 (eg we save some scaffold state in round1 but don't use scaffold anymore in round2, or the opposite).
So this means: treat a node that was not available or not selected in the previous round as a new node.
from fedbiomed.
Hi Marc, thanks for your answer. Actually, you mentioned a point that may need more discussion ! :)
There might be some cases where what you said makes sense, but if we take the case of Scaffold, in the paper the authors select a few Nodes among the pool of Nodes (S being the sample of selected Nodes). For the Scaffold implementation, you don't want to lose the auxiliary variables of the nodes that have not communicated during the previous round.
But maybe this detail should belong to the Strategy implementation, and for now we want to keep things simple using only DefaultStrategy.
from fedbiomed.
Update 08.21: for now, for this US, we will consider that all Nodes are involved in the training, because we currently only support DefaultStrategy. Thus, a Node will send None for the state_id if it was disconnected - and raise an error.
But we agree that this remark should be borne in mind, and commented in the code!
from fedbiomed.
Proposal for saving a Node's state
{
    "version_state_id": "1.0.0",
    "state_id": 1234,
    "job_id": 4567,
    "optimizer_state": {
        "optimizer_type": type(optim),
        "state_path": "/path/to/the/state"
    },
    "persistent_buffer": None  # for future implementation
}
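The "version_state_id" field feeds the version check mentioned earlier. A hedged sketch of such a check (assumption: semantic-style versioning where compatibility means matching major version; the real rule would live alongside `constants.py`):

```python
def is_state_version_compatible(entry_version: str,
                                runtime_version: str = "1.0.0") -> bool:
    """Treat a saved state as compatible when its major version matches the
    runtime's expected node-state version (hypothetical rule for illustration)."""
    return entry_version.split(".")[0] == runtime_version.split(".")[0]

# Example state entry following the proposed schema (placeholder values)
state_entry = {
    "version_state_id": "1.0.0",
    "state_id": 1234,
    "job_id": 4567,
    "optimizer_state": {
        "optimizer_type": "torch.optim.SGD",   # placeholder
        "state_path": "/path/to/the/state",
    },
    "persistent_buffer": None,  # for future implementation
}
```

On an incompatible version, the node could either refuse the state (return False and fall back to a fresh state) or raise, as discussed for `_check_version` above.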
from fedbiomed.
update 08.29
We will create methods in Round:
- load_round_state: where we will detail all the elements that need to be restored in a Round
- save_round_state (self-explanatory)
ex:
def save_round_state(self):
    states = {}
    optim_state = self.optimizer.get_state()
    Serializer.dump(optim_state, path)
    states['optimizer_state'] = {'optimizer_type': type(self.optimizer),
                                 'state_path': path}
    # one can add persistent buffers to this `states` variable, or whatever
    # other variable needs to be saved - at one's own risk
    ...
    NodeStateManager.add(states)
from fedbiomed.
Updates:
- optimizer type checking will help make sure the user has not changed their optimizer from one round to another
- we have to find a way to know whether the user has changed the optimizer parameters from one round to another. For declearn optimizers, this can be done using the config (static parameters that the user may change) and the state (optimizer state, should be internal to FBM), which are well separated. For native torch optimizers, the process is more complicated since we don't have such a distinction - we might add a switch that enables/disables the use of the Node states, because loading and saving things from/into the database takes time
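To illustrate the difference between the two cases, here is a hedged sketch of change detection for both styles of optimizer (the function names are hypothetical; the torch-like heuristic assumes hyper-parameters live in 'param_groups' of a state_dict-shaped dict, which is how torch optimizers lay them out):

```python
from typing import Any, Dict

def declearn_like_changed(prev_config: Dict[str, Any],
                          new_config: Dict[str, Any]) -> bool:
    # declearn-style: static hyper-parameters live in a config dict, cleanly
    # separated from the internal state, so a plain comparison is enough
    return prev_config != new_config

def torch_like_changed(prev_state_dict: Dict[str, Any],
                       new_state_dict: Dict[str, Any]) -> bool:
    # torch-style: state_dict() mixes internal state and hyper-parameters;
    # the hyper-parameters sit in 'param_groups' (minus the parameter ids),
    # so we compare only that part (a heuristic, not an exhaustive rule)
    def hyper(sd: Dict[str, Any]):
        return [{k: v for k, v in group.items() if k != "params"}
                for group in sd.get("param_groups", [])]
    return hyper(prev_state_dict) != hyper(new_state_dict)
```

This shows why the declearn case is easy and the native torch case needs a heuristic (or the proposed on/off switch).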
from fedbiomed.
additional work
The use cases quoted earlier in this thread still apply: re-implementing Scaffold-style auxiliary variables, testing a new optimizer with a different state (e.g. some additional momentum-like variables), and pytorch persistent buffers not saved between rounds (e.g. batch normalization parameters, see #529).
We also have to consider the fact that, thanks to the Node saving state, it is possible to have a true validation dataset instead of a randomly generated one at each Round, so we can check things like loss-function over-fitting and so on...
from fedbiomed.