
Comments (13)

ybouilla commented on September 26, 2024

Here is a little draft of how the Researcher and the Nodes interact with respect to state_id

[Image: node state drawio(2)]

from fedbiomed.

mvesin commented on September 26, 2024

First set of specifications for implementing initial version:

A node state is a set of data about the status of an experiment *after* a payload (~training/validation for now) execution on a node 
    - key:
        - use state_id (not round number), state_id unique for a *payload* (eg: replaying a round after a breakpoint is a different payload)
        - state_id generated by node and returned to researcher (one per node when executing a payload)
        - None is used for first round, new node
    - content:
        - purpose: a node state is for saving the information needed by the "next round" (next payload ?) to execute (or a payload requested from breakpoint load ?)
        - other content: aside from this needed content, a node state may contain optional information (eg: also received from researcher). Goal: double check, informational, provision for future extensions ?
        - type of data: can contain information like: auxiliary variables (probably needed) and full optimizer state, secagg ids (probably optional), dataloader state (#418), torch persistent buffers (#529)
        - for now, just implement auxiliary variables (also save job_id)
    - security: can be used only by payloads with correct job_id
    - versioning for node state
    - implement breakpoint extension for node state
    - choice of DB technology:
        - for now, continue with tinyDB
        - use distinct table in same tinyDB file
        - take care that "similar" entries all use the same number of params (provision for migration to relational DB later ?)
        - save model parameters & bigger data as files, not in DB
    - implementation
        - API: consider adapting `fedbiomed.common.secagg_manager.*SecaggManager`
        - implement in `fedbiomed.node`
        - pass argument as hierarchical/extendable, dict of `state > optimizer > optimodules > auxiliary variables` 
        - tinyDB: no key, make sure the state_id is unique in the DB
    - node implementation recommendations for handling of node state:
        - node state may be missing or incomplete on the node => node has to handle
            - 
        - researcher may send some other value for a field of the state => researcher value overrides node state
    - job_id is the reference to list all the info (state DB, secagg DB) about one experiment (eg: for listing or cleaning)
    - no cleaning implemented in first version
        - should be easy to later add cleaning function by job_id from researcher
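
A minimal sketch of the "researcher value overrides node state" rule applied to a hierarchical state dict, with a missing node state handled as an empty one. The function name `merge_state` and the example keys and values are assumptions made for illustration; they are not part of Fed-BioMed.

```python
from copy import deepcopy
from typing import Any, Dict, Optional


def merge_state(node_state: Optional[Dict[str, Any]],
                researcher_values: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Recursively merge two nested dicts; researcher values win on conflict."""
    # A missing node state (None) is treated as an empty state.
    merged = deepcopy(node_state) if node_state else {}
    for key, value in (researcher_values or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_state(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical state saved on the node after the previous payload.
saved = {"optimizer": {"optimodules": {"scaffold": {"aux_var": [0.1, 0.2]}}, "lr": 0.01}}
# Hypothetical values sent by the researcher with the next payload.
sent = {"optimizer": {"lr": 0.001}}

state = merge_state(saved, sent)
```

Here `state` keeps the auxiliary variables saved on the node, while `lr` takes the researcher's value; `saved` itself is left unmodified.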


sharkovsky commented on September 26, 2024

we discussed the possibility of having an API for researchers that wish to trigger saving some object as part of the node state between rounds. Here are some possible use cases where this would be useful:

  • imagine Scaffold was not implemented in Fed-BioMed, and a researcher wants to implement it. The natural thing would be to follow the implementation given in the original paper, where they rely on saving aux variables as part of the node state between rounds
  • researcher wants to test a new optimizer with a different state (e.g. some additional momentum-like variables)
  • there are several persistent buffers in pytorch that are currently not saved between rounds. A prime example is the batch normalization parameters. Indeed, we have issue #529 open to remind us that we should document this. I imagine that some of these persistent buffers will be automatically saved by our implementation, but I don't think we can foresee all possible situations of interest to researchers

For the use cases above, I would argue that an API allowing the researcher to interact with the node state is desirable. In my head, it would include functions like `TrainingPlan.save_to_node_state(id: str, obj: Serializable) -> None` and `TrainingPlan.load_from_node_state(id: str, round: Optional[int] = None) -> Serializable`.
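
To make the two proposed functions concrete, here is a rough in-memory sketch. The class name `NodeStateMixin`, the `_current_round` attribute and the dict storage are assumptions for illustration only (a real version would persist objects on the node), and `Any` stands in for the `Serializable` type, which is not defined in this thread.

```python
from typing import Any, Dict, Optional, Tuple


class NodeStateMixin:
    """Hypothetical mixin that a training plan could inherit from."""

    def __init__(self) -> None:
        self._current_round = 0
        # (round, id) -> saved object; kept in memory for this sketch only
        self._node_state: Dict[Tuple[int, str], Any] = {}

    def save_to_node_state(self, id: str, obj: Any) -> None:
        """Save `obj` under `id` for the current round."""
        self._node_state[(self._current_round, id)] = obj

    def load_from_node_state(self, id: str, round: Optional[int] = None) -> Any:
        """Return the object saved under `id`, by default from the latest round that has it."""
        if round is not None:
            return self._node_state[(round, id)]
        rounds = [r for (r, key) in self._node_state if key == id]
        if not rounds:
            raise KeyError(id)
        return self._node_state[(max(rounds), id)]


plan = NodeStateMixin()
plan.save_to_node_state("momentum", [0.0, 0.0])   # saved during round 0
plan._current_round = 1
plan.save_to_node_state("momentum", [0.5, 0.1])   # saved during round 1
```

With this sketch, `plan.load_from_node_state("momentum")` returns the round 1 object, and passing `round=0` returns the older one.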


ybouilla commented on September 26, 2024

I propose NodeStateManager for the name of the object that will deal with Node states

import warnings
from typing import Dict, List, Optional


class NodeStateManager:
    def __init__(self):
        # create or load the node state table in the database
        pass

    def _load_state(self, job_id: str, state_id: str) -> Optional[Dict]:
        # load the state from the database:
        # query on state_id and check that job_id is the same
        raise NotImplementedError

    def _save_state(self, state_entry: Dict) -> None:
        # save the state in the DB (at the moment only optimizer auxiliary variables, but in
        # the near future the optimizer state (check how to do it for native PyTorch), the
        # PyTorch persistent buffers and the state of the dataloader).
        # For sklearn, I guess we will have to save the entire model through pickling.
        raise NotImplementedError

    def get(self, job_id: str, state_id: Optional[str]) -> Optional[Dict]:
        # run this method before the model training on each node: return the state if found
        self._check_version()
        if state_id is None:
            return None
        res = self._load_state(job_id, state_id)
        if not res:
            warnings.warn(f"no node state found for state_id {state_id}")
            return None
        return res

    def add(self, job_id: str, state: Dict, state_version: str) -> str:
        # run this method at the end of training
        new_state_id = self._generate_state()
        state_entry = {}
        state_entry['version'] = state_version
        state_entry['job_id'] = job_id
        state_entry['state_id'] = new_state_id
        state_entry.update(state)
        self._save_state(state_entry)
        return new_state_id

    def _generate_state(self) -> str:
        # generate a state_id
        raise NotImplementedError

    def _check_version(self) -> bool:
        # check that the version of the saved state is compatible with the version in `constants.py`
        # return True if the versions are compatible, otherwise either return False or trigger an error
        raise NotImplementedError

    def remove(self, job_id=None, state_id=None) -> bool:
        raise NotImplementedError()

    def list_states(self, job_id: str) -> List[Dict]:
        # return all states saved for one job (database query on job_id)
        raise NotImplementedError

For this US, we will have to extend load_state and save_state for non-declearn optimizers
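
To illustrate the intended `add` / `get` / `list_states` behaviour without a database, here is a small in-memory sketch. It is not the proposed implementation: the class name `InMemoryNodeStateManager`, the `STATE_VERSION` constant, the `node_state_` prefix and the use of `uuid4` for state_id are all assumptions made for the example, and a plain dict stands in for the tinyDB table.

```python
import uuid
import warnings
from typing import Any, Dict, List, Optional

STATE_VERSION = "1.0.0"  # assumed current version of the node state format


class InMemoryNodeStateManager:
    def __init__(self) -> None:
        # state_id -> entry; stands in for the database table
        self._table: Dict[str, Dict[str, Any]] = {}

    def add(self, job_id: str, state: Dict[str, Any]) -> str:
        """Store a state at the end of training and return its new state_id."""
        state_id = "node_state_" + str(uuid.uuid4())
        entry = {"version": STATE_VERSION, "job_id": job_id, "state_id": state_id}
        entry.update(state)
        self._table[state_id] = entry
        return state_id

    def get(self, job_id: str, state_id: Optional[str]) -> Optional[Dict[str, Any]]:
        """Return the state, or None for a first round or a missing/foreign state."""
        if state_id is None:
            return None
        entry = self._table.get(state_id)
        # Security rule from the specification: only payloads with the right job_id.
        if entry is None or entry["job_id"] != job_id:
            warnings.warn(f"no state {state_id!r} found for job {job_id!r}")
            return None
        return entry

    def list_states(self, job_id: str) -> List[Dict[str, Any]]:
        return [e for e in self._table.values() if e["job_id"] == job_id]


manager = InMemoryNodeStateManager()
sid = manager.add("job-1", {"optimizer_state": {"state_path": "/path/to/the/state"}})
```

A request with another job_id, for example `manager.get("job-2", sid)`, only triggers a warning and returns None, so the node can carry on as if it were a new node.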


ybouilla commented on September 26, 2024

Changes introduced in Researcher

  1. in Job, add the following methods, or put them in a new class called NodeStateAgent that will be implemented on Node side

from typing import Dict, Optional

# def extract_last_node_state(self, round_number) -> Dict[str, str]:
#     return self._state_collection[max(round_number, 0)]

# def collect_node_state(self, round_number: int, responses):
#     # extract the couples (node_id, state_id) from responses
#     self._state_collection = {round_number: {node_id: state_id}}

def get_last_state(self) -> Dict[str, Optional[str]]:
    # return the last state_id of each node, taken from the most recent round in which
    # the node responded. If a node has not participated, send None
    last_state = {}
    for round_number in sorted(self._state_collection, reverse=True):
        for node_id, state_id in self._state_collection[round_number].items():
            # rounds are visited from newest to oldest: keep the first value seen
            last_state.setdefault(node_id, state_id)
    return last_state

def initiate_state_collection(self):
    self._state_collection: Dict[int, Dict[str, Optional[str]]] = {}
    self._state_collection[0] = {}
    for node_id in self._data:
        self._state_collection[0][node_id] = None

def save_state_collection_from_bkpt(self):
    # save state_collection into the breakpoint
    raise NotImplementedError

def load_state_collection_from_bkpt(self):
    # load state_collection from a breakpoint
    raise NotImplementedError


mvesin commented on September 26, 2024

Thanks @ybouilla for the nice sketch of node/researcher interactions !

One point: for node2 in round3, I would strongly suggest using state_id=None instead of state_id=1224.

Reason: a state makes sense in the context of a specific payload. In this case, state_id=1224 was created in the context of the execution of round1, we have no guarantee it makes sense for round2 (eg we save some scaffold state in round1 but don't use scaffold anymore in round2, or the opposite).

So this means: treat a node that was not available or not selected in the previous round as a new node.
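
The rule above can be sketched on the researcher side as follows. The function name, the shape of the collection (round number, then node_id, then state_id) and the state_id values other than 1224 are assumptions made for the example.

```python
from typing import Dict, Iterable, Optional

StateCollection = Dict[int, Dict[str, Optional[str]]]


def state_ids_for_next_round(collection: StateCollection,
                             node_ids: Iterable[str]) -> Dict[str, Optional[str]]:
    """Reuse a state_id only if the node replied in the most recent round."""
    if not collection:
        return {node_id: None for node_id in node_ids}
    previous = collection[max(collection)]
    # A node absent from the previous round gets None, i.e. it is treated as a new node.
    return {node_id: previous.get(node_id) for node_id in node_ids}


collection = {
    1: {"node1": "1223", "node2": "1224"},
    2: {"node1": "1225"},  # node2 did not take part in round 2
}
next_ids = state_ids_for_next_round(collection, ["node1", "node2"])
```

For round 3, `next_ids` maps node1 to the state it produced in round 2 and node2 to None, even though node2 still has state 1224 stored from round 1.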


ybouilla commented on September 26, 2024

Hi Marc, thanks for your answer. You raise a point that may need more discussion! :)
There might be cases where what you said makes sense. But take the case of Scaffold: in the paper, the authors select only a few Nodes among the pool of Nodes (S being the sample of selected Nodes). For a Scaffold implementation, you don't want to lose the auxiliary variables of the Nodes that have not communicated during the previous round.

But maybe this detail should belong to the Strategy implementation, and for now we want to keep things simple by using only defaultstrategy.


ybouilla commented on September 26, 2024

Update 08.21: for this US, we will consider that all Nodes are involved in the training, because we currently only support defaultstrategy. Thus, a Node will send None for the state_id if disconnected, and raise an error.
But we agree that this remark should be borne in mind, and commented in the code!


ybouilla commented on September 26, 2024

Proposal for saving a Node's state

{
    "version_state_id": "1.0.0",
    "state_id": 1234,
    "job_id": 4567,
    "optimizer_state": {
        "optimizer_type": type(optim),
        "state_path": "/path/to/the/state",
    },
    "persistent_buffer": None,  # for future implementation
}
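
Since each entry carries a version, the node needs some rule to decide whether a saved state can still be read. The sketch below assumes three-part numeric versions and assumes that a state is readable when it has the same major number as the running code and is not newer than it; both the rule and the function names are illustrative guesses, not the rule settled in this thread.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a 'major.minor.patch' string into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_state_compatible(saved_version: str, current_version: str) -> bool:
    """Assumed rule: same major number, and saved state not newer than the code."""
    saved = parse_version(saved_version)
    current = parse_version(current_version)
    return saved[0] == current[0] and saved <= current
```

Under this assumed rule, a state saved as "1.0.0" is readable by code at "1.2.0", while states saved as "2.0.0" or "1.3.0" are rejected.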


ybouilla commented on September 26, 2024

[Image: Node state proposal(2) drawio]


ybouilla commented on September 26, 2024

update 08.29

We will create two methods in Rounds:
load_round_state, where we will detail all the elements that need to be saved in a Round,
and save_round_state (self-explanatory)

ex:

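def save_round_state(self):
    # `path` (the file where the optimizer state is dumped) still has to be defined
    states = {}
    optim_state = self.optimizer.get_state()
    Serializer.dump(optim_state, path)
    # the DB entry only keeps the optimizer type and the path of the dumped state
    states['optimizer_state'] = {'optimizer_type': type(self.optimizer),
                                 'state_path': path}

    # one can add persistent buffers to this `states` variable, or any other variable that
    # needs to be saved; this should be done at one's own risk.
    ...

    # the manager instance, job_id and state_version are expected to come from the Round
    node_state_manager.add(job_id, states, state_version)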
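
A self-contained sketch of the same idea, where the large state goes to a file and only its type and path go into the entry. It uses `pickle` from the standard library as a stand-in for the `Serializer` mentioned above; the function names, the file name and the `"MyOptimizer"` label are assumptions for the example.

```python
import os
import pickle
import tempfile
from typing import Any, Dict


def save_optimizer_state(optim_state: Dict[str, Any], optimizer_type: str,
                         directory: str) -> Dict[str, Any]:
    """Dump the state to a file and return the small entry destined for the DB."""
    path = os.path.join(directory, "optimizer_state.pkl")
    with open(path, "wb") as f:
        pickle.dump(optim_state, f)
    return {"optimizer_state": {"optimizer_type": optimizer_type, "state_path": path}}


def load_optimizer_state(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Read back the state from the path recorded in the entry."""
    with open(entry["optimizer_state"]["state_path"], "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    entry = save_optimizer_state({"aux_var": [0.1, 0.2]}, "MyOptimizer", tmp)
    restored = load_optimizer_state(entry)
```

The entry stays small whatever the size of the optimizer state, which matches the earlier point about keeping model parameters and bigger data out of the DB.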


ybouilla commented on September 26, 2024

Updates:

  • optimizer type checking will help make sure the user has not changed their optimizer from one round to another
  • we have to find a way to know how the user has changed the optimizer parameters from one round to another. For declearn optimizers, this can be done using config (static parameters that the user may change) and state (optimizer state, which should be internal to FBM), which are well separated. For native torch optimizers, the process is more complicated since there is no such distinction
  • we might have a switch that enables/disables the use of the Node states, because loading and saving things from/into the database takes time
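
As a rough illustration of the type check and of the enable/disable switch, the sketch below returns the saved optimizer entry only when node states are enabled and the optimizer class name still matches. The function name, the `use_node_state` flag, the dummy classes and the choice to store the type as a class name string (instead of the `type(optim)` object used in the proposal) are assumptions for the example.

```python
from typing import Any, Dict, Optional


def state_to_load(entry: Optional[Dict[str, Any]], optimizer: Any,
                  use_node_state: bool = True) -> Optional[Dict[str, Any]]:
    """Return the saved optimizer entry only if enabled and still matching."""
    if not use_node_state or entry is None:
        return None
    saved = entry.get("optimizer_state")
    if saved is None:
        return None
    # Compare the recorded class name with the optimizer used in this round.
    if saved.get("optimizer_type") != type(optimizer).__name__:
        return None
    return saved


class Sgd:   # dummy optimizer classes, for the example only
    pass


class Adam:
    pass


entry = {"optimizer_state": {"optimizer_type": "Sgd", "state_path": "/path/to/the/state"}}
```

If the researcher switches from `Sgd` to `Adam` between rounds, `state_to_load(entry, Adam())` returns None and the round starts from a fresh optimizer state.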


ybouilla commented on September 26, 2024

additional work

  • imagine Scaffold was not implemented in Fed-BioMed, and a researcher wants to implement it. The natural thing would be to follow the implementation given in the original paper, where they rely on saving aux variables as part of the node state between rounds
  • researcher wants to test a new optimizer with a different state (e.g. some additional momentum-like variables)
  • there are several persistent buffers in pytorch that are currently not saved between rounds. A prime example is the batch normalization parameters. Indeed, we have issue #529 open to remind us that we should document this. I imagine that some of these persistent buffers will be automatically saved by our implementation, but I don't think we can foresee all possible situations of interest to researchers

We also have to consider that, thanks to the Node saving its state, it becomes possible to have a true validation dataset rather than one randomly generated at each Round, so we can check things like overfitting through the loss function and so on.

