
symbioticlab / fedscale

385 stars, 11 watchers, 121 forks, 80.7 MB

FedScale is a scalable and extensible open-source federated learning (FL) platform.

Home Page: https://fedscale.ai

License: Apache License 2.0

Shell 0.49% Python 69.18% MATLAB 0.04% C++ 26.00% Cuda 1.49% C 1.45% Cython 0.30% Java 1.04% CMake 0.02%
benchmark dataset machine-learning federated-learning icml deep-learning mlsys osdi deployment distributed

fedscale's People

Contributors

amberljc, andli28, chuheng001, continue-revolution, donggook-me, dywsjtu, ericdinging, etesami, ewenw, fanlai0990, ikace, li1553770945, liu-yunzhen, mosharaf, qinyeli, romero027, samuelgong, singam-sanjay, xsh-sh, xuyangm, yuxuan18


fedscale's Issues

GRPC error

Hi, thanks for building this FL library!
I'm trying to run a toy example with FEMNIST. However, the program reports a gRPC error as follows:

E0526 04:57:27.528504823   23761 chttp2_transport.cc:1115]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E0526 04:57:27.528580558   23761 chttp2_transport.cc:1115]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E0526 04:57:27.528625026   23761 client_channel.cc:691]      chand=0x559f9651a5c8: Illegal keepalive throttling value 9223372036854775807
(05-26) 04:58:28 INFO     [client.py:188] Training of (CLIENT: 836) completes, {'clientId': 836, 'moving_loss': 4.184056043624878, 'trained_size': 400, 'success': True, 'utility': 857.4471839768071}
Traceback (most recent call last):
  File "/mnt/gaodawei.gdw/FedScale/fedscale/core/executor.py", line 346, in <module>
    executor.run()
  File "/mnt/gaodawei.gdw/FedScale/fedscale/core/executor.py", line 141, in run
    self.event_monitor()
  File "/mnt/gaodawei.gdw/FedScale/fedscale/core/executor.py", line 318, in event_monitor
    client_id, train_res = self.Train(train_config)
  File "/mnt/gaodawei.gdw/FedScale/fedscale/core/executor.py", line 172, in Train
    meta_result = None, data_result = None
  File "/mnt/gaodawei.gdw/miniconda3/envs/fedscale/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/mnt/gaodawei.gdw/miniconda3/envs/fedscale/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Broken pipe"
	debug_error_string = "{"created":"@1653541108.923041466","description":"Error received from peer ipv4:0.0.0.0:1000","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Broken pipe","grpc_status":14}"

It seems like there are too many pings for the port?
Here is my yaml:

# Configuration file of FAR training experiment

# ========== Cluster configuration ========== 
# ip address of the parameter server (need 1 GPU process)
# ps_ip: 10.0.0.1
ps_ip:  0.0.0.0

# ip address of each worker:# of available gpus process on each gpu in this node
# Note that if we collocate ps and worker on same GPU, then we need to decrease this number of available processes on that GPU by 1
# E.g., master node has 4 available processes, then 1 for the ps, and worker should be set to: worker:3
worker_ips: 
    - 0.0.0.0:[4,4]
#    - 10.0.0.1:[4] # worker_ip: [(# processes on gpu) for gpu in available_gpus] eg. 10.0.0.2:[4,4,4,4] This node has 4 gpus, each gpu has 4 processes.

exp_path: /mnt/gaodawei.gdw/FedScale/fedscale/core

# Entry function of executor and aggregator under $exp_path
executor_entry: executor.py
aggregator_entry: aggregator.py

auth:
    ssh_user: "root"
    ssh_private_key: ~/.ssh/id_rsa

# cmd to run before we can indeed run FAR (in order)
setup_commands:
    - source /mnt/gaodawei.gdw/miniconda3/bin/activate fedscale
    - export NCCL_SOCKET_IFNAME='enp94s0f0'         # Run "ifconfig" to ensure the right NIC for nccl if you have multiple NICs

# ========== Additional job configuration ==========
# Default parameters are specified in argParser.py, wherein more description of the parameter can be found

job_conf: 
    - job_name: femnist                   # Generate logs under this folder: log_path/job_name/time_stamp
    - log_path: /mnt/gaodawei.gdw/FedScale/evals # Path of log files
    - total_worker: 2                      # Number of participants per round, we use K=100 in our paper, large K will be much slower
    - data_set: femnist                     # Dataset: openImg, google_speech, stackoverflow
    - data_dir: /mnt/gaodawei.gdw/FedScale/dataset/data/femnist    # Path of the dataset
    - data_map_file: /mnt/gaodawei.gdw/FedScale/dataset/data/femnist/client_data_mapping/train.csv     # Allocation of data to each client, turn to iid setting if not provided
    - device_conf_file: /mnt/gaodawei.gdw/FedScale/dataset/data/device_info/client_device_capacity     # Path of the client trace
    - device_avail_file: /mnt/gaodawei.gdw/FedScale/dataset/data/device_info/client_behave_trace
    - model: convnet2                            # Models: e.g., shufflenet_v2_x2_0, mobilenet_v2, resnet34, albert-base-v2
    - gradient_policy: fed-avg                 # {"fed-yogi", "fed-prox", "fed-avg"}, "fed-avg" by default
    - eval_interval: 10                     # How many rounds to run a testing on the testing set
    - epochs: 50                          # Number of rounds to run this training. We use 1000 in our paper, while it may converge w/ ~400 rounds
    - filter_less: 0                       # Remove clients w/ less than 21 samples
    - num_loaders: 8
    - yogi_eta: 3e-3 
    - yogi_tau: 1e-8
    - local_steps: 20
    - learning_rate: 0.01
    - batch_size: 20
    - test_bsz: 20
    - malicious_factor: 4
    - use_cuda: True
    - decay_factor: 1.  # decay factor of the learning rate
    - sample_seed: 12345
    - test_ratio: 1.
    - loss_decay: 0.
    - input_dim: 3
    - overcommitment: 1.5
    - use_cuda: False
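
Regarding the GOAWAY / too_many_pings errors above: they usually mean the client's keepalive pings are more aggressive than the server permits. A minimal sketch of relaxing the keepalive options on a plain Python gRPC channel follows; the option names are standard gRPC channel arguments, but whether and where FedScale sets them is an assumption:

import grpc

# Hypothetical address; FedScale's actual channel construction may differ.
MAX_MESSAGE_LENGTH = 1024 * 1024 * 1024

channel = grpc.insecure_channel(
    "0.0.0.0:50051",
    options=[
        ("grpc.max_send_message_length", MAX_MESSAGE_LENGTH),
        ("grpc.max_receive_message_length", MAX_MESSAGE_LENGTH),
        # Ping less frequently and only while calls are in flight, so the server
        # does not answer with ENHANCE_YOUR_CALM / too_many_pings.
        ("grpc.keepalive_time_ms", 60_000),
        ("grpc.keepalive_timeout_ms", 20_000),
        ("grpc.keepalive_permit_without_calls", 0),
        ("grpc.http2.max_pings_without_data", 2),
    ],
)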

Missing feedback when `- total_worker` is large

I'm trying to run the FEMNIST example with a two-layer Conv network on an Ubuntu server with 8 GPUs.
The following is my yaml:

# Configuration file of FAR training experiment

# ========== Cluster configuration ========== 
# ip address of the parameter server (need 1 GPU process)
ps_ip: 0.0.0.0

# ip address of each worker:# of available gpus process on each gpu in this node
# Note that if we collocate ps and worker on same GPU, then we need to decrease this number of available processes on that GPU by 1
# E.g., master node has 4 available processes, then 1 for the ps, and worker should be set to: worker:3
worker_ips:
    - 0.0.0.0:[7] # worker_ip: [(# processes on gpu) for gpu in available_gpus] eg. 10.0.0.2:[4,4,4,4] This node has 4 gpus, each gpu has 4 processes.


exp_path: /mnt/gaodawei.gdw/FedScale/fedscale/core

# Entry function of executor and aggregator under $exp_path
executor_entry: executor.py

aggregator_entry: aggregator.py

auth:
    ssh_user: "root"
    ssh_private_key: ~/.ssh/id_rsa

# cmd to run before we can indeed run FAR (in order)
setup_commands:
    - source /mnt/gaodawei.gdw/miniconda3/bin/activate fedscale
    - export NCCL_SOCKET_IFNAME='enp94s0f0'         # Run "ifconfig" to ensure the right NIC for nccl if you have multiple NICs

# ========== Additional job configuration ==========
# Default parameters are specified in argParser.py, wherein more description of the parameter can be found

job_conf:
    - job_name: femnist                   # Generate logs under this folder: log_path/job_name/time_stamp
    - log_path: /mnt/gaodawei.gdw/FedScale/evals # Path of log files
    - total_worker: 100                      # Number of participants per round, we use K=100 in our paper, large K will be much slower
    - data_set: femnist                     # Dataset: openImg, google_speech, stackoverflow
    - data_dir: /mnt/gaodawei.gdw/FedScale/dataset/data/femnist    # Path of the dataset
    - data_map_file: /mnt/gaodawei.gdw/FedScale/dataset/data/femnist/client_data_mapping/train.csv              # Allocation of data to each client, turn to iid setting if not provided
    - device_conf_file: /mnt/gaodawei.gdw/FedScale/dataset/data/device_info/client_device_capacity     # Path of the client trace
    - device_avail_file: /mnt/gaodawei.gdw/FedScale/dataset/data/device_info/client_behave_trace
    - model: convnet2                            # Models: e.g., shufflenet_v2_x2_0, mobilenet_v2, resnet34, albert-base-v2
    - gradient_policy: fed-avg                 # {"fed-yogi", "fed-prox", "fed-avg"}, "fed-avg" by default
    - eval_interval: 100                     # How many rounds to run a testing on the testing set
    - epochs: 400                          # Number of rounds to run this training. We use 1000 in our paper, while it may converge w/ ~400 rounds
    - rounds: 400
    - filter_less: 0                       # Remove clients w/ less than 21 samples
    - num_loaders: 2
    - yogi_eta: 3e-3
    - yogi_tau: 1e-8
    - local_steps: 10
    - learning_rate: 0.05
    - decay_factor: 1.  # decay factor of the learning rate
    - batch_size: 20
    - test_bsz: 20
    - malicious_factor: 4
    - use_cuda: True
    - input_dim: 3
    - sample_seed: 12345
    - test_ratio: 1.
    - loss_decay: 0.

FedScale gets blocked during training, so I added some logging in aggregator.py to monitor the running status as follows:

    def event_monitor(self):
        logging.info("Start monitoring events ...")

        while True:
            # Broadcast events to clients
            ...

            # Handle events queued on the aggregator
            elif len(self.sever_events_queue) > 0:
                client_id, current_event, meta, data = self.sever_events_queue.popleft()
                logging.info("Receive event {} from client {}".format(current_event, client_id))
                if current_event == events.UPLOAD_MODEL:
                    self.client_completion_handler(self.deserialize_response(data))
                    #
                    logging.info("Currently {}/{} clients has finished training".format(len(self.stats_util_accumulator), self.tasks_round))
                    if len(self.stats_util_accumulator) == self.tasks_round:
                        self.round_completion_handler()

                elif current_event == events.MODEL_TEST:
                    self.testing_completion_handler(client_id, self.deserialize_response(data))

                else:
                    logging.error(f"Event {current_event} is not defined")
                
            else:
                # execute every 100 ms
                time.sleep(0.1)

and I obtained the following logs:

...
(05-26) 14:13:28 INFO     [client.py:17] Start to train (CLIENT: 2167) ...
(05-26) 14:13:28 INFO     [aggregator.py:628] Receive event upload_model from client 4
(05-26) 14:13:28 INFO     [aggregator.py:632] Currently 92/100 clients has finished training
(05-26) 14:13:28 INFO     [client.py:188] Training of (CLIENT: 585) completes, {'clientId': 585, 'moving_loss': 4.0690016746521, 'trained_size': 200, 'success': True, 'utility': 751.3793800173086}
(05-26) 14:13:28 INFO     [client.py:17] Start to train (CLIENT: 2305) ...
(05-26) 14:13:28 INFO     [aggregator.py:628] Receive event upload_model from client 5
(05-26) 14:13:28 INFO     [aggregator.py:632] Currently 93/100 clients has finished training
(05-26) 14:13:29 INFO     [client.py:188] Training of (CLIENT: 92) completes, {'clientId': 92, 'moving_loss': 4.070298886299133, 'trained_size': 200, 'success': True, 'utility': 740.9585433536305}
(05-26) 14:13:29 INFO     [aggregator.py:628] Receive event upload_model from client 3
(05-26) 14:13:29 INFO     [aggregator.py:632] Currently 94/100 clients has finished training
(05-26) 14:13:29 INFO     [client.py:188] Training of (CLIENT: 2305) completes, {'clientId': 2305, 'moving_loss': 4.075016450881958, 'trained_size': 200, 'success': True, 'utility': 521.201964075297}
(05-26) 14:13:29 INFO     [aggregator.py:628] Receive event upload_model from client 5
(05-26) 14:13:29 INFO     [aggregator.py:632] Currently 95/100 clients has finished training
(05-26) 14:13:31 INFO     [client.py:188] Training of (CLIENT: 1607) completes, {'clientId': 1607, 'moving_loss': 3.9613564729690554, 'trained_size': 200, 'success': True, 'utility': 322.7957952931622}
Traceback (most recent call last):
  File "/mnt/gaodawei.gdw/miniconda3/envs/fedscale/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
    send_bytes(obj)
  File "/mnt/gaodawei.gdw/miniconda3/envs/fedscale/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/mnt/gaodawei.gdw/miniconda3/envs/fedscale/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/mnt/gaodawei.gdw/miniconda3/envs/fedscale/lib/python3.7/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
(05-26) 14:13:31 INFO     [client.py:188] Training of (CLIENT: 1821) completes, {'clientId': 1821, 'moving_loss': 3.9525305151939394, 'trained_size': 200, 'success': True, 'utility': 321.9902424194754}
(05-26) 14:13:31 INFO     [aggregator.py:628] Receive event upload_model from client 7
(05-26) 14:13:31 INFO     [aggregator.py:632] Currently 96/100 clients has finished training
(05-26) 14:13:31 INFO     [aggregator.py:628] Receive event upload_model from client 1
(05-26) 14:13:31 INFO     [aggregator.py:632] Currently 97/100 clients has finished training
(05-26) 14:13:33 INFO     [client.py:188] Training of (CLIENT: 137) completes, {'clientId': 137, 'moving_loss': 3.9952051639556885, 'trained_size': 200, 'success': True, 'utility': 643.8445367870678}
(05-26) 14:13:33 INFO     [aggregator.py:628] Receive event upload_model from client 2
(05-26) 14:13:33 INFO     [aggregator.py:632] Currently 98/100 clients has finished training
(05-26) 14:13:34 INFO     [client.py:188] Training of (CLIENT: 2167) completes, {'clientId': 2167, 'moving_loss': 4.092173397541046, 'trained_size': 200, 'success': True, 'utility': 821.8575235110941}
(05-26) 14:13:34 INFO     [aggregator.py:628] Receive event upload_model from client 4
(05-26) 14:13:34 INFO     [aggregator.py:632] Currently 99/100 clients has finished training

It seems like the server only gets 99 models from the clients and keeps waiting for the missing one.
I guess it may be related to the BrokenPipeError reported above?


So what can I do to deal with it?

ProcessGroupGloo error when running on more than one worker machine

Hi, I am trying to perform training based on the following config file for the femnist dataset. I can run the experiment using two virtual machines, one as the parameter server and the other as a worker. However, if I increase the number of workers to, say, two, I run into the following error (please see the next comment).

Any thought on this?

Incorrect download address for Open Images for detection dataset

The download address for the "Open Images for detection" dataset is incorrect, so the download of the dataset fails:

wget -O ${DIR}/open_images_detection.tar.gz https://fedscale.eecs.umich.edu/dataset/open_images_detection.tar.gz
echo "Dataset downloaded, now decompressing..."
tar -xf ${DIR}/open_images_detection.tar.gz -C ${DIR}
echo "Removing compressed file..."
rm -f ${DIR}/open_images_detection.tar.gz

client_data_mapping/train.csv Not Found

According to the example configuration file (e.g., Line 50 in core/evals/configs/openimage/conf.yml), we should have client_data_mapping/train.csv after downloading the OpenImage dataset by running

bash download.sh -A

under dataset. However, I cannot find the CSV file in either of these locations:

  • dataset/data/open_images;
  • dataset/open_images,

which leads to the following exception in the log file when I execute python manager.py submit configs/openimage/conf.yml under core/evals:

...
Traceback (most recent call last):
  File ".../FedScale/core/examples/poisoning_setting/customized_executor.py", line 25, in <module>
    executor.run()
  File ".../FedScale/core/examples/poisoning_setting/../../executor.py", line 129, in run
    self.training_sets, self.testing_sets = self.init_data()
  File ".../FedScale/core/examples/poisoning_setting/../../executor.py", line 103, in init_data
    train_dataset, test_dataset = init_dataset()
  File ".../FedScale/core/examples/poisoning_setting/../../fllibs.py", line 211, in init_dataset
    train_dataset = OpenImage(args.data_dir, dataset='train', transform=train_transform)
  File ".../FedScale/core/examples/poisoning_setting/../../utils/openimage.py", line 59, in __init__
    self.data, self.targets = self.load_file(self.path)
  File ".../FedScale/core/examples/poisoning_setting/../../utils/openimage.py", line 120, in load_file
    datas, labels = self.load_meta_data(os.path.join(self.processed_folder, 'client_data_mapping', self.data_file+'.csv'))
  File ".../FedScale/core/examples/poisoning_setting/../../utils/openimage.py", line 106, in load_meta_data
    with open(path) as csv_file:
FileNotFoundError: [Errno 2] No such file or directory: '.../FedScale/dataset/open_images/client_data_mapping/train.csv'

Actually, what is in dataset/open_images is only:

$ ls dataset/open_images/
classTags  clientDataMap  test  train

and what is in dataset/data/open_images is only:

$ ls dataset/data/open_images
clientDataMap  README.md

I therefore wonder if the download logic has changed and where I can find the train.csv. Could you please help me with this issue?

BatchNorm params not being collected / averaged properly across clients

The code uses model.parameters() to get all the learnable parameters from the clients and perform aggregation, e.g., in:

model_param = [param.data.cpu().numpy() for param in model.parameters()]
However, model.parameters() does not include BN statistics such as running_mean, running_var, or num_batches_tracked, so I think the code is currently not aggregating those statistics across clients properly (please correct me if I am wrong).
To include BN statistics, you can instead use, e.g., model.state_dict().items().
In my past experience, not aggregating BN statistics leads to poor accuracy, and you at least need to average them (how to handle BN stats correctly is still an ongoing area of research).
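
For illustration, a minimal sketch (not FedScale's actual aggregation code) of averaging full state_dicts so the BN buffers are included:

import copy
import torch

def average_state_dicts(client_states):
    """Average a list of state_dicts element-wise, including BN buffers such as
    running_mean / running_var (integer buffers like num_batches_tracked are
    averaged in float and cast back to their original dtype)."""
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        stacked = torch.stack([cs[key].float() for cs in client_states])
        avg[key] = stacked.mean(dim=0).to(avg[key].dtype)
    return avg

# usage sketch:
# global_model.load_state_dict(average_state_dicts([m.state_dict() for m in client_models]))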

I found this while struggling to make the FEMNIST benchmark work nicely.
Have you run FEMNIST before and gotten decent accuracy with this codebase? If so, I might be misunderstanding how to use your code -- in that case, I would love to know exactly what config you used.
Otherwise, I think fixing this bug might solve my issue and #49.

Confusion on the Calculation of Model Update Size

At Line 228 of core/aggregator.py, the run function calculates the model size in kbits.

I was wondering why the divisor in the expression is 1024 instead of 1000. According to my understanding, a bit rate is typically expressed with an SI prefix (so 1 kbit/s = 1000 bit/s) rather than a binary prefix (1 Kibit/s = 1024 bit/s).

  • If the client bandwidth information used in Oort and FedScale is in kbps, which actually means kbit/s, such an inconsistency skews the results away from reality. (It doesn't mean your results make no sense, as we can view it as scaling up all clients' bandwidth by a factor of 1024/1000, which is very close to 1 and presumably has a negligible side effect.)
  • On the other hand, if the client bandwidth information you used was actually in Kibit/s, then surely nothing is wrong at all.

Anyway, you may consider changing the divisor to 1000 if the common consensus is that a bit rate is expressed with an SI prefix. This would be particularly helpful if a user wants to plug in their own capacity dataset.
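
A small sketch of the suggested computation (a hypothetical helper, not the exact expression in aggregator.py), using the SI prefix so that 1 kbit = 1000 bit:

def model_update_size_kbit(model) -> float:
    """Size of one model update in kbit, assuming 32-bit floats per parameter."""
    num_params = sum(p.numel() for p in model.parameters())
    return num_params * 32 / 1000.0  # SI prefix: divide by 1000, not 1024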

KeyError: 'get_server_event_que[X]' in executor.py

Description

The server_event_queue in executor.py throws a KeyError because 'get_server_event_que' + str(self.this_rank) has not been registered on self.control_manager for the second executor and beyond:

self.server_event_queue = eval('self.control_manager.get_server_event_que'+str(self.this_rank)+'()')

Environment to reproduce:

OS: Ubuntu 18.03 with bash as the default shell
Running on CPU with more than one worker

Logs

Logs and Error message:

(11-20) 02:28:59 INFO     [executor.py:142] Started GRPC server at [::]:50002
(11-20) 02:28:59 INFO     [executor.py:160] Successfully connect to the aggregator
Traceback (most recent call last):
  File "/home/ubuntu/FedScale/core/executor.py", line 349, in <module>
    executor.run()
  File "/home/ubuntu/FedScale/core/executor.py", line 204, in run
    self.setup_communication()
  File "/home/ubuntu/FedScale/core/executor.py", line 112, in setup_communication
    self.init_control_communication(self.args.ps_ip, self.args.manager_port)
  File "/home/ubuntu/FedScale/core/executor.py", line 162, in init_control_communication
    self.server_event_queue = eval('self.control_manager.get_server_event_que'+str(self.this_rank)+'()')
  File "<string>", line 1, in <module>
  File "/home/ubuntu/anaconda3/envs/fedscale/lib/python3.6/multiprocessing/managers.py", line 662, in temp
    token, exp = self._create(typeid, *args, **kwds)
  File "/home/ubuntu/anaconda3/envs/fedscale/lib/python3.6/multiprocessing/managers.py", line 556, in _create
    id, exposed = dispatch(conn, None, 'create', (typeid,)+args, kwds)
  File "/home/ubuntu/anaconda3/envs/fedscale/lib/python3.6/multiprocessing/managers.py", line 82, in dispatch
    raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError: 
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/fedscale/lib/python3.6/multiprocessing/managers.py", line 195, in handle_request
    result = func(c, *args, **kwds)
  File "/home/ubuntu/anaconda3/envs/fedscale/lib/python3.6/multiprocessing/managers.py", line 353, in create
    self.registry[typeid]
KeyError: 'get_server_event_que2'
---------------------------------------------------------------------------

Solution

The manager.py separates the worker commands using ;. In a Linux environment with bash as the default shell, the ; is interpreted as the end of a command, which causes a problem when running executor.py over ssh.

Overall Test Accuracy Calculation and Test Datasets

I have seen studies that use the aggregated model to evaluate accuracy over a test dataset. However, in aggregator.py, the overall test accuracy is calculated by averaging each individual client's test accuracy:

FedScale/core/aggregator.py

Lines 398 to 399 in 33de86d

for key in accumulator:
    accumulator[key] += self.test_result_accumulator[i][key]

I am wondering whether these two approaches end up with the same result. Have you seen any other references that use the same approach for calculating overall accuracy?

In addition, regardless of which participating clients are chosen for training in each round (epoch), test accuracy is always evaluated on the data partition indexed by the executor's rank id (self.this_rank):

data_loader = select_dataset(self.this_rank, self.testing_sets, batch_size=args.test_bsz, isTest=True, collate_fn=self.collate_fn)

I am wondering if these factors might have some effect on the low accuracy for the femnist experiment related to #57 and #49? Can you provide any references for the above cases?
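
For comparison, a minimal sketch of the first approach, i.e., evaluating the aggregated model directly on one held-out test set; it assumes a plain PyTorch model and a single test loader, not FedScale's executor API:

import torch

@torch.no_grad()
def centralized_accuracy(global_model, test_loader, device="cpu"):
    """Evaluate the aggregated (global) model on one held-out test set,
    instead of averaging the per-client accuracies."""
    global_model.eval().to(device)
    correct, total = 0, 0
    for x, y in test_loader:
        preds = global_model(x.to(device)).argmax(dim=1)
        correct += (preds == y.to(device)).sum().item()
        total += y.size(0)
    return correct / max(total, 1)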

Inconsistent gradient_policy variable values in example config files and core files

In the given example config files (e.g., core/evals/configs/openimage/conf.yml), the comments suggest values for the gradient_policy variable such as fed-avg, fed-prox, and fed-yogi, while different values are used in optimizer.py. For instance:

elif self.mode =='qfedavg':

if conf.gradient_policy == 'prox':

if self.mode == 'yogi':

It is also mentioned that the default value for gradient_policy is fed-avg; however, it does not seem to be defined anywhere. Clarification is appreciated.
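
A minimal sketch of one way to reconcile the two naming schemes (a hypothetical helper, not the current optimizer.py logic):

# Names used in the example config comments -> names checked in optimizer.py.
GRADIENT_POLICY_ALIASES = {
    "fed-avg": "fedavg",
    "fed-prox": "prox",
    "fed-yogi": "yogi",
    "q-fedavg": "qfedavg",
}

def normalize_gradient_policy(value):
    """Map a config-file policy name to the internal one, defaulting to FedAvg."""
    return GRADIENT_POLICY_ALIASES.get(value, "fedavg") if value else "fedavg"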

SentencePiece not Found in the Environment

When I ran Albert on the Reddit dataset, I came across the following exception:

Traceback (most recent call last):
  File ".../FedScale/core/aggregator.py", line 509, in <module>
    aggregator.run()
  File ".../FedScale/core/aggregator.py", line 225, in run
    self.model = self.init_model()
  File ".../FedScale/core/aggregator.py", line 132, in init_model    return init_model()
  File ".../FedScale/core/fllibs.py", line 83, in init_model
    tokenizer = AlbertTokenizer.from_pretrained(args.model, do_lower_case=True)
  File ".../anaconda3/envs/fedscale/lib/python3.6/site-packages/transformers/utils/dummy_sentencepiece_objects.py", line 11, in from_pretrained
    requires_backends(cls, ["sentencepiece"])
  File ".../anaconda3/envs/fedscale/lib/python3.6/site-packages/transformers/file_utils.py", line 612, in requires_backends
    raise ImportError("".join([BACKENDS_MAPPING[backend][1].format(name) for backend in backends]))
ImportError: 
AlbertTokenizer requires the SentencePiece library but it was not found in your environment. Checkout the instructions on the
installation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones
that match your environment.

which was resolved after I ran

pip install sentencepiece

in the fedscale environment. I am thus wondering if you would like to add this dependency to environment.yml to better support users.
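
Besides adding it to environment.yml, a tiny guard along these lines (purely a sketch) would also make the failure mode clearer before the tokenizer is built:

# Fail early with an actionable message if the optional backend is missing.
try:
    import sentencepiece  # noqa: F401
except ImportError as err:
    raise ImportError(
        "AlbertTokenizer requires SentencePiece; run `pip install sentencepiece` "
        "inside the fedscale environment."
    ) from err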

Running FEMNIST tutorial on local machine gives a few warnings.

  1. [W ParallelNative.cpp:229] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
  2. /Users/mosharaf/opt/anaconda3/envs/fedscale/lib/python3.7/site-packages/torchvision/transforms/functional_pil.py:42: DeprecationWarning: FLIP_LEFT_RIGHT is deprecated and will be removed in Pillow 10 (2023-07-01). Use Transpose.FLIP_LEFT_RIGHT instead. return img.transpose(Image.FLIP_LEFT_RIGHT)

Training seems to continue.

Delaying worker initializations

When running on a single host, around 10 of the worker initializations made in the following line fail. This is possibly due to exceeding the number of new concurrent connections allowed by sshd (MaxStartups).

subprocess.Popen(f'ssh {submit_user}{worker} "{setup_cmd} {worker_cmd}"',

Waiting for the previous connections to succeed, just like it is done for the PS, has resolved this issue for me.

time.sleep(3)

This issue might not surface in multi-node setups due to network latencies inherently delaying requests for new connections, as opposed to single-node setups where the request never leaves the node (i.e. when relayed to 127.0.0.1).

Please consider adding a delay before initiating the next connection.
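
A minimal sketch of the proposed pacing (the hosts, user, and commands below are placeholders, not FedScale's actual manager.py code):

import subprocess
import time

# Placeholder values for illustration only.
submit_user = "ubuntu@"
setup_cmd = "source ~/anaconda3/bin/activate fedscale &&"
workers = ["127.0.0.1", "127.0.0.1", "127.0.0.1"]

for rank, worker in enumerate(workers, start=1):
    worker_cmd = f"python executor.py --this_rank={rank}"
    subprocess.Popen(f'ssh {submit_user}{worker} "{setup_cmd} {worker_cmd}"', shell=True)
    time.sleep(3)  # pace new ssh connections so sshd's MaxStartups is not exceeded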

Add additional commands in manager.py

For example, status or job-status to see the status of the current job; view-config to see the configuration file used. We can discuss here to come up with a list of common things someone may need.

Need a better structure for eval config files (yml), especially job_conf

It's a completely flat structure under job_conf, which makes it hard to follow and organize. A better hierarchical organization will allow us, for example, to separate aggregator-related params from worker-related ones, job-related params from trace-related params, etc.

Sampler Choice

Should the samplers be "random" and "oort", instead of "random" and "kuiper" as mentioned in the file?

FedScale/core/aggregator.py

Lines 129 to 142 in fe2eb88

def init_client_manager(self, args):
    """
    Currently we implement two client managers:
    1. Random client sampler
        - it selects participants randomly in each round
        - [Ref]: https://arxiv.org/abs/1902.01046
    2. Kuiper sampler
        - Kuiper prioritizes the use of those clients who have both data that offers the greatest utility
          in improving model accuracy and the capability to run training quickly.
        - [Ref]: https://arxiv.org/abs/2010.06081
    """
    # sample_mode: random or kuiper
    client_manager = clientManager(args.sample_mode, args=args)

Legacy in Tutorial

We notice there are some legacy issues in this tutorial. For example:

  • Please clone from this repo instead of AmberLJC/FedScale.git;
  • from fedscale instead of from FedScale?
  • Typo in Partition the dataset by the clientclient_data_mapping file

Can you please try to update it? Thanks!

Pull mechanism for client-server communication

FedScale is currently built under the assumption that the central server communicates with clients using a push mechanism, where the server initiates signals (handshake/training/notTraining/etc.) to available devices. Is it possible to consider a pull-based system where the device initiates the communication by periodically sending requests to the server to ask for its next action? In a realistic setting, the device would ping the server when its training criteria are met (enough data, sufficient battery, app open, etc.), and the server would respond with the model gradients for federated training if the client is selected.
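
A rough sketch of what such a pull-based client loop could look like (every name here is hypothetical; FedScale exposes no such API today):

import random
import time

def training_criteria_met() -> bool:
    # e.g., enough local data, sufficient battery, device idle/charging, app open
    return random.random() < 0.5

def poll_server(client_id: int) -> dict:
    # placeholder for an RPC such as stub.PollTask(client_id); the server decides
    # whether this client is selected for the current round
    return {"action": "train"} if random.random() < 0.2 else {"action": "wait"}

def client_loop(client_id: int, poll_interval_s: float = 60.0) -> None:
    while True:
        if training_criteria_met():
            task = poll_server(client_id)
            if task["action"] == "train":
                # download the global model, run local steps, upload the update
                pass
        time.sleep(poll_interval_s)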

install.sh issues

The current install.sh assumes a lot of things. Until we have a very sophisticated version that auto-detects OS, GPU, etc. and does a lot of the heavy lifting, a slightly improved approach could be to give specific instructions for what to do, where

  1. the user installs Anaconda on their own;
  2. then sets up the conda environment using a yml;
  3. then optionally installs CUDA

environment.yml also has too many things to install that may only be needed for specific dataset/model combinations. Instead of forcing people to install everything, we can make dataset- or model-specific pieces part of the dataset download.sh step, which will update the conda environment. This will avoid wholesale failure.

Backstory: I was trying to install FedScale on a pre-m1 macbook pro.

low accuracy

time_to_acc.pdf
I have run the femnist task with the following setting, but the accuracy has not changed after about 700 rounds. The training loss is around 15; any suggestions?

ps_ip: 127.0.0.1
worker_ips: 
- 127.0.0.1:[2] # worker_ip: [(# processes on gpu) for gpu in available_gpus] eg. 10.0.0.2:[4,4,4,4] This node has 4 gpus
exp_path: /opt/shared/Loss/newFedScale/core

executor_entry: executor.py
aggregator_entry: aggregator.py


auth:
     ssh_user: ""
    ssh_private_key: ~/.ssh/id_rsa
setup_commands:
    - source /opt/shared/anaconda3/bin/activate fedscale
    - export NCCL_SOCKET_IFNAME='enp94s0f0'         
job_conf: 
    - job_name: femnist                   # Generate logs under this folder: log_path/job_name/time_stamp
    - log_path: /opt/shared/Loss/newFedScale/core/evals # Path of log files
    - task: cv
    - total_worker: 17                    # Number of participants per round, we use K=100 in our paper, large K will be much slower
    - data_set: femnist                     # Dataset: openImg, google_speech, stackoverflow
    - data_dir: /opt/shared/Loss/newFedScale/dataset/data/femnist    # Path of the dataset
    - data_map_file: /opt/shared/Loss/newFedScale/dataset/data/femnist/client_data_mapping/train.csv              
    - device_conf_file: /opt/shared/Loss/newFedScale/dataset/data/device_info/client_device_capacity     # Path of the client trace
    - device_avail_file: /opt/shared/Loss/newFedScale/dataset/data/device_info/client_behave_trace
    - model: shufflenet_v2_x2_0                            # Models: e.g., shufflenet_v2_x2_0, mobilenet_v2, resnet34, albert-base-v2
    - gradient_policy: fed-avg                 # {"fed-yogi", "fed-prox", "fed-avg"}, "fed-avg" by default
    - eval_interval: 5                     # How many rounds to run a testing on the testing set
    - epochs: 1000                           # Number of rounds to run this training. We use 1000 in our paper, while it may converge w/ ~400 rounds
    - filter_less: 21                       # Remove clients w/ less than 21 samples
    - num_loaders: 2
    - yogi_eta: 3e-5 
    - yogi_tau: 1e-8
    - local_steps: 10
    - learning_rate: 0.05
    - batch_size: 2
    - test_bsz: 20

Dataloader segmentation fault

mosharaf@Mosharafs-MacBook-Pro:~/Work/FedScale/dataset|master ⇒  python
Python 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 05:57:50)
[Clang 11.1.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from torch.utils.data import DataLoader
[1]    26092 segmentation fault  python

FileNotFoundError in load_global_model function

Description

Occasionally, a FileNotFoundError is thrown for temp_model_path in the load_global_model function of executor.py:

FedScale/core/executor.py

Lines 238 to 242 in 33de86d

def load_global_model(self):
    # load last global model
    with open(self.temp_model_path, 'rb') as model_in:
        model = pickle.load(model_in)
    return model
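
One possible mitigation (a sketch only, and it does not address the underlying race between writer and reader) is to wait briefly for the file to appear before loading it:

import os
import pickle
import time

def load_global_model_with_wait(temp_model_path, timeout_s=30.0, poll_s=0.5):
    """Wait for the temp model file to be written before loading it,
    instead of failing immediately with FileNotFoundError."""
    deadline = time.time() + timeout_s
    while not os.path.exists(temp_model_path) and time.time() < deadline:
        time.sleep(poll_s)
    with open(temp_model_path, 'rb') as model_in:
        return pickle.load(model_in)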

Environment

OS: Ubuntu 18.03 with bash as the default shell
Running on CPU with more than one worker

Error logs

(11-20) 15:03:22 INFO     [executor.py:326] Executor 2: Received (Event:TRAIN) from aggregator
Traceback (most recent call last):
  File "/home/ubuntu/FedScale/core/executor.py", line 349, in <module>
    executor.run()
  File "/home/ubuntu/FedScale/core/executor.py", line 205, in run
    self.event_monitor()
  File "/home/ubuntu/FedScale/core/executor.py", line 332, in event_monitor
    train_res = self.training_handler(clientId=clientId, conf=client_conf)
  File "/home/ubuntu/FedScale/core/executor.py", line 266, in training_handler
    client_model = self.load_global_model()
  File "/home/ubuntu/FedScale/core/executor.py", line 240, in load_global_model
    with open(self.temp_model_path, 'rb') as model_in:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/FedScale/core/evals/logs/femnist/1120_150258/executor/model_3.pth.tar'
Traceback (most recent call last):
  File "/home/ubuntu/FedScale/core/executor.py", line 349, in <module>
    executor.run()
  File "/home/ubuntu/FedScale/core/executor.py", line 205, in run
    self.event_monitor()
  File "/home/ubuntu/FedScale/core/executor.py", line 332, in event_monitor
    train_res = self.training_handler(clientId=clientId, conf=client_conf)
  File "/home/ubuntu/FedScale/core/executor.py", line 266, in training_handler
    client_model = self.load_global_model()
  File "/home/ubuntu/FedScale/core/executor.py", line 240, in load_global_model
    with open(self.temp_model_path, 'rb') as model_in:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/FedScale/core/evals/logs/femnist/1120_150258/executor/model_2.pth.tar'
(11-20) 15:03:23 INFO     [executor.py:326] Executor 1: Received (Event:TRAIN) from aggregator

Config file:

ps_ip: 10.30.72.19

worker_ips: 
    - 10.30.72.9:[1]
    - 10.30.72.29:[1]
    - 10.30.72.30:[1]
    # - 10.30.72.31:[1]
    # - 10.30.72.32:[1]
    # - 10.30.72.33:[1]
    # - 10.30.72.34:[1]

exp_path: $HOME/FedScale/core

executor_entry: executor.py

aggregator_entry: aggregator.py

auth:
    ssh_user: ""
    ssh_private_key: ~/.ssh/id_rsa

setup_commands:
    - source $HOME/anaconda3/bin/activate fedscale
    - export NCCL_SOCKET_IFNAME='eth0' 

job_conf: 
    - job_name: femnist
    - log_path: $HOME/FedScale/core/evals
    - total_worker: 4
    - data_set: femnist
    - data_dir: $HOME/FedScale/dataset/data/femnist
    - data_map_file: $HOME/FedScale/dataset/data/femnist/client_data_mapping/train.csv
    - gradient_policy: yogi
    - eval_interval: 30
    - epochs: 3
    - filter_less: 21
    - num_loaders: 1
    - yogi_eta: 3e-3 
    - yogi_tau: 1e-8
    - local_steps: 20
    - learning_rate: 0.05
    - batch_size: 20
    - test_bsz: 20
    - malicious_factor: 4

$FEDSCALE_HOME / $FS_HOME environment variable

In many places, FedScale uses $HOME, assuming that the user has installed FedScale in their $HOME directory. Instead, it should define a $FEDSCALE_HOME variable (or something like that) and build paths relative to it, to avoid scripts silently failing.
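
A minimal sketch of what the scripts could do instead (the variable name FEDSCALE_HOME is the proposal here, not something FedScale currently defines):

import os

FEDSCALE_HOME = os.environ.get("FEDSCALE_HOME")
if FEDSCALE_HOME is None:
    raise EnvironmentError(
        "Please set FEDSCALE_HOME to the root of your FedScale checkout"
    )

# Build paths relative to the checkout instead of assuming $HOME/FedScale.
data_dir = os.path.join(FEDSCALE_HOME, "dataset", "data", "femnist")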

Basic tutorial shouldn't use openimage dataset

It's too large for anyone to have the patience for; it may even be prohibitive in terms of disk space on someone's laptop. The tutorial should be based on a smaller dataset like FEMNIST.

There is also a lot of repetition between the different tutorials, which doesn't make sense.

Request for a Complete Version of AI Benchmark

According to my observation, the client system behavior dataset that FedScale's examples currently rely on, i.e., the file dataset/data/client_device_capacity, is a pickled Python dictionary containing 500,000 keys, each of which represents a client and is only associated with two values: an inference speed and a network bandwidth. In other words, it is both model-agnostic and data-agnostic when it comes to estimating the training speed: no matter (1) how large the ML model is or (2) how large a data sample is, the estimated time for training over a mini-batch always stays the same as long as the batch size does not vary, which is counterfactual.

Hence, I am thinking that client_device_capacity is merely a small subset of the AI Benchmark dataset that FedScale claims to embrace. According to Sec. 3.2 in the paper,

AI Benchmark provides the training and inference speed of diverse models (e.g., MobileNet) across a wide range of device models.

I hereby request a complete version of AI Benchmark to gain more confidence in the time-related metrics reported by FedScale. Looking forward to your help!
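
For reference, a sketch of how one can inspect the trace described above (the path follows the configs earlier on this page; the exact per-client field layout is an assumption):

import pickle

with open("dataset/data/device_info/client_device_capacity", "rb") as f:
    capacity = pickle.load(f)

print(len(capacity))       # roughly 500,000 client entries
first_id = next(iter(capacity))
print(capacity[first_id])  # two values per client, e.g. compute speed and bandwidth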

Warning: Leaking Caffe2 thread-pool after fork

Hi Authors,

I noticed that environment.yml underwent a change last month. In particular, the previous version required torch==1.1.0 and torchvision==0.4.0. As torchvision==0.4.0 actually requires torch 1.2.0, an environment installation based on that version of environment.yml ends up with

torch==1.2.0
torchvision==0.4.0

In the latest version, on the other hand, the versions of torch and torchvision are not pinned, so they resolve to the current stable release. As of today, an installation based on this version of environment.yml yields

torch==1.9.0+cu102
torchvision==0.10.0+cu102

It would definitely be great to see FedScale run on the latest versions of torch and torchvision. However, in practice, I found that the latest torch and/or torchvision produce many, many warning messages like

[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)

in the log file whenever I perform a training task. This is quite space-consuming, and my current workaround is to downgrade torch and torchvision to

torch==1.2.0
torchvision==0.4.0

This is inspired by the related discussion on this site. I am curious whether you have a more decent workaround for this issue. Also, let me know if you think my solution is a general one, and I can create/extend a corresponding PR ;)

Missing FEMNIST dataset client data mapping files

When trying to run experiments using the femnist dataset, a problem arises because the following files are missing:

  • train_img_to_path
  • train_img_to_target
  • test_img_to_path
  • test_img_to_target

    with open(os.path.join(path, 'train_img_to_path'), 'rb') as f:
        rawImg = pickle.load(f)
        rawPath = pickle.load(f)
    with open(os.path.join(path, 'train_img_to_target'), 'rb') as f:
        rawTags = pickle.load(f)
else:
    with open(os.path.join(path, 'test_img_to_path'), 'rb') as f:
        rawImg = pickle.load(f)
        rawPath = pickle.load(f)
    with open(os.path.join(path, 'test_img_to_target'), 'rb') as f:
        rawTags = pickle.load(f)

ClientOptimizer.update_client_weight arguments inconsistency

In client.py, the update_client_weight function is called with conf.gradient_policy as its first argument.

self.optimizer.update_client_weight(conf.gradient_policy, model, global_model if global_model is not None else None)

However, in the update_client_weight definition, the conf parameter is expected to be the full configuration object:

def update_client_weight(self, conf, model, global_model=None):
    if self.mode == 'prox':
        for idx, param in enumerate(model.parameters()):
            param.data += conf.learning_rate * conf.proxy_mu * (param.data - global_model[idx])
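
A minimal sketch of one way to make the call site and the signature agree (this is a suggestion, not necessarily how the maintainers resolved it): pass the full conf object and keep the policy on the optimizer itself.

class ClientOptimizer:
    def __init__(self, mode: str):
        self.mode = mode  # e.g. 'prox', set once from conf.gradient_policy

    def update_client_weight(self, conf, model, global_model=None):
        # conf is the full config object here, so conf.learning_rate and
        # conf.proxy_mu resolve as the body expects.
        if self.mode == 'prox' and global_model is not None:
            for idx, param in enumerate(model.parameters()):
                param.data += conf.learning_rate * conf.proxy_mu * \
                              (param.data - global_model[idx])

# the call site would then be:
# self.optimizer.update_client_weight(conf, model, global_model)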

Inconsistency in the dataset directory

  1. The README says 20 datasets, the download script has 16 or so, the data directory has 15 or 16.
  2. Naming of the datasets is inconsistent too; e.g., iNature vs. iNaturalist.
  3. Using a single letter per dataset in the download script is also confusing and short-sighted; there may be more datasets than letters in the alphabet. A better convention would be --dataset-name.

Bug in qfedavg

hs += (qfedq * np.float_power(loss+1e-10, (qfedq-1)) * torch.norm(grads, 2) + (1.0/learning_rate) * np.float_power(loss+1e-10, qfedq))

I am not sure why, but I got the following error when running this line:

AttributeError: 'list' object has no attribute 'dim'

And I changed it to:
hs += (qfedq * np.float_power(loss+1e-10, (qfedq-1)) * torch.norm(torch.stack(([torch.norm( g, 2) for g in grads]))) + (1.0/learning_rate) * np.float_power(loss+1e-10, qfedq))
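
For what it's worth, the error happens because grads is a Python list of per-parameter tensors, which torch.norm cannot consume directly; flattening and concatenating first gives the same global L2 norm as the stacked-norms fix above. A sketch:

import torch

def global_l2_norm(grads):
    """L2 norm over a list of per-parameter gradient tensors."""
    return torch.norm(torch.cat([g.reshape(-1) for g in grads]), p=2)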

Add Tensorflow support

Ideal solution would be native support. If that's too difficult, as a first step, we should add a tutorial.

Documentation site

Creating a placeholder to keep track of documentation status. There is no other obvious place.
