
symbioticlab / fedscale

385 stars, 11 watchers, 121 forks, 80.7 MB

FedScale is a scalable and extensible open-source federated learning (FL) platform.

Home Page: https://fedscale.ai

License: Apache License 2.0

Shell 0.49% Python 69.18% MATLAB 0.04% C++ 26.00% Cuda 1.49% C 1.45% Cython 0.30% Java 1.04% CMake 0.02%
benchmark dataset machine-learning federated-learning icml deep-learning mlsys osdi deployment distributed

fedscale's People

Contributors

amberljc, andli28, chuheng001, continue-revolution, donggook-me, dywsjtu, ericdinging, etesami, ewenw, fanlai0990, ikace, li1553770945, liu-yunzhen, mosharaf, qinyeli, romero027, samuelgong, singam-sanjay, xsh-sh, xuyangm, yuxuan18


fedscale's Issues

GRPC error

Hi, thanks for building this FL library!
I'm trying to run a toy example with FEMNIST. However, the program reports a gRPC error as follows:

E0526 04:57:27.528504823   23761 chttp2_transport.cc:1115]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E0526 04:57:27.528580558   23761 chttp2_transport.cc:1115]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E0526 04:57:27.528625026   23761 client_channel.cc:691]      chand=0x559f9651a5c8: Illegal keepalive throttling value 9223372036854775807
(05-26) 04:58:28 INFO     [client.py:188] Training of (CLIENT: 836) completes, {'clientId': 836, 'moving_loss': 4.184056043624878, 'trained_size': 400, 'success': True, 'utility': 857.4471839768071}
Traceback (most recent call last):
  File "/mnt/gaodawei.gdw/FedScale/fedscale/core/executor.py", line 346, in <module>
    executor.run()
  File "/mnt/gaodawei.gdw/FedScale/fedscale/core/executor.py", line 141, in run
    self.event_monitor()
  File "/mnt/gaodawei.gdw/FedScale/fedscale/core/executor.py", line 318, in event_monitor
    client_id, train_res = self.Train(train_config)
  File "/mnt/gaodawei.gdw/FedScale/fedscale/core/executor.py", line 172, in Train
    meta_result = None, data_result = None
  File "/mnt/gaodawei.gdw/miniconda3/envs/fedscale/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/mnt/gaodawei.gdw/miniconda3/envs/fedscale/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Broken pipe"
	debug_error_string = "{"created":"@1653541108.923041466","description":"Error received from peer ipv4:0.0.0.0:1000","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Broken pipe","grpc_status":14}"

It seems like there are too many pings for the port?
Here is my yaml:

# Configuration file of FAR training experiment

# ========== Cluster configuration ========== 
# ip address of the parameter server (need 1 GPU process)
# ps_ip: 10.0.0.1
ps_ip:  0.0.0.0

# ip address of each worker:# of available gpus process on each gpu in this node
# Note that if we collocate ps and worker on same GPU, then we need to decrease this number of available processes on that GPU by 1
# E.g., master node has 4 available processes, then 1 for the ps, and worker should be set to: worker:3
worker_ips: 
    - 0.0.0.0:[4,4]
#    - 10.0.0.1:[4] # worker_ip: [(# processes on gpu) for gpu in available_gpus] eg. 10.0.0.2:[4,4,4,4] This node has 4 gpus, each gpu has 4 processes.

exp_path: /mnt/gaodawei.gdw/FedScale/fedscale/core

# Entry function of executor and aggregator under $exp_path
executor_entry: executor.py
aggregator_entry: aggregator.py

auth:
    ssh_user: "root"
    ssh_private_key: ~/.ssh/id_rsa

# cmd to run before we can indeed run FAR (in order)
setup_commands:
    - source /mnt/gaodawei.gdw/miniconda3/bin/activate fedscale
    - export NCCL_SOCKET_IFNAME='enp94s0f0'         # Run "ifconfig" to ensure the right NIC for nccl if you have multiple NICs

# ========== Additional job configuration ==========
# Default parameters are specified in argParser.py, wherein more description of the parameter can be found

job_conf: 
    - job_name: femnist                   # Generate logs under this folder: log_path/job_name/time_stamp
    - log_path: /mnt/gaodawei.gdw/FedScale/evals # Path of log files
    - total_worker: 2                      # Number of participants per round, we use K=100 in our paper, large K will be much slower
    - data_set: femnist                     # Dataset: openImg, google_speech, stackoverflow
    - data_dir: /mnt/gaodawei.gdw/FedScale/dataset/data/femnist    # Path of the dataset
    - data_map_file: /mnt/gaodawei.gdw/FedScale/dataset/data/femnist/client_data_mapping/train.csv     # Allocation of data to each client, turn to iid setting if not provided
    - device_conf_file: /mnt/gaodawei.gdw/FedScale/dataset/data/device_info/client_device_capacity     # Path of the client trace
    - device_avail_file: /mnt/gaodawei.gdw/FedScale/dataset/data/device_info/client_behave_trace
    - model: convnet2                            # Models: e.g., shufflenet_v2_x2_0, mobilenet_v2, resnet34, albert-base-v2
    - gradient_policy: fed-avg                 # {"fed-yogi", "fed-prox", "fed-avg"}, "fed-avg" by default
    - eval_interval: 10                     # How many rounds to run a testing on the testing set
    - epochs: 50                          # Number of rounds to run this training. We use 1000 in our paper, while it may converge w/ ~400 rounds
    - filter_less: 0                       # Remove clients w/ less than 21 samples
    - num_loaders: 8
    - yogi_eta: 3e-3 
    - yogi_tau: 1e-8
    - local_steps: 20
    - learning_rate: 0.01
    - batch_size: 20
    - test_bsz: 20
    - malicious_factor: 4
    - use_cuda: True
    - decay_factor: 1.  # decay factor of the learning rate
    - sample_seed: 12345
    - test_ratio: 1.
    - loss_decay: 0.
    - input_dim: 3
    - overcommitment: 1.5
    - use_cuda: False
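
Regarding the GOAWAY / too_many_pings errors above: they usually mean the client's keepalive pings are more aggressive than the server permits. A minimal sketch of relaxing the keepalive options on a plain Python gRPC channel follows; the option names are standard gRPC channel arguments, but whether and where FedScale sets them is an assumption:

import grpc

# Hypothetical address; FedScale's actual channel construction may differ.
MAX_MESSAGE_LENGTH = 1024 * 1024 * 1024

channel = grpc.insecure_channel(
    "0.0.0.0:50051",
    options=[
        ("grpc.max_send_message_length", MAX_MESSAGE_LENGTH),
        ("grpc.max_receive_message_length", MAX_MESSAGE_LENGTH),
        # Ping less frequently and only while calls are in flight, so the server
        # does not answer with ENHANCE_YOUR_CALM / too_many_pings.
        ("grpc.keepalive_time_ms", 60_000),
        ("grpc.keepalive_timeout_ms", 20_000),
        ("grpc.keepalive_permit_without_calls", 0),
        ("grpc.http2.max_pings_without_data", 2),
    ],
)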

Missing feedback when `- total_worker` is large

I'm trying to run the FEMNIST example with a two-layer Conv network on an Ubuntu server with 8 GPUs.
The following is my yaml:

# Configuration file of FAR training experiment

# ========== Cluster configuration ========== 
# ip address of the parameter server (need 1 GPU process)
ps_ip: 0.0.0.0

# ip address of each worker:# of available gpus process on each gpu in this node
# Note that if we collocate ps and worker on same GPU, then we need to decrease this number of available processes on that GPU by 1
# E.g., master node has 4 available processes, then 1 for the ps, and worker should be set to: worker:3
worker_ips:
    - 0.0.0.0:[7] # worker_ip: [(# processes on gpu) for gpu in available_gpus] eg. 10.0.0.2:[4,4,4,4] This node has 4 gpus, each gpu has 4 processes.


exp_path: /mnt/gaodawei.gdw/FedScale/fedscale/core

# Entry function of executor and aggregator under $exp_path
executor_entry: executor.py

aggregator_entry: aggregator.py

auth:
    ssh_user: "root"
    ssh_private_key: ~/.ssh/id_rsa

# cmd to run before we can indeed run FAR (in order)
setup_commands:
    - source /mnt/gaodawei.gdw/miniconda3/bin/activate fedscale
    - export NCCL_SOCKET_IFNAME='enp94s0f0'         # Run "ifconfig" to ensure the right NIC for nccl if you have multiple NICs

# ========== Additional job configuration ==========
# Default parameters are specified in argParser.py, wherein more description of the parameter can be found

job_conf:
    - job_name: femnist                   # Generate logs under this folder: log_path/job_name/time_stamp
    - log_path: /mnt/gaodawei.gdw/FedScale/evals # Path of log files
    - total_worker: 100                      # Number of participants per round, we use K=100 in our paper, large K will be much slower
    - data_set: femnist                     # Dataset: openImg, google_speech, stackoverflow
    - data_dir: /mnt/gaodawei.gdw/FedScale/dataset/data/femnist    # Path of the dataset
    - data_map_file: /mnt/gaodawei.gdw/FedScale/dataset/data/femnist/client_data_mapping/train.csv              # Allocation of data to each client, turn to iid setting if not provided
    - device_conf_file: /mnt/gaodawei.gdw/FedScale/dataset/data/device_info/client_device_capacity     # Path of the client trace
    - device_avail_file: /mnt/gaodawei.gdw/FedScale/dataset/data/device_info/client_behave_trace
    - model: convnet2                            # Models: e.g., shufflenet_v2_x2_0, mobilenet_v2, resnet34, albert-base-v2
    - gradient_policy: fed-avg                 # {"fed-yogi", "fed-prox", "fed-avg"}, "fed-avg" by default
    - eval_interval: 100                     # How many rounds to run a testing on the testing set
    - epochs: 400                          # Number of rounds to run this training. We use 1000 in our paper, while it may converge w/ ~400 rounds
    - rounds: 400
    - filter_less: 0                       # Remove clients w/ less than 21 samples
    - num_loaders: 2
    - yogi_eta: 3e-3
    - yogi_tau: 1e-8
    - local_steps: 10
    - learning_rate: 0.05
    - decay_factor: 1.  # decay factor of the learning rate
    - batch_size: 20
    - test_bsz: 20
    - malicious_factor: 4
    - use_cuda: True
    - input_dim: 3
    - sample_seed: 12345
    - test_ratio: 1.
    - loss_decay: 0.

FedScale gets blocked during training, so I added some logging in aggregator.py to monitor the running status as follows:

    def event_monitor(self):
        logging.info("Start monitoring events ...")

        while True:
            # Broadcast events to clients
            ...

            # Handle events queued on the aggregator
            elif len(self.sever_events_queue) > 0:
                client_id, current_event, meta, data = self.sever_events_queue.popleft()
                logging.info("Receive event {} from client {}".format(current_event, client_id))
                if current_event == events.UPLOAD_MODEL:
                    self.client_completion_handler(self.deserialize_response(data))
                    #
                    logging.info("Currently {}/{} clients has finished training".format(len(self.stats_util_accumulator), self.tasks_round))
                    if len(self.stats_util_accumulator) == self.tasks_round:
                        self.round_completion_handler()

                elif current_event == events.MODEL_TEST:
                    self.testing_completion_handler(client_id, self.deserialize_response(data))

                else:
                    logging.error(f"Event {current_event} is not defined")
                
            else:
                # execute every 100 ms
                time.sleep(0.1)

and I obtained the following logs:

...
(05-26) 14:13:28 INFO     [client.py:17] Start to train (CLIENT: 2167) ...
(05-26) 14:13:28 INFO     [aggregator.py:628] Receive event upload_model from client 4
(05-26) 14:13:28 INFO     [aggregator.py:632] Currently 92/100 clients has finished training
(05-26) 14:13:28 INFO     [client.py:188] Training of (CLIENT: 585) completes, {'clientId': 585, 'moving_loss': 4.0690016746521, 'trained_size': 200, 'success': True, 'utility': 751.3793800173086}
(05-26) 14:13:28 INFO     [client.py:17] Start to train (CLIENT: 2305) ...
(05-26) 14:13:28 INFO     [aggregator.py:628] Receive event upload_model from client 5
(05-26) 14:13:28 INFO     [aggregator.py:632] Currently 93/100 clients has finished training
(05-26) 14:13:29 INFO     [client.py:188] Training of (CLIENT: 92) completes, {'clientId': 92, 'moving_loss': 4.070298886299133, 'trained_size': 200, 'success': True, 'utility': 740.9585433536305}
(05-26) 14:13:29 INFO     [aggregator.py:628] Receive event upload_model from client 3
(05-26) 14:13:29 INFO     [aggregator.py:632] Currently 94/100 clients has finished training
(05-26) 14:13:29 INFO     [client.py:188] Training of (CLIENT: 2305) completes, {'clientId': 2305, 'moving_loss': 4.075016450881958, 'trained_size': 200, 'success': True, 'utility': 521.201964075297}
(05-26) 14:13:29 INFO     [aggregator.py:628] Receive event upload_model from client 5
(05-26) 14:13:29 INFO     [aggregator.py:632] Currently 95/100 clients has finished training
(05-26) 14:13:31 INFO     [client.py:188] Training of (CLIENT: 1607) completes, {'clientId': 1607, 'moving_loss': 3.9613564729690554, 'trained_size': 200, 'success': True, 'utility': 322.7957952931622}
Traceback (most recent call last):
  File "/mnt/gaodawei.gdw/miniconda3/envs/fedscale/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
    send_bytes(obj)
  File "/mnt/gaodawei.gdw/miniconda3/envs/fedscale/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/mnt/gaodawei.gdw/miniconda3/envs/fedscale/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/mnt/gaodawei.gdw/miniconda3/envs/fedscale/lib/python3.7/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
(05-26) 14:13:31 INFO     [client.py:188] Training of (CLIENT: 1821) completes, {'clientId': 1821, 'moving_loss': 3.9525305151939394, 'trained_size': 200, 'success': True, 'utility': 321.9902424194754}
(05-26) 14:13:31 INFO     [aggregator.py:628] Receive event upload_model from client 7
(05-26) 14:13:31 INFO     [aggregator.py:632] Currently 96/100 clients has finished training
(05-26) 14:13:31 INFO     [aggregator.py:628] Receive event upload_model from client 1
(05-26) 14:13:31 INFO     [aggregator.py:632] Currently 97/100 clients has finished training
(05-26) 14:13:33 INFO     [client.py:188] Training of (CLIENT: 137) completes, {'clientId': 137, 'moving_loss': 3.9952051639556885, 'trained_size': 200, 'success': True, 'utility': 643.8445367870678}
(05-26) 14:13:33 INFO     [aggregator.py:628] Receive event upload_model from client 2
(05-26) 14:13:33 INFO     [aggregator.py:632] Currently 98/100 clients has finished training
(05-26) 14:13:34 INFO     [client.py:188] Training of (CLIENT: 2167) completes, {'clientId': 2167, 'moving_loss': 4.092173397541046, 'trained_size': 200, 'success': True, 'utility': 821.8575235110941}
(05-26) 14:13:34 INFO     [aggregator.py:628] Receive event upload_model from client 4
(05-26) 14:13:34 INFO     [aggregator.py:632] Currently 99/100 clients has finished training

It seems like the server only gets 99 models from the clients and keeps waiting for the missing one.
I guess it may be related to the BrokenPipeError reported above?


So what can I do to deal with it?

ProcessGroupGloo error when running on more than one worker machine

Hi, I am trying to perform training based on the following config file for the femnist dataset. I can run the experiment using two virtual machines, one as the parameter server and the other as a worker. However, if I increase the number of workers to, say, two, I run into the following error (please see the next comment).

Any thought on this?

Incorrect download address for Open Images for detection dataset

The download address for the "Open Images for detection" dataset is incorrect, so the download of the dataset fails:

wget -O ${DIR}/open_images_detection.tar.gz https://fedscale.eecs.umich.edu/dataset/open_images_detection.tar.gz
echo "Dataset downloaded, now decompressing..."
tar -xf ${DIR}/open_images_detection.tar.gz -C ${DIR}
echo "Removing compressed file..."
rm -f ${DIR}/open_images_detection.tar.gz

client_data_mapping/train.csv Not Found

According to the example configuration file (e.g., Line 50 in core/evals/configs/openimage/conf.yml), we should have client_data_mapping/train.csv after downloading the OpenImage dataset by running

bash download.sh -A

under dataset. However, I cannot find the CSV file in either of these locations:

  • dataset/data/open_images;
  • dataset/open_images,

which leads to the following exception in the log file when I execute python manager.py submit configs/openimage/conf.yml under core/evals:

...
Traceback (most recent call last):
  File ".../FedScale/core/examples/poisoning_setting/customized_executor.py", line 25, in <module>
    executor.run()
  File ".../FedScale/core/examples/poisoning_setting/../../executor.py", line 129, in run
    self.training_sets, self.testing_sets = self.init_data()
  File ".../FedScale/core/examples/poisoning_setting/../../executor.py", line 103, in init_data
    train_dataset, test_dataset = init_dataset()
  File ".../FedScale/core/examples/poisoning_setting/../../fllibs.py", line 211, in init_dataset
    train_dataset = OpenImage(args.data_dir, dataset='train', transform=train_transform)
  File ".../FedScale/core/examples/poisoning_setting/../../utils/openimage.py", line 59, in __init__
    self.data, self.targets = self.load_file(self.path)
  File ".../FedScale/core/examples/poisoning_setting/../../utils/openimage.py", line 120, in load_file
    datas, labels = self.load_meta_data(os.path.join(self.processed_folder, 'client_data_mapping', self.data_file+'.csv'))
  File ".../FedScale/core/examples/poisoning_setting/../../utils/openimage.py", line 106, in load_meta_data
    with open(path) as csv_file:
FileNotFoundError: [Errno 2] No such file or directory: '.../FedScale/dataset/open_images/client_data_mapping/train.csv'

Actually, what is in dataset/open_images is only:

$ ls dataset/open_images/
classTags  clientDataMap  test  train

and what is in dataset/data/open_images is only:

$ ls dataset/data/open_images
clientDataMap  README.md

I therefore wonder if the download logic has changed and where I can find the train.csv. Could you please help me with this issue?

BatchNorm params not being collected / averaged properly across clients

The code uses model.parameters() to get all the learnable parameters from the clients and perform aggregation, e.g., in:

model_param = [param.data.cpu().numpy() for param in model.parameters()]
However, model.parameters() does not include BN statistics such as running_mean, running_var, or num_batches_tracked, so I think the code is currently not aggregating those statistics across clients properly (please correct me if I am wrong).
To include BN statistics, you can instead use, e.g., model.state_dict().items().
In my past experience, not aggregating BN statistics leads to poor accuracy, and you at least need to average them (how to handle BN stats correctly is still an ongoing area of research).
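
For illustration, a minimal sketch (not FedScale's actual aggregation code) of averaging full state_dicts so the BN buffers are included:

import copy
import torch

def average_state_dicts(client_states):
    """Average a list of state_dicts element-wise, including BN buffers such as
    running_mean / running_var (integer buffers like num_batches_tracked are
    averaged in float and cast back to their original dtype)."""
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        stacked = torch.stack([cs[key].float() for cs in client_states])
        avg[key] = stacked.mean(dim=0).to(avg[key].dtype)
    return avg

# usage sketch:
# global_model.load_state_dict(average_state_dicts([m.state_dict() for m in client_models]))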

I found this while struggling to make the FEMNIST benchmark work nicely.
Have you run FEMNIST before and gotten decent accuracy with this codebase? If so, I might be misunderstanding how to use your code -- in that case, I would love to know exactly what config you used.
Otherwise, I think fixing this bug might solve my issue and #49.

Confusion on the Calculation of Model Update Size

At Line 228 of core/aggregator.py, the run function calculates the model size in kbits.

I was wondering why the divisor in the expression is 1024 instead of 1000. According to my understanding, a bit rate is typically expressed with an SI prefix (so 1 kbit/s = 1000 bit/s) rather than a binary prefix (1 Kibit/s = 1024 bit/s).

  • If the client bandwidth information used in Oort and FedScale is in kbps, which actually means kbit/s, such an inconsistency skews the results away from reality. (It doesn't mean your results make no sense, as we can view it as scaling up all clients' bandwidth by a factor of 1024/1000, which is very close to 1 and presumably has a negligible side effect.)
  • On the other hand, if the client bandwidth information you used was actually in Kibit/s, then surely nothing is wrong at all.

Anyway, you may consider changing the divisor to 1000 if the common consensus is that a bit rate is expressed with an SI prefix. This would be particularly helpful if a user wants to plug in their own capacity dataset.
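
A small sketch of the suggested computation (a hypothetical helper, not the exact expression in aggregator.py), using the SI prefix so that 1 kbit = 1000 bit:

def model_update_size_kbit(model) -> float:
    """Size of one model update in kbit, assuming 32-bit floats per parameter."""
    num_params = sum(p.numel() for p in model.parameters())
    return num_params * 32 / 1000.0  # SI prefix: divide by 1000, not 1024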

KeyError: 'get_server_event_que[X]' in executor.py

Description

The server_event_queue in executor.py throws a KeyError because 'get_server_event_que' + str(self.this_rank) has not been registered on self.control_manager for the second executor and beyond:

self.server_event_queue = eval('self.control_manager.get_server_event_que'+str(self.this_rank)+'()')

Environment to reproduce:

OS: Ubuntu 18.03 with bash as the default shell
Running on CPU with more than one worker

Logs

Logs and Error message:

(11-20) 02:28:59 INFO     [executor.py:142] Started GRPC server at [::]:50002
(11-20) 02:28:59 INFO     [executor.py:160] Successfully connect to the aggregator
Traceback (most recent call last):
  File "/home/ubuntu/FedScale/core/executor.py", line 349, in <module>
    executor.run()
  File "/home/ubuntu/FedScale/core/executor.py", line 204, in run
    self.setup_communication()
  File "/home/ubuntu/FedScale/core/executor.py", line 112, in setup_communication
    self.init_control_communication(self.args.ps_ip, self.args.manager_port)
  File "/home/ubuntu/FedScale/core/executor.py", line 162, in init_control_communication
    self.server_event_queue = eval('self.control_manager.get_server_event_que'+str(self.this_rank)+'()')
  File "<string>", line 1, in <module>
  File "/home/ubuntu/anaconda3/envs/fedscale/lib/python3.6/multiprocessing/managers.py", line 662, in temp
    token, exp = self._create(typeid, *args, **kwds)
  File "/home/ubuntu/anaconda3/envs/fedscale/lib/python3.6/multiprocessing/managers.py", line 556, in _create
    id, exposed = dispatch(conn, None, 'create', (typeid,)+args, kwds)
  File "/home/ubuntu/anaconda3/envs/fedscale/lib/python3.6/multiprocessing/managers.py", line 82, in dispatch
    raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError: 
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/fedscale/lib/python3.6/multiprocessing/managers.py", line 195, in handle_request
    result = func(c, *args, **kwds)
  File "/home/ubuntu/anaconda3/envs/fedscale/lib/python3.6/multiprocessing/managers.py", line 353, in create
    self.registry[typeid]
KeyError: 'get_server_event_que2'
---------------------------------------------------------------------------

Solution

The manager.py separates the worker commands using ;. In a Linux environment with bash as the default shell, the ; is interpreted as the end of a command, which causes a problem when running executor.py over ssh.

Overall Test Accuracy Calculation and Test Datasets

I have seen studies that use the aggregated model to evaluate accuracy over a test dataset. However, in aggregator.py, the overall test accuracy is calculated by averaging each individual client's test accuracy:

FedScale/core/aggregator.py

Lines 398 to 399 in 33de86d

for key in accumulator:
    accumulator[key] += self.test_result_accumulator[i][key]

I am wondering whether these two approaches end up with the same result. Have you seen any other references that use the same approach for calculating overall accuracy?

In addition, regardless of which participating clients are chosen for training in each round (epoch), test accuracy is always evaluated on the data partition indexed by the executor's rank id (self.this_rank):

data_loader = select_dataset(self.this_rank, self.testing_sets, batch_size=args.test_bsz, isTest=True, collate_fn=self.collate_fn)

I am wondering if these factors might have some effect on the low accuracy for the femnist experiment related to #57 and #49? Can you provide any references for the above cases?
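
For comparison, a minimal sketch of the first approach, i.e., evaluating the aggregated model directly on one held-out test set; it assumes a plain PyTorch model and a single test loader, not FedScale's executor API:

import torch

@torch.no_grad()
def centralized_accuracy(global_model, test_loader, device="cpu"):
    """Evaluate the aggregated (global) model on one held-out test set,
    instead of averaging the per-client accuracies."""
    global_model.eval().to(device)
    correct, total = 0, 0
    for x, y in test_loader:
        preds = global_model(x.to(device)).argmax(dim=1)
        correct += (preds == y.to(device)).sum().item()
        total += y.size(0)
    return correct / max(total, 1)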

Inconsistent gradient_policy variable values in example config files and core files

In the given example config files (e.g., core/evals/configs/openimage/conf.yml), the comments suggest values for the gradient_policy variable such as fed-avg, fed-prox, and fed-yogi, while different values are used in optimizer.py. For instance:

elif self.mode =='qfedavg':

if conf.gradient_policy == 'prox':

if self.mode == 'yogi':

It is also mentioned that the default value for gradient_policy is fed-avg; however, it does not seem to be defined anywhere. Clarification is appreciated.
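
A minimal sketch of one way to reconcile the two naming schemes (a hypothetical helper, not the current optimizer.py logic):

# Names used in the example config comments -> names checked in optimizer.py.
GRADIENT_POLICY_ALIASES = {
    "fed-avg": "fedavg",
    "fed-prox": "prox",
    "fed-yogi": "yogi",
    "q-fedavg": "qfedavg",
}

def normalize_gradient_policy(value):
    """Map a config-file policy name to the internal one, defaulting to FedAvg."""
    return GRADIENT_POLICY_ALIASES.get(value, "fedavg") if value else "fedavg"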

SentencePiece not Found in the Environment

When I ran Albert on the Reddit dataset, I came across the following exception:

Traceback (most recent call last):
  File ".../FedScale/core/aggregator.py", line 509, in <module>
    aggregator.run()
  File ".../FedScale/core/aggregator.py", line 225, in run
    self.model = self.init_model()
  File ".../FedScale/core/aggregator.py", line 132, in init_model    return init_model()
  File ".../FedScale/core/fllibs.py", line 83, in init_model
    tokenizer = AlbertTokenizer.from_pretrained(args.model, do_lower_case=True)
  File ".../anaconda3/envs/fedscale/lib/python3.6/site-packages/transformers/utils/dummy_sentencepiece_objects.py", line 11, in from_pretrained
    requires_backends(cls, ["sentencepiece"])
  File ".../anaconda3/envs/fedscale/lib/python3.6/site-packages/transformers/file_utils.py", line 612, in requires_backends
    raise ImportError("".join([BACKENDS_MAPPING[backend][1].format(name) for backend in backends]))
ImportError: 
AlbertTokenizer requires the SentencePiece library but it was not found in your environment. Checkout the instructions on the
installation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones
that match your environment.

which was resolved after I ran

pip install sentencepiece

in the fedscale environment. I am thus wondering if you would like to add this dependency to environment.yml to better support users.
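
Besides adding it to environment.yml, a tiny guard along these lines (purely a sketch) would also make the failure mode clearer before the tokenizer is built:

# Fail early with an actionable message if the optional backend is missing.
try:
    import sentencepiece  # noqa: F401
except ImportError as err:
    raise ImportError(
        "AlbertTokenizer requires SentencePiece; run `pip install sentencepiece` "
        "inside the fedscale environment."
    ) from err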

Running FEMNIST tutorial on local machine gives a few warnings.

  1. [W ParallelNative.cpp:229] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
  2. /Users/mosharaf/opt/anaconda3/envs/fedscale/lib/python3.7/site-packages/torchvision/transforms/functional_pil.py:42: DeprecationWarning: FLIP_LEFT_RIGHT is deprecated and will be removed in Pillow 10 (2023-07-01). Use Transpose.FLIP_LEFT_RIGHT instead. return img.transpose(Image.FLIP_LEFT_RIGHT)

Training seems to continue.

Delaying worker initializations

When running on a single host, around 10 of the worker initializations made in the following line fail. This is possibly due to exceeding the number of new concurrent connections allowed by sshd (MaxStartups).

subprocess.Popen(f'ssh {submit_user}{worker} "{setup_cmd} {worker_cmd}"',

Waiting for the previous connections to succeed, just like it is done for the PS, has resolved this issue for me.

time.sleep(3)

This issue might not surface in multi-node setups due to network latencies inherently delaying requests for new connections, as opposed to single-node setups where the request never leaves the node (i.e. when relayed to 127.0.0.1).

Please consider adding a delay before initiating the next connection.
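
A minimal sketch of the proposed pacing (the hosts, user, and commands below are placeholders, not FedScale's actual manager.py code):

import subprocess
import time

# Placeholder values for illustration only.
submit_user = "ubuntu@"
setup_cmd = "source ~/anaconda3/bin/activate fedscale &&"
workers = ["127.0.0.1", "127.0.0.1", "127.0.0.1"]

for rank, worker in enumerate(workers, start=1):
    worker_cmd = f"python executor.py --this_rank={rank}"
    subprocess.Popen(f'ssh {submit_user}{worker} "{setup_cmd} {worker_cmd}"', shell=True)
    time.sleep(3)  # pace new ssh connections so sshd's MaxStartups is not exceeded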

Add additional commands in manager.py

For example, status or job-status to see the status of the current job; view-config to see the configuration file used. We can discuss here to come up with a list of common things someone may need.

Need a better structure for eval config files (yml), especially job_conf

It's a completely flat structure under job_conf, which makes it hard to follow and organize. A better hierarchical organization will allow us, for example, to separate aggregator-related params from worker-related ones, job-related params from trace-related params, etc.

Sampler Choice

Should the samplers be "random" and "oort", instead of "random" and "kuiper" as mentioned in the file?

FedScale/core/aggregator.py

Lines 129 to 142 in fe2eb88

def init_client_manager(self, args):
    """
    Currently we implement two client managers:
    1. Random client sampler
        - it selects participants randomly in each round
        - [Ref]: https://arxiv.org/abs/1902.01046
    2. Kuiper sampler
        - Kuiper prioritizes the use of those clients who have both data that offers the greatest utility
          in improving model accuracy and the capability to run training quickly.
        - [Ref]: https://arxiv.org/abs/2010.06081
    """
    # sample_mode: random or kuiper
    client_manager = clientManager(args.sample_mode, args=args)

Legacy in Tutorial

We notice there are some legacy issues in this tutorial. For example:

  • Please clone from this repo instead of AmberLJC/FedScale.git;
  • from fedscale instead of from FedScale?
  • Typo in Partition the dataset by the clientclient_data_mapping file

Can you please try to update it? Thanks!

Pull mechanism for client-server communication

FedScale is currently built under the assumption that the central server communicates with clients using a push mechanism, where the server initiates signals (handshake/training/notTraining/etc.) to available devices. Is it possible to consider a pull-based system where the device initiates the communication by periodically sending requests to the server to ask for its next action? In a realistic setting, the device would ping the server when its training criteria are met (enough data, sufficient battery, app open, etc.), and the server would respond with the model gradients for federated training if the client is selected.
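
A rough sketch of what such a pull-based client loop could look like (every name here is hypothetical; FedScale exposes no such API today):

import random
import time

def training_criteria_met() -> bool:
    # e.g., enough local data, sufficient battery, device idle/charging, app open
    return random.random() < 0.5

def poll_server(client_id: int) -> dict:
    # placeholder for an RPC such as stub.PollTask(client_id); the server decides
    # whether this client is selected for the current round
    return {"action": "train"} if random.random() < 0.2 else {"action": "wait"}

def client_loop(client_id: int, poll_interval_s: float = 60.0) -> None:
    while True:
        if training_criteria_met():
            task = poll_server(client_id)
            if task["action"] == "train":
                # download the global model, run local steps, upload the update
                pass
        time.sleep(poll_interval_s)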

install.sh issues

The current install.sh assumes a lot of things. Until we have a very sophisticated version that auto-detects OS, GPU, etc. and does a lot of the heavy lifting, a slightly improved approach could be to give specific instructions for what to do, where

  1. the user installs Anaconda on their own;
  2. then sets up the conda environment using a yml;
  3. then optionally installs CUDA

environment.yml also has too many things to install that may only be needed for specific dataset/model combinations. Instead of forcing people to install everything, we can make dataset- or model-specific pieces part of the dataset download.sh step, which will update the conda environment. This will avoid wholesale failure.

Backstory: I was trying to install FedScale on a pre-m1 macbook pro.

low accuracy

time_to_acc.pdf
I have run the femnist task with the following setting, but the accuracy has not changed after about 700 rounds. The training loss is around 15; any suggestions?

ps_ip: 127.0.0.1
worker_ips: 
- 127.0.0.1:[2] # worker_ip: [(# processes on gpu) for gpu in available_gpus] eg. 10.0.0.2:[4,4,4,4] This node has 4 gpus
exp_path: /opt/shared/Loss/newFedScale/core

executor_entry: executor.py
aggregator_entry: aggregator.py


auth:
     ssh_user: ""
    ssh_private_key: ~/.ssh/id_rsa
setup_commands:
    - source /opt/shared/anaconda3/bin/activate fedscale
    - export NCCL_SOCKET_IFNAME='enp94s0f0'         
job_conf: 
    - job_name: femnist                   # Generate logs under this folder: log_path/job_name/time_stamp
    - log_path: /opt/shared/Loss/newFedScale/core/evals # Path of log files
    - task: cv
    - total_worker: 17                    # Number of participants per round, we use K=100 in our paper, large K will be much slower
    - data_set: femnist                     # Dataset: openImg, google_speech, stackoverflow
    - data_dir: /opt/shared/Loss/newFedScale/dataset/data/femnist    # Path of the dataset
    - data_map_file: /opt/shared/Loss/newFedScale/dataset/data/femnist/client_data_mapping/train.csv              
    - device_conf_file: /opt/shared/Loss/newFedScale/dataset/data/device_info/client_device_capacity     # Path of the client trace
    - device_avail_file: /opt/shared/Loss/newFedScale/dataset/data/device_info/client_behave_trace
    - model: shufflenet_v2_x2_0                            # Models: e.g., shufflenet_v2_x2_0, mobilenet_v2, resnet34, albert-base-v2
    - gradient_policy: fed-avg                 # {"fed-yogi", "fed-prox", "fed-avg"}, "fed-avg" by default
    - eval_interval: 5                     # How many rounds to run a testing on the testing set
    - epochs: 1000                           # Number of rounds to run this training. We use 1000 in our paper, while it may converge w/ ~400 rounds
    - filter_less: 21                       # Remove clients w/ less than 21 samples
    - num_loaders: 2
    - yogi_eta: 3e-5 
    - yogi_tau: 1e-8
    - local_steps: 10
    - learning_rate: 0.05
    - batch_size: 2
    - test_bsz: 20

Dataloader segmentation fault

mosharaf@Mosharafs-MacBook-Pro:~/Work/FedScale/dataset|master ⇒  python
Python 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 05:57:50)
[Clang 11.1.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from torch.utils.data import DataLoader
[1]    26092 segmentation fault  python

FileNotFoundError in load_global_model function

Description

Occasionally, a FileNotFoundError is thrown for temp_model_path in the load_global_model function of executor.py:

FedScale/core/executor.py

Lines 238 to 242 in 33de86d

def load_global_model(self):
    # load last global model
    with open(self.temp_model_path, 'rb') as model_in:
        model = pickle.load(model_in)
    return model
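
One possible mitigation (a sketch only, and it does not address the underlying race between writer and reader) is to wait briefly for the file to appear before loading it:

import os
import pickle
import time

def load_global_model_with_wait(temp_model_path, timeout_s=30.0, poll_s=0.5):
    """Wait for the temp model file to be written before loading it,
    instead of failing immediately with FileNotFoundError."""
    deadline = time.time() + timeout_s
    while not os.path.exists(temp_model_path) and time.time() < deadline:
        time.sleep(poll_s)
    with open(temp_model_path, 'rb') as model_in:
        return pickle.load(model_in)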

Environment

OS: Ubuntu 18.03 with bash as the default shell
Running on CPU with more than one worker

Error logs

(11-20) 15:03:22 INFO     [executor.py:326] Executor 2: Received (Event:TRAIN) from aggregator
Traceback (most recent call last):
  File "/home/ubuntu/FedScale/core/executor.py", line 349, in <module>
    executor.run()
  File "/home/ubuntu/FedScale/core/executor.py", line 205, in run
    self.event_monitor()
  File "/home/ubuntu/FedScale/core/executor.py", line 332, in event_monitor
    train_res = self.training_handler(clientId=clientId, conf=client_conf)
  File "/home/ubuntu/FedScale/core/executor.py", line 266, in training_handler
    client_model = self.load_global_model()
  File "/home/ubuntu/FedScale/core/executor.py", line 240, in load_global_model
    with open(self.temp_model_path, 'rb') as model_in:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/FedScale/core/evals/logs/femnist/1120_150258/executor/model_3.pth.tar'
Traceback (most recent call last):
  File "/home/ubuntu/FedScale/core/executor.py", line 349, in <module>
    executor.run()
  File "/home/ubuntu/FedScale/core/executor.py", line 205, in run
    self.event_monitor()
  File "/home/ubuntu/FedScale/core/executor.py", line 332, in event_monitor
    train_res = self.training_handler(clientId=clientId, conf=client_conf)
  File "/home/ubuntu/FedScale/core/executor.py", line 266, in training_handler
    client_model = self.load_global_model()
  File "/home/ubuntu/FedScale/core/executor.py", line 240, in load_global_model
    with open(self.temp_model_path, 'rb') as model_in:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/FedScale/core/evals/logs/femnist/1120_150258/executor/model_2.pth.tar'
(11-20) 15:03:23 INFO     [executor.py:326] Executor 1: Received (Event:TRAIN) from aggregator

Config file:

ps_ip: 10.30.72.19

worker_ips: 
    - 10.30.72.9:[1]
    - 10.30.72.29:[1]
    - 10.30.72.30:[1]
    # - 10.30.72.31:[1]
    # - 10.30.72.32:[1]
    # - 10.30.72.33:[1]
    # - 10.30.72.34:[1]

exp_path: $HOME/FedScale/core

executor_entry: executor.py

aggregator_entry: aggregator.py

auth:
    ssh_user: ""
    ssh_private_key: ~/.ssh/id_rsa

setup_commands:
    - source $HOME/anaconda3/bin/activate fedscale
    - export NCCL_SOCKET_IFNAME='eth0' 

job_conf: 
    - job_name: femnist
    - log_path: $HOME/FedScale/core/evals
    - total_worker: 4
    - data_set: femnist
    - data_dir: $HOME/FedScale/dataset/data/femnist
    - data_map_file: $HOME/FedScale/dataset/data/femnist/client_data_mapping/train.csv
    - gradient_policy: yogi
    - eval_interval: 30
    - epochs: 3
    - filter_less: 21
    - num_loaders: 1
    - yogi_eta: 3e-3 
    - yogi_tau: 1e-8
    - local_steps: 20
    - learning_rate: 0.05
    - batch_size: 20
    - test_bsz: 20
    - malicious_factor: 4

$FEDSCALE_HOME / $FS_HOME environment variable

In many places, FedScale uses $HOME, assuming that the user has installed FedScale in their $HOME directory. Instead, it should define a $FEDSCALE_HOME variable (or something like that) and build paths relative to it, to avoid scripts silently failing.
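
A minimal sketch of what the scripts could do instead (the variable name FEDSCALE_HOME is the proposal here, not something FedScale currently defines):

import os

FEDSCALE_HOME = os.environ.get("FEDSCALE_HOME")
if FEDSCALE_HOME is None:
    raise EnvironmentError(
        "Please set FEDSCALE_HOME to the root of your FedScale checkout"
    )

# Build paths relative to the checkout instead of assuming $HOME/FedScale.
data_dir = os.path.join(FEDSCALE_HOME, "dataset", "data", "femnist")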

Basic tutorial shouldn't use openimage dataset

It's too large for anyone to have the patience for; it may even be prohibitive in terms of disk space on someone's laptop. The tutorial should be based on a smaller dataset like FEMNIST.

There is also a lot of repetition between the different tutorials, which doesn't make sense.

Request for a Complete Version of AI Benchmark

According to my observation, the client system behavior dataset that FedScale's examples currently rely on, i.e., the file dataset/data/client_device_capacity, is a pickled Python dictionary containing 500,000 keys, each of which represents a client and is only associated with two values: an inference speed and a network bandwidth. In other words, it is both model-agnostic and data-agnostic when it comes to estimating the training speed: no matter (1) how large the ML model is or (2) how large a data sample is, the estimated time for training over a mini-batch always stays the same as long as the batch size does not vary, which is counterfactual.

Hence, I am thinking that client_device_capacity is merely a small subset of the AI Benchmark dataset that FedScale claims to embrace. According to Sec. 3.2 in the paper,

AI Benchmark provides the training and inference speed of diverse models (e.g., MobileNet) across a wide range of device models.

I hereby request a complete version of AI Benchmark to gain more confidence in the time-related metrics reported by FedScale. Looking forward to your help!
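
For reference, a sketch of how one can inspect the trace described above (the path follows the configs earlier on this page; the exact per-client field layout is an assumption):

import pickle

with open("dataset/data/device_info/client_device_capacity", "rb") as f:
    capacity = pickle.load(f)

print(len(capacity))       # roughly 500,000 client entries
first_id = next(iter(capacity))
print(capacity[first_id])  # two values per client, e.g. compute speed and bandwidth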

Warning: Leaking Caffe2 thread-pool after fork

Hi Authors,

I noticed that environment.yml underwent a change last month. In particular, the previous version required torch==1.1.0 and torchvision==0.4.0. As torchvision==0.4.0 actually requires torch 1.2.0, an environment installation based on that version of environment.yml ends up with

torch==1.2.0
torchvision==0.4.0

In the latest version, on the other hand, the versions of torch and torchvision are not pinned, so they resolve to the current stable release. As of today, an installation based on this version of environment.yml yields

torch==1.9.0+cu102
torchvision==0.10.0+cu102

It would definitely be great to see FedScale run on the latest versions of torch and torchvision. However, in practice, I found that the latest torch and/or torchvision produce many, many warning messages like

[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)

in the log file whenever I perform a training task. This is quite space-consuming, and my current workaround is to downgrade torch and torchvision to

torch==1.2.0
torchvision==0.4.0

This is inspired by the related discussion on this site. I am curious whether you have a more decent workaround for this issue. Also, let me know if you think my solution is a general one, and I can create/extend a corresponding PR ;)

Missing FEMNIST dataset client data mapping files

When trying to run experiments using the femnist dataset, a problem arises because the following files are missing:

  • train_img_to_path
  • train_img_to_target
  • test_img_to_path
  • test_img_to_target

    with open(os.path.join(path, 'train_img_to_path'), 'rb') as f:
        rawImg = pickle.load(f)
        rawPath = pickle.load(f)
    with open(os.path.join(path, 'train_img_to_target'), 'rb') as f:
        rawTags = pickle.load(f)
else:
    with open(os.path.join(path, 'test_img_to_path'), 'rb') as f:
        rawImg = pickle.load(f)
        rawPath = pickle.load(f)
    with open(os.path.join(path, 'test_img_to_target'), 'rb') as f:
        rawTags = pickle.load(f)

ClientOptimizer.update_client_weight arguments inconsistency

In client.py, the update_client_weight function is called with conf.gradient_policy as its first argument.

self.optimizer.update_client_weight(conf.gradient_policy, model, global_model if global_model is not None else None)

However, in the update_client_weight definition, the conf parameter is expected to be the full configuration object:

def update_client_weight(self, conf, model, global_model=None):
    if self.mode == 'prox':
        for idx, param in enumerate(model.parameters()):
            param.data += conf.learning_rate * conf.proxy_mu * (param.data - global_model[idx])
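
A minimal sketch of one way to make the call site and the signature agree (this is a suggestion, not necessarily how the maintainers resolved it): pass the full conf object and keep the policy on the optimizer itself.

class ClientOptimizer:
    def __init__(self, mode: str):
        self.mode = mode  # e.g. 'prox', set once from conf.gradient_policy

    def update_client_weight(self, conf, model, global_model=None):
        # conf is the full config object here, so conf.learning_rate and
        # conf.proxy_mu resolve as the body expects.
        if self.mode == 'prox' and global_model is not None:
            for idx, param in enumerate(model.parameters()):
                param.data += conf.learning_rate * conf.proxy_mu * \
                              (param.data - global_model[idx])

# the call site would then be:
# self.optimizer.update_client_weight(conf, model, global_model)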

Inconsistency in the dataset directory

  1. The README says 20 datasets, the download script has 16 or so, the data directory has 15 or 16.
  2. Naming of the datasets is inconsistent too; e.g., iNature vs. iNaturalist.
  3. Using a single letter per dataset in the download script is also confusing and short-sighted; there may be more datasets than letters in the alphabet. A better convention would be --dataset-name.

Bug in qfedavg

hs += (qfedq * np.float_power(loss+1e-10, (qfedq-1)) * torch.norm(grads, 2) + (1.0/learning_rate) * np.float_power(loss+1e-10, qfedq))

I am not sure why, but I got the following error when running this line:

AttributeError: 'list' object has no attribute 'dim'

And I changed it to:
hs += (qfedq * np.float_power(loss+1e-10, (qfedq-1)) * torch.norm(torch.stack(([torch.norm( g, 2) for g in grads]))) + (1.0/learning_rate) * np.float_power(loss+1e-10, qfedq))
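
For what it's worth, the error happens because grads is a Python list of per-parameter tensors, which torch.norm cannot consume directly; flattening and concatenating first gives the same global L2 norm as the stacked-norms fix above. A sketch:

import torch

def global_l2_norm(grads):
    """L2 norm over a list of per-parameter gradient tensors."""
    return torch.norm(torch.cat([g.reshape(-1) for g in grads]), p=2)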

Add Tensorflow support

Ideal solution would be native support. If that's too difficult, as a first step, we should add a tutorial.

Documentation site

Creating a placeholder to keep track of documentation status. There is no other obvious place.
