symbioticlab / fedscale
FedScale is a scalable and extensible open-source federated learning (FL) platform.
Home Page: https://fedscale.ai
License: Apache License 2.0
I want to ask which model is used for the next-word-prediction training on the stackoverflow and reddit datasets? Is it in the models.py file?
Thanks a lot~
Hi, thanks for building this FL library!
I'm trying to run a toy example with FEMNIST. However, the program reports a gRPC error as follows:
E0526 04:57:27.528504823 23761 chttp2_transport.cc:1115] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E0526 04:57:27.528580558 23761 chttp2_transport.cc:1115] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E0526 04:57:27.528625026 23761 client_channel.cc:691] chand=0x559f9651a5c8: Illegal keepalive throttling value 9223372036854775807
(05-26) 04:58:28 INFO [client.py:188] Training of (CLIENT: 836) completes, {'clientId': 836, 'moving_loss': 4.184056043624878, 'trained_size': 400, 'success': True, 'utility': 857.4471839768071}
Traceback (most recent call last):
File "/mnt/gaodawei.gdw/FedScale/fedscale/core/executor.py", line 346, in <module>
executor.run()
File "/mnt/gaodawei.gdw/FedScale/fedscale/core/executor.py", line 141, in run
self.event_monitor()
File "/mnt/gaodawei.gdw/FedScale/fedscale/core/executor.py", line 318, in event_monitor
client_id, train_res = self.Train(train_config)
File "/mnt/gaodawei.gdw/FedScale/fedscale/core/executor.py", line 172, in Train
meta_result = None, data_result = None
File "/mnt/gaodawei.gdw/miniconda3/envs/fedscale/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/mnt/gaodawei.gdw/miniconda3/envs/fedscale/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Broken pipe"
debug_error_string = "{"created":"@1653541108.923041466","description":"Error received from peer ipv4:0.0.0.0:1000","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Broken pipe","grpc_status":14}"
It seems like there are too many pings for the port?
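For context, the too_many_pings GOAWAY is gRPC's keepalive throttling kicking in. Below is a minimal sketch of the standard gRPC channel options that usually relax it; the target address is a placeholder and these are plain gRPC knobs, not FedScale settings.

    import grpc

    # Plain gRPC keepalive knobs (not FedScale options); values are illustrative.
    options = [
        ('grpc.keepalive_time_ms', 60000),           # ping at most once a minute
        ('grpc.keepalive_timeout_ms', 20000),        # wait 20 s for the ping ack
        ('grpc.keepalive_permit_without_calls', 0),  # no pings on idle connections
        ('grpc.http2.max_pings_without_data', 2),    # stop pinging without payload
    ]
    channel = grpc.insecure_channel('0.0.0.0:50051', options=options)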
Here is my yaml:
# Configuration file of FAR training experiment
# ========== Cluster configuration ==========
# ip address of the parameter server (need 1 GPU process)
# ps_ip: 10.0.0.1
ps_ip: 0.0.0.0
# ip address of each worker:# of available gpus process on each gpu in this node
# Note that if we collocate ps and worker on same GPU, then we need to decrease this number of available processes on that GPU by 1
# E.g., master node has 4 available processes, then 1 for the ps, and worker should be set to: worker:3
worker_ips:
- 0.0.0.0:[4,4]
# - 10.0.0.1:[4] # worker_ip: [(# processes on gpu) for gpu in available_gpus] eg. 10.0.0.2:[4,4,4,4] This node has 4 gpus, each gpu has 4 processes.
exp_path: /mnt/gaodawei.gdw/FedScale/fedscale/core
# Entry function of executor and aggregator under $exp_path
executor_entry: executor.py
aggregator_entry: aggregator.py
auth:
ssh_user: "root"
ssh_private_key: ~/.ssh/id_rsa
# cmd to run before we can indeed run FAR (in order)
setup_commands:
- source /mnt/gaodawei.gdw/miniconda3/bin/activate fedscale
- export NCCL_SOCKET_IFNAME='enp94s0f0' # Run "ifconfig" to ensure the right NIC for nccl if you have multiple NICs
# ========== Additional job configuration ==========
# Default parameters are specified in argParser.py, wherein more description of the parameter can be found
job_conf:
- job_name: femnist # Generate logs under this folder: log_path/job_name/time_stamp
- log_path: /mnt/gaodawei.gdw/FedScale/evals # Path of log files
- total_worker: 2 # Number of participants per round, we use K=100 in our paper, large K will be much slower
- data_set: femnist # Dataset: openImg, google_speech, stackoverflow
- data_dir: /mnt/gaodawei.gdw/FedScale/dataset/data/femnist # Path of the dataset
- data_map_file: /mnt/gaodawei.gdw/FedScale/dataset/data/femnist/client_data_mapping/train.csv # Allocation of data to each client, turn to iid setting if not provided
- device_conf_file: /mnt/gaodawei.gdw/FedScale/dataset/data/device_info/client_device_capacity # Path of the client trace
- device_avail_file: /mnt/gaodawei.gdw/FedScale/dataset/data/device_info/client_behave_trace
- model: convnet2 # Models: e.g., shufflenet_v2_x2_0, mobilenet_v2, resnet34, albert-base-v2
- gradient_policy: fed-avg # {"fed-yogi", "fed-prox", "fed-avg"}, "fed-avg" by default
- eval_interval: 10 # How many rounds to run a testing on the testing set
- epochs: 50 # Number of rounds to run this training. We use 1000 in our paper, while it may converge w/ ~400 rounds
- filter_less: 0 # Remove clients w/ less than 21 samples
- num_loaders: 8
- yogi_eta: 3e-3
- yogi_tau: 1e-8
- local_steps: 20
- learning_rate: 0.01
- batch_size: 20
- test_bsz: 20
- malicious_factor: 4
- use_cuda: True
- decay_factor: 1. # decay factor of the learning rate
- sample_seed: 12345
- test_ratio: 1.
- loss_decay: 0.
- input_dim: 3
- overcommitment: 1.5
- use_cuda: False
It expects values in the .yml file and crashes if not provided. Every entry point in FedScale should use the same set of defaults.
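A minimal sketch of what sharing defaults could look like, assuming they keep living in argParser.py; the parameter names below are just examples, not the full list.

    # argParser.py-style sketch: every entry point imports the same parser,
    # so a missing .yml entry falls back to a default instead of crashing.
    import argparse

    parser = argparse.ArgumentParser(description='FedScale job defaults (sketch)')
    parser.add_argument('--gradient_policy', type=str, default='fed-avg')
    parser.add_argument('--eval_interval', type=int, default=10)
    parser.add_argument('--local_steps', type=int, default=20)
    parser.add_argument('--batch_size', type=int, default=20)
    parser.add_argument('--use_cuda', type=lambda v: v.lower() == 'true', default=False)
    args, _ = parser.parse_known_args()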
I'm trying to run the example of FEMNIST with a two-layer Conv network on a ubuntu server with 8 gpus.
The following is my yaml:
# Configuration file of FAR training experiment
# ========== Cluster configuration ==========
# ip address of the parameter server (need 1 GPU process)
ps_ip: 0.0.0.0
# ip address of each worker:# of available gpus process on each gpu in this node
# Note that if we collocate ps and worker on same GPU, then we need to decrease this number of available processes on that GPU by 1
# E.g., master node has 4 available processes, then 1 for the ps, and worker should be set to: worker:3
worker_ips:
- 0.0.0.0:[7] # worker_ip: [(# processes on gpu) for gpu in available_gpus] eg. 10.0.0.2:[4,4,4,4] This node has 4 gpus, each gpu has 4 processes.
exp_path: /mnt/gaodawei.gdw/FedScale/fedscale/core
# Entry function of executor and aggregator under $exp_path
executor_entry: executor.py
aggregator_entry: aggregator.py
auth:
ssh_user: "root"
ssh_private_key: ~/.ssh/id_rsa
# cmd to run before we can indeed run FAR (in order)
setup_commands:
- source /mnt/gaodawei.gdw/miniconda3/bin/activate fedscale
- export NCCL_SOCKET_IFNAME='enp94s0f0' # Run "ifconfig" to ensure the right NIC for nccl if you have multiple NICs
# ========== Additional job configuration ==========
# Default parameters are specified in argParser.py, wherein more description of the parameter can be found
job_conf:
- job_name: femnist # Generate logs under this folder: log_path/job_name/time_stamp
- log_path: /mnt/gaodawei.gdw/FedScale/evals # Path of log files
- total_worker: 100 # Number of participants per round, we use K=100 in our paper, large K will be much slower
- data_set: femnist # Dataset: openImg, google_speech, stackoverflow
- data_dir: /mnt/gaodawei.gdw/FedScale/dataset/data/femnist # Path of the dataset
- data_map_file: /mnt/gaodawei.gdw/FedScale/dataset/data/femnist/client_data_mapping/train.csv # Allocation of data to each client, turn to iid setting if not provided
- device_conf_file: /mnt/gaodawei.gdw/FedScale/dataset/data/device_info/client_device_capacity # Path of the client trace
- device_avail_file: /mnt/gaodawei.gdw/FedScale/dataset/data/device_info/client_behave_trace
- model: convnet2 # Models: e.g., shufflenet_v2_x2_0, mobilenet_v2, resnet34, albert-base-v2
- gradient_policy: fed-avg # {"fed-yogi", "fed-prox", "fed-avg"}, "fed-avg" by default
- eval_interval: 100 # How many rounds to run a testing on the testing set
- epochs: 400 # Number of rounds to run this training. We use 1000 in our paper, while it may converge w/ ~400 rounds
- rounds: 400
- filter_less: 0 # Remove clients w/ less than 21 samples
- num_loaders: 2
- yogi_eta: 3e-3
- yogi_tau: 1e-8
- local_steps: 10
- learning_rate: 0.05
- decay_factor: 1. # decay factor of the learning rate
- batch_size: 20
- test_bsz: 20
- malicious_factor: 4
- use_cuda: True
- input_dim: 3
- sample_seed: 12345
- test_ratio: 1.
- loss_decay: 0.
FedScale gets blocked during training, so I added some logging in aggregator.py to monitor the running status, as follows:
def event_monitor(self):
    logging.info("Start monitoring events ...")
    while True:
        # Broadcast events to clients
        ...
        # Handle events queued on the aggregator
        elif len(self.sever_events_queue) > 0:
            client_id, current_event, meta, data = self.sever_events_queue.popleft()
            logging.info("Receive event {} from client {}".format(current_event, client_id))
            if current_event == events.UPLOAD_MODEL:
                self.client_completion_handler(self.deserialize_response(data))
                #
                logging.info("Currently {}/{} clients has finished training".format(len(self.stats_util_accumulator), self.tasks_round))
                if len(self.stats_util_accumulator) == self.tasks_round:
                    self.round_completion_handler()
            elif current_event == events.MODEL_TEST:
                self.testing_completion_handler(client_id, self.deserialize_response(data))
            else:
                logging.error(f"Event {current_event} is not defined")
        else:
            # execute every 100 ms
            time.sleep(0.1)
and I obtain the following logs:
...
(05-26) 14:13:28 INFO [client.py:17] Start to train (CLIENT: 2167) ...
(05-26) 14:13:28 INFO [aggregator.py:628] Receive event upload_model from client 4
(05-26) 14:13:28 INFO [aggregator.py:632] Currently 92/100 clients has finished training
(05-26) 14:13:28 INFO [client.py:188] Training of (CLIENT: 585) completes, {'clientId': 585, 'moving_loss': 4.0690016746521, 'trained_size': 200, 'success': True, 'utility': 751.3793800173086}
(05-26) 14:13:28 INFO [client.py:17] Start to train (CLIENT: 2305) ...
(05-26) 14:13:28 INFO [aggregator.py:628] Receive event upload_model from client 5
(05-26) 14:13:28 INFO [aggregator.py:632] Currently 93/100 clients has finished training
(05-26) 14:13:29 INFO [client.py:188] Training of (CLIENT: 92) completes, {'clientId': 92, 'moving_loss': 4.070298886299133, 'trained_size': 200, 'success': True, 'utility': 740.9585433536305}
(05-26) 14:13:29 INFO [aggregator.py:628] Receive event upload_model from client 3
(05-26) 14:13:29 INFO [aggregator.py:632] Currently 94/100 clients has finished training
(05-26) 14:13:29 INFO [client.py:188] Training of (CLIENT: 2305) completes, {'clientId': 2305, 'moving_loss': 4.075016450881958, 'trained_size': 200, 'success': True, 'utility': 521.201964075297}
(05-26) 14:13:29 INFO [aggregator.py:628] Receive event upload_model from client 5
(05-26) 14:13:29 INFO [aggregator.py:632] Currently 95/100 clients has finished training
(05-26) 14:13:31 INFO [client.py:188] Training of (CLIENT: 1607) completes, {'clientId': 1607, 'moving_loss': 3.9613564729690554, 'trained_size': 200, 'success': True, 'utility': 322.7957952931622}
Traceback (most recent call last):
File "/mnt/gaodawei.gdw/miniconda3/envs/fedscale/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
send_bytes(obj)
File "/mnt/gaodawei.gdw/miniconda3/envs/fedscale/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/mnt/gaodawei.gdw/miniconda3/envs/fedscale/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header + buf)
File "/mnt/gaodawei.gdw/miniconda3/envs/fedscale/lib/python3.7/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
(05-26) 14:13:31 INFO [client.py:188] Training of (CLIENT: 1821) completes, {'clientId': 1821, 'moving_loss': 3.9525305151939394, 'trained_size': 200, 'success': True, 'utility': 321.9902424194754}
(05-26) 14:13:31 INFO [aggregator.py:628] Receive event upload_model from client 7
(05-26) 14:13:31 INFO [aggregator.py:632] Currently 96/100 clients has finished training
(05-26) 14:13:31 INFO [aggregator.py:628] Receive event upload_model from client 1
(05-26) 14:13:31 INFO [aggregator.py:632] Currently 97/100 clients has finished training
(05-26) 14:13:33 INFO [client.py:188] Training of (CLIENT: 137) completes, {'clientId': 137, 'moving_loss': 3.9952051639556885, 'trained_size': 200, 'success': True, 'utility': 643.8445367870678}
(05-26) 14:13:33 INFO [aggregator.py:628] Receive event upload_model from client 2
(05-26) 14:13:33 INFO [aggregator.py:632] Currently 98/100 clients has finished training
(05-26) 14:13:34 INFO [client.py:188] Training of (CLIENT: 2167) completes, {'clientId': 2167, 'moving_loss': 4.092173397541046, 'trained_size': 200, 'success': True, 'utility': 821.8575235110941}
(05-26) 14:13:34 INFO [aggregator.py:628] Receive event upload_model from client 4
(05-26) 14:13:34 INFO [aggregator.py:632] Currently 99/100 clients has finished training
It seems like the server only gets 99 models from the clients and continues to wait for the missing one. I guess it may be related to the BrokenPipeError reported above?
So what can I do to deal with it?
Of course, we need the implementations, but we also need a single config option to select the mode of training when submitting a job.
At executor.py we see:
Line 224 in fe2eb88
Line 223 in fe2eb88
So it seems indispensable to have Line 224, but I am not sure how it works. I tried removing Line 224, and the training slows down a lot. What exactly is going on with Line 224, and how is it related to Batch Normalization?
Include a README in the folder with the format of both traces so that people can use their own trace or contribute new traces.
Hi, I am trying to perform training based on the following config file for the femnist dataset. I can run the experiment using two virtual machines: one as a parameter server and the other as a worker. However, if I increase the number of workers, say to two workers, I run into the following error (please see the next comment).
Any thoughts on this?
In optimizer.py, the ClientOptimizer class does not have an attribute named mode; however, it is referenced in the update_client_weight function:
Line 57 in 26b83da
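A sketch of one possible shape of a fix, assuming the intent is for the optimizer to carry the policy it dispatches on; the constructor signature and the fed-prox branch here are guesses, not the actual patch.

    class ClientOptimizer:
        def __init__(self, mode='fed-avg'):
            self.mode = mode  # the attribute that update_client_weight dereferences

        def update_client_weight(self, conf, model, global_model=None):
            # Fall back to the job's gradient_policy if the conf carries one.
            mode = getattr(conf, 'gradient_policy', self.mode)
            if mode == 'fed-prox':
                pass  # apply the proximal correction against global_model here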
The address of the "Open Image for detection" dataset is not correct; hence, the download of the dataset fails:
Lines 194 to 200 in 5a691b9
According to the example configuration file (e.g., Line 50 in core/evals/configs/openimage/conf.yml), we should have client_data_mapping/train.csv after downloading the OpenImage dataset by running bash download.sh -A at dataset. However, I cannot find the csv file in either of the locations dataset/data/open_images or dataset/open_images, which led to the following exception in the log file when I try to execute python manager.py submit configs/openimage/conf.yml at core/evals:
...
Traceback (most recent call last):
File ".../FedScale/core/examples/poisoning_setting/customized_executor.py", line 25, in <module>
executor.run()
File ".../FedScale/core/examples/poisoning_setting/../../executor.py", line 129, in run
self.training_sets, self.testing_sets = self.init_data()
File ".../FedScale/core/examples/poisoning_setting/../../executor.py", line 103, in init_data
train_dataset, test_dataset = init_dataset()
File ".../FedScale/core/examples/poisoning_setting/../../fllibs.py", line 211, in init_dataset
train_dataset = OpenImage(args.data_dir, dataset='train', transform=train_transform)
File ".../FedScale/core/examples/poisoning_setting/../../utils/openimage.py", line 59, in __init__
self.data, self.targets = self.load_file(self.path)
File ".../FedScale/core/examples/poisoning_setting/../../utils/openimage.py", line 120, in load_file
datas, labels = self.load_meta_data(os.path.join(self.processed_folder, 'client_data_mapping', self.data_file+'.csv'))
File ".../FedScale/core/examples/poisoning_setting/../../utils/openimage.py", line 106, in load_meta_data
with open(path) as csv_file:
FileNotFoundError: [Errno 2] No such file or directory: '.../FedScale/dataset/open_images/client_data_mapping/train.csv'
Actually, what is in dataset/open_images is only:
$ ls dataset/open_images/
classTags clientDataMap test train
and what is in dataset/data/open_images is only:
$ ls dataset/data/open_images
clientDataMap README.md
I therefore wonder if the download logic has changed and where I can find the train.csv. Could you please help me with this issue?
The code is using model.parameters() to get all the learnable parameters from the clients and perform aggregation, e.g., in
Line 179 in 33de86d
model.parameters() does not include BN statistics such as running_mean, running_var, or num_batches_tracked, so I think the code is currently not aggregating those statistics across clients properly (please correct me if I am wrong); iterating over model.state_dict().items() instead would include them. I found this while I was struggling to make the FEMNIST benchmark work nicely.
Have you run FEMNIST before and got decent accuracy with this codebase? If so, I might be misunderstanding how to use your code -- in that case, I would love to know exactly what config you used.
Otherwise, I think fixing this bug might solve my issue and #49.
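To make the suggestion concrete, here is a small sketch (not FedScale's code) of averaging full state_dicts so the BN buffers travel with the weights:

    import torch
    import torch.nn as nn

    def average_state_dicts(client_states, weights):
        # Weighted average over every entry of state_dict(), which covers
        # running_mean / running_var / num_batches_tracked as well as weights.
        avg = {}
        for name in client_states[0]:
            avg[name] = sum(w * s[name].float() for s, w in zip(client_states, weights))
        return avg

    # Toy demo with two BatchNorm models and equal weights.
    models = [nn.BatchNorm1d(4), nn.BatchNorm1d(4)]
    global_state = average_state_dicts([m.state_dict() for m in models], [0.5, 0.5])

Loading the result back with model.load_state_dict(...) would then also refresh the BN statistics that an aggregation over model.parameters() skips.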
At Line 228 of file core/aggregator.py, the run function calculates the model size in kbits.
I was wondering why the divisor in the expression is 1024 instead of 1000? According to my understanding, a bit rate is typically expressed with an SI prefix (so 1 kbit/s = 1000 bit/s) rather than a binary prefix (where 1 Kibit/s = 1024 bit/s).
The bandwidth in the client traces used by Oort and FedScale is in kbps, which actually means kbit/s, so such an inconsistency pulls the results slightly away from reality. (It doesn't mean that your results are not making sense, as we can view it as scaling up all the clients' bandwidth by a factor of 1024/1000, which is close to 1, with a presumably negligible side effect; and if the trace is actually in Kibit/s, then surely nothing goes wrong at all.) Anyway, you may consider changing the divisor to 1000 if the common consensus is that a bit rate is expressed with an SI prefix. This is particularly helpful if a user wants to use their own capacity dataset.
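Purely to illustrate the arithmetic being discussed, with a stand-in model rather than FedScale's:

    import torch.nn as nn

    model = nn.Linear(10, 2)  # stand-in model just to have some parameters
    bits = sum(p.numel() * p.element_size() * 8 for p in model.parameters())
    print(bits / 1000)  # SI reading:     1 kbit  = 1000 bits
    print(bits / 1024)  # binary reading: 1 Kibit = 1024 bits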
The server_event_queue in executor.py throws a KeyError because 'get_server_event_que' + str(self.this_rank) has not been initialized on the control_manager for the second executor and the rest of them:
Line 162 in 501f91c
OS: Ubuntu 18.04 with bash as the default shell
Running on CPU with more than one worker
Logs and Error message:
(11-20) 02:28:59 INFO [executor.py:142] Started GRPC server at [::]:50002
(11-20) 02:28:59 INFO [executor.py:160] Successfully connect to the aggregator
Traceback (most recent call last):
File "/home/ubuntu/FedScale/core/executor.py", line 349, in <module>
executor.run()
File "/home/ubuntu/FedScale/core/executor.py", line 204, in run
self.setup_communication()
File "/home/ubuntu/FedScale/core/executor.py", line 112, in setup_communication
self.init_control_communication(self.args.ps_ip, self.args.manager_port)
File "/home/ubuntu/FedScale/core/executor.py", line 162, in init_control_communication
self.server_event_queue = eval('self.control_manager.get_server_event_que'+str(self.this_rank)+'()')
File "<string>", line 1, in <module>
File "/home/ubuntu/anaconda3/envs/fedscale/lib/python3.6/multiprocessing/managers.py", line 662, in temp
token, exp = self._create(typeid, *args, **kwds)
File "/home/ubuntu/anaconda3/envs/fedscale/lib/python3.6/multiprocessing/managers.py", line 556, in _create
id, exposed = dispatch(conn, None, 'create', (typeid,)+args, kwds)
File "/home/ubuntu/anaconda3/envs/fedscale/lib/python3.6/multiprocessing/managers.py", line 82, in dispatch
raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError:
---------------------------------------------------------------------------
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/fedscale/lib/python3.6/multiprocessing/managers.py", line 195, in handle_request
result = func(c, *args, **kwds)
File "/home/ubuntu/anaconda3/envs/fedscale/lib/python3.6/multiprocessing/managers.py", line 353, in create
self.registry[typeid]
KeyError: 'get_server_event_que2'
---------------------------------------------------------------------------
manager.py separates the workers using ;. In a Linux environment with bash set as the default shell, the ; is interpreted as the end of a command. This causes a problem when running executor.py over ssh.
I have seen studies that use the aggregated model to evaluate the accuracy over a test dataset. However, in aggregator.py, the overall test accuracy is calculated by averaging the test accuracy of each individual client:
Lines 398 to 399 in 33de86d
I am wondering if these two approaches will end in the same result? Have you seen any other references use the same approach for calculating overall accuracy?
In addition to this, regardless of the participating clients that are chosen for training in each round (epoch), test accuracy is always evaluated on the dataset with the executor rank id (self.this_rank):
Line 294 in 33de86d
I am wondering if these factors might have some effect on the low accuracy in the femnist experiment related to #57 and #49? Can you provide any references for the above cases?
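For what it's worth, a toy illustration with made-up numbers of how averaging per-client accuracies can differ from the pooled accuracy you would get by testing one aggregated model on the union of the test sets:

    client_correct = [48, 10, 90]   # correct predictions on each client's test set
    client_total = [50, 20, 100]

    mean_of_accs = sum(c / t for c, t in zip(client_correct, client_total)) / 3
    pooled_acc = sum(client_correct) / sum(client_total)
    print(mean_of_accs, pooled_acc)   # ~0.787 vs ~0.871: the two metrics disagree

Weighting each client's accuracy by its number of test samples would recover the pooled number.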
In the given example config files (e.g. core/evals/configs/openimage/conf.yml), the comments suggest values for the gradient_policy variable of fed-avg, fed-prox and fed-yogi, while different values are being used in the optimizer.py file. For instance:
Line 27 in fefd06d
Line 58 in fefd06d
Line 16 in fefd06d
It is also mentioned that the default value for gradient_policy is fed-avg; however, it does not seem this has been defined anywhere. Clarification is appreciated.
https://aml.engr.tamu.edu/book-dswe/dswe-datasets/
@dywsjtu can you look it up? Most of them are really small, but the Wind Spatio-Temporal Dataset (item 4) can be useful with 200 clients.
We can discuss here.
When I ran Albert atop reddit, I came across the following exception:
Traceback (most recent call last):
File ".../FedScale/core/aggregator.py", line 509, in <module>
aggregator.run()
File ".../FedScale/core/aggregator.py", line 225, in run
self.model = self.init_model()
File ".../FedScale/core/aggregator.py", line 132, in init_model return init_model()
File ".../FedScale/core/fllibs.py", line 83, in init_model
tokenizer = AlbertTokenizer.from_pretrained(args.model, do_lower_case=True)
File ".../anaconda3/envs/fedscale/lib/python3.6/site-packages/transformers/utils/dummy_sentencepiece_objects.py", line 11, in from_pretrained
requires_backends(cls, ["sentencepiece"])
File ".../anaconda3/envs/fedscale/lib/python3.6/site-packages/transformers/file_utils.py", line 612, in requires_backends
raise ImportError("".join([BACKENDS_MAPPING[backend][1].format(name) for backend in backends]))
ImportError:
AlbertTokenizer requires the SentencePiece library but it was not found in your environment. Checkout the instructions on the
installation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones
that match your environment.
which was resolved after I ran pip install sentencepiece in the fedscale environment. I am thus wondering if you would like to add this dependency to environment.yml to better support users.
[W ParallelNative.cpp:229] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
/Users/mosharaf/opt/anaconda3/envs/fedscale/lib/python3.7/site-packages/torchvision/transforms/functional_pil.py:42: DeprecationWarning: FLIP_LEFT_RIGHT is deprecated and will be removed in Pillow 10 (2023-07-01). Use Transpose.FLIP_LEFT_RIGHT instead. return img.transpose(Image.FLIP_LEFT_RIGHT)
Training seems to continue.
Line 333 in 24ffc1b
self.param should be param?
When running on a single host, around 10 worker initializations made in the following line fail. This is possibly due to exceeding the number of new concurrent connections to sshd (MaxStartups).
FedScale/core/evals/manager.py
Line 87 in 2af119c
Waiting for the previous connections to succeed, just like it is done for the PS, has resolved this issue for me.
FedScale/core/evals/manager.py
Line 74 in 2af119c
This issue might not surface in multi-node setups due to network latencies inherently delaying requests for new connections, as opposed to single-node setups where the request never leaves the node (i.e. when relayed to 127.0.0.1).
Please consider adding a delay before initiating the next connection.
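A rough sketch of the kind of pacing meant here; the ssh command string and variables are made up, not manager.py's actual code.

    import subprocess
    import time

    worker_ips = ['127.0.0.1'] * 10   # stand-in worker list
    ssh_user = 'ubuntu'

    for rank, ip in enumerate(worker_ips, start=1):
        cmd = f"ssh {ssh_user}@{ip} 'python executor.py --this_rank={rank}'"
        subprocess.Popen(cmd, shell=True)
        time.sleep(2)   # pause so sshd's MaxStartups burst limit is never exceeded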
For example, status
or job-status
to see the status of the current job; view-config
to see the configuration file used. We can discuss here to come up with a list of common things someone may need.
The following data is still missing after re-downloading just now.
train_img_to_path
train_img_to_target
test_img_to_path
test_img_to_target
It's a completely flat structure under job_conf, which makes it hard to follow and organize. A better hierarchical organization under this would allow us, for example, to separate aggregator-related params from worker-related ones, and likewise job-related params from trace-related params, etc.
The preprocessed Stackoverflow dataset is missing in https://github.com/SymbioticLab/FedScale/tree/master/dataset/data/stackoverflow.
Should the samplers be "random" and "oort", instead of "random" and "kuiper" as is mentioned in the file?
Lines 129 to 142 in fe2eb88
We notice there are some legacy issues in this tutorial. For example:
- it references AmberLJC/FedScale.git;
- from fedscale instead of from FedScale?
- "Partition the dataset by the client_data_mapping file".
Can you please try to update it? Thanks!
FedScale is currently built under the assumption that the central server communicates with clients through a push mechanism, where the server initiates signals (handshake/training/notTraining/etc.) to available devices. Is it possible to consider a pull-based system where the device initiates the communication by sending requests to the server to ask for next-step actions on a periodic basis? In a realistic setting, the device would periodically ping the server when its training criteria are met (enough data, sufficient battery, app open, etc.), and the server would respond with the model gradients for federated training if the client is selected.
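To sketch what that could look like on the device side, every name below is invented; it only illustrates the pull model, not an actual FedScale API.

    import random
    import time

    class FakeServerStub:
        # Stands in for a real RPC stub; the server decides the next action.
        def client_ping(self, client_id):
            return {'action': random.choice(['train', 'wait']), 'model': b''}
        def upload_model(self, client_id, update):
            pass

    def device_loop(stub, client_id, rounds=3, poll_interval=1):
        for _ in range(rounds):
            # The device initiates contact only when its own criteria are met
            # (enough data, charging, app open, ...), then asks what to do next.
            reply = stub.client_ping(client_id)
            if reply['action'] == 'train':
                update = b'locally-trained-weights'   # placeholder for local training
                stub.upload_model(client_id, update)
            time.sleep(poll_interval)

    device_loop(FakeServerStub(), client_id=7)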
This will make initial conda setup faster (basically, delete the dependency from the environment.yml), and people can avoid installing unnecessary packages.
The current install.sh assumes a lot of things. Until we have a very sophisticated version that auto-detects OS, GPU, etc. and does a lot of the heavy lifting, a slightly improved approach could be giving specific instructions for what to do and where.
environment.yml also has too many things to be installed that may only be needed for specific dataset/model combinations. Instead of forcing people to install everything, we can make the dataset- or model-specific dependencies part of the dataset download.sh step that updates the conda environment. This will avoid wholesale failure.
Backstory: I was trying to install FedScale on a pre-M1 MacBook Pro.
time_to_acc.pdf
I have run the femnist task with the following setting, but the accuracy has not changed after about 700 rounds. The training loss is around 15; any suggestions?
ps_ip: 127.0.0.1
worker_ips:
- 127.0.0.1:[2] # worker_ip: [(# processes on gpu) for gpu in available_gpus] eg. 10.0.0.2:[4,4,4,4] This node has 4 gpus
exp_path: /opt/shared/Loss/newFedScale/core
executor_entry: executor.py
aggregator_entry: aggregator.py
auth:
ssh_user: ""
ssh_private_key: ~/.ssh/id_rsa
setup_commands:
- source /opt/shared/anaconda3/bin/activate fedscale
- export NCCL_SOCKET_IFNAME='enp94s0f0'
job_conf:
- job_name: femnist # Generate logs under this folder: log_path/job_name/time_stamp
- log_path: /opt/shared/Loss/newFedScale/core/evals # Path of log files
- task: cv
- total_worker: 17 # Number of participants per round, we use K=100 in our paper, large K will be much slower
- data_set: femnist # Dataset: openImg, google_speech, stackoverflow
- data_dir: /opt/shared/Loss/newFedScale/dataset/data/femnist # Path of the dataset
- data_map_file: /opt/shared/Loss/newFedScale/dataset/data/femnist/client_data_mapping/train.csv
- device_conf_file: /opt/shared/Loss/newFedScale/dataset/data/device_info/client_device_capacity # Path of the client trace
- device_avail_file: /opt/shared/Loss/newFedScale/dataset/data/device_info/client_behave_trace
- model: shufflenet_v2_x2_0 # Models: e.g., shufflenet_v2_x2_0, mobilenet_v2, resnet34, albert-base-v2
- gradient_policy: fed-avg # {"fed-yogi", "fed-prox", "fed-avg"}, "fed-avg" by default
- eval_interval: 5 # How many rounds to run a testing on the testing set
- epochs: 1000 # Number of rounds to run this training. We use 1000 in our paper, while it may converge w/ ~400 rounds
- filter_less: 21 # Remove clients w/ less than 21 samples
- num_loaders: 2
- yogi_eta: 3e-5
- yogi_tau: 1e-8
- local_steps: 10
- learning_rate: 0.05
- batch_size: 2
- test_bsz: 20
mosharaf@Mosharafs-MacBook-Pro:~/Work/FedScale/dataset|master ⇒ python
Python 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 05:57:50)
[Clang 11.1.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from torch.utils.data import DataLoader
[1] 26092 segmentation fault python
Can anyone take a look at qfedavg, or has anyone run the new version of qfedavg?
https://github.com/SymbioticLab/FedScale/blob/master/core/optimizer.py#L26
I set q=1 and it never converges the way it did before (even though that earlier version was the wrong one). But I cannot find any reason why it fails.
Occasionally, a FileNotFoundError is thrown for the temp_model_path in the load_global_model function of executor.py:
Lines 238 to 242 in 33de86d
OS: Ubuntu 18.04 with bash as the default shell
Running on CPU with more than one worker
(11-20) 15:03:22 INFO [executor.py:326] Executor 2: Received (Event:TRAIN) from aggregator
Traceback (most recent call last):
File "/home/ubuntu/FedScale/core/executor.py", line 349, in <module>
executor.run()
File "/home/ubuntu/FedScale/core/executor.py", line 205, in run
self.event_monitor()
File "/home/ubuntu/FedScale/core/executor.py", line 332, in event_monitor
train_res = self.training_handler(clientId=clientId, conf=client_conf)
File "/home/ubuntu/FedScale/core/executor.py", line 266, in training_handler
client_model = self.load_global_model()
File "/home/ubuntu/FedScale/core/executor.py", line 240, in load_global_model
with open(self.temp_model_path, 'rb') as model_in:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/FedScale/core/evals/logs/femnist/1120_150258/executor/model_3.pth.tar'
Traceback (most recent call last):
File "/home/ubuntu/FedScale/core/executor.py", line 349, in <module>
executor.run()
File "/home/ubuntu/FedScale/core/executor.py", line 205, in run
self.event_monitor()
File "/home/ubuntu/FedScale/core/executor.py", line 332, in event_monitor
train_res = self.training_handler(clientId=clientId, conf=client_conf)
File "/home/ubuntu/FedScale/core/executor.py", line 266, in training_handler
client_model = self.load_global_model()
File "/home/ubuntu/FedScale/core/executor.py", line 240, in load_global_model
with open(self.temp_model_path, 'rb') as model_in:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/FedScale/core/evals/logs/femnist/1120_150258/executor/model_2.pth.tar'
(11-20) 15:03:23 INFO [executor.py:326] Executor 1: Received (Event:TRAIN) from aggregator
ps_ip: 10.30.72.19
worker_ips:
- 10.30.72.9:[1]
- 10.30.72.29:[1]
- 10.30.72.30:[1]
# - 10.30.72.31:[1]
# - 10.30.72.32:[1]
# - 10.30.72.33:[1]
# - 10.30.72.34:[1]
exp_path: $HOME/FedScale/core
executor_entry: executor.py
aggregator_entry: aggregator.py
auth:
ssh_user: ""
ssh_private_key: ~/.ssh/id_rsa
setup_commands:
- source $HOME/anaconda3/bin/activate fedscale
- export NCCL_SOCKET_IFNAME='eth0'
job_conf:
- job_name: femnist
- log_path: $HOME/FedScale/core/evals
- total_worker: 4
- data_set: femnist
- data_dir: $HOME/FedScale/dataset/data/femnist
- data_map_file: $HOME/FedScale/dataset/data/femnist/client_data_mapping/train.csv
- gradient_policy: yogi
- eval_interval: 30
- epochs: 3
- filter_less: 21
- num_loaders: 1
- yogi_eta: 3e-3
- yogi_tau: 1e-8
- local_steps: 20
- learning_rate: 0.05
- batch_size: 20
- test_bsz: 20
- malicious_factor: 4
In many places, FedScale uses $HOME assuming that a user has installed FedScale in their $HOME directory. Instead, it should have a $FEDSCALE_HOME variable or something like that and then have paths relative to that to avoid scripts silently failing.
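A small sketch of the idea; the variable and path names only mirror the ones used elsewhere on this page.

    import os

    # Resolve everything against an explicit FEDSCALE_HOME and fail loudly,
    # instead of assuming the checkout lives under $HOME.
    FEDSCALE_HOME = os.environ.get('FEDSCALE_HOME')
    if FEDSCALE_HOME is None:
        raise RuntimeError('Please set FEDSCALE_HOME to your FedScale checkout')

    data_map_file = os.path.join(FEDSCALE_HOME, 'dataset', 'data', 'femnist',
                                 'client_data_mapping', 'train.csv')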
It's too large for anyone to have the patience for, and it may even be prohibitive in terms of space on someone's laptop. It should be based on a smaller dataset like FEMNIST.
There is a lot of repetition between the different tutorials that doesn't make sense either.
Line 23 in fe2eb88
Should the gradient_square be the one computed with β1 based on the old gradients?
Algorithm 2, Line 12: https://arxiv.org/pdf/2003.00295.pdf
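For reference, the server-side Yogi update in Algorithm 2 of the cited paper builds the second moment from the previous v and β2; a sketch with my own variable names, not FedScale's:

    import torch

    def fedyogi_server_step(x, m, v, delta, eta=1e-2, beta1=0.9, beta2=0.99, tau=1e-3):
        # delta is the aggregated pseudo-gradient of the round.
        m = beta1 * m + (1 - beta1) * delta
        d2 = delta * delta
        v = v - (1 - beta2) * d2 * torch.sign(v - d2)   # line 12: uses beta2 and the old v
        x = x + eta * m / (torch.sqrt(v) + tau)
        return x, m, v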
Everything is basically in a flat hierarchy, and it is hard to figure out what might be related to what without going through the code.
Some of the tutorials are in colab only and some in jupyter only; ALL should be in md with links as needed. Having a concise markdown version is easier to skim through, whereas a notebook (jupyter or colab) has too much unnecessary stuff.
@dywsjtu can keep an eye on this when setting up the docs.
According to my observation, the client system behavior dataset that FedScale's examples currently rely on, i.e., the file dataset/data/client_device_capacity
, is a pickled Python dictionary containing 500,000 keys, each of which represents a client and is only associated with two values: an inference speed and a network bandwidth. In other words, it is both model-agnostic and data-agnostic when it comes to estimating the training speed: no matter (1) how large the ML model is or (2) how large a data sample is, the estimated time for training over a mini-batch of data always stays the same as long as the batch size does not vary, which is counterfactual.
Hence, I am thinking that client_device_capacity
is merely a small subset of the AI Benchmark
dataset that FedScale claims to embrace. According to Sec. 3.2 in the paper,
AI Benchmark provides the training and inference speed of diverse models (e.g., MobileNet) across a wide range of device models.
I hereby request a complete version of AI Benchmark for gaining more confidence with the time-related metrics reported by FedScale. Looking forward to your help!
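For anyone who wants to check what the trace actually contains, here is a quick inspection sketch; the structure of each per-client record is my guess at the schema, not documented.

    import pickle

    with open('dataset/data/device_info/client_device_capacity', 'rb') as f:
        capacity = pickle.load(f)

    print(len(capacity))               # ~500,000 client keys
    first_key = next(iter(capacity))
    print(capacity[first_key])         # per-client record: inference speed + bandwidth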
Hi Authors,
I noticed that environment.yml underwent a change last month. In particular, the last old version required torch==1.1.0 and torchvision==0.4.0. As torchvision==0.4.0 actually requires torch 1.2.0, an environment installation based on that version of environment.yml finally ends up with
torch==1.2.0
torchvision==0.4.0
In the latest version, however, it does not specify the versions of torch and torchvision, which then resolve to the current stable release. As of today, an installation based on this version of environment.yml yields
torch==1.9.0+cu102
torchvision==0.10.0+cu102
It would definitely be great to see FedScale able to run on the latest versions of torch and torchvision. However, in practice, I found that the latest torch and/or torchvision bring many, many warning messages like
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
to the log file whenever I perform a training task. This is quite space-consuming, and my current workaround is to downgrade torch and torchvision to
torch==1.2.0
torchvision==0.4.0
This is inspired by the related discussion on this site. I am curious whether you have a more decent workaround for this issue. Also, let me know if you think my solution is a general one, and I can create/extend a corresponding PR ;)
For example https://docs.python.org/3/library/logging.html
Trying to run experiments using the femnist dataset, a problem arises as a result of the following missing files:
FedScale/core/utils/femnist.py
Lines 109 to 121 in 26b83da
The README says 20 datasets, the download script has 16 or so, and the data directory has 15 or 16. The download script is also confusing and short-sighted: there may be more datasets than letters in the alphabet. A convention like --dataset-name would be better.
Creating an issue to track @singam-sanjay's progress.
Line 329 in aa0edfa
I am not sure why I got this bug when running this line:
AttributeError: 'list' object has no attribute 'dim'
And I changed it to
hs += (qfedq * np.float_power(loss+1e-10, (qfedq-1)) * torch.norm(torch.stack(([torch.norm( g, 2) for g in grads]))) + (1.0/learning_rate) * np.float_power(loss+1e-10, qfedq))
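For context, the AttributeError comes from handing torch.norm a plain Python list; a minimal sketch of the failing pattern versus the stacked fix, with stand-in gradients:

    import torch

    grads = [torch.randn(3, 3), torch.randn(5)]   # stand-in per-layer gradients

    # torch.norm(grads) raises AttributeError: 'list' object has no attribute 'dim'
    per_layer = torch.stack([torch.norm(g, 2) for g in grads])
    total_norm = torch.norm(per_layer)             # norm of the per-layer norms
    print(total_norm)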
The ideal solution would be native support. If that's too difficult, as a first step we should add a tutorial.
Creating a placeholder to keep track of documentation status. There is no other obvious place.