tl-system / plato
A federated learning framework to support scalable and reproducible research
License: Apache License 2.0
Describe the bug
A TypeError saying that the object 'DataSource' has no len() and is not subscriptable was raised when I tried to run the split learning example.
To Reproduce
Expected behavior
No error should be raised
Screenshots
Process Process-4:
Traceback (most recent call last):
File "\Programs\Python\Python39\lib\multiprocessing\process.py", line 315, in _bootstrap
self.run()
File "\Programs\Python\Python39\lib\multiprocessing\process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "\plato-home\plato\trainers\basic.py", line 238, in train_process
raise training_exception
File "\plato-home\plato\trainers\basic.py", line 127, in train_process
self.train_model(config, trainset, sampler.get(), cut_layer)
File "\plato-home\plato\examples\split_learning\split_learning_trainer.py", line 38, in train_model
iterations_per_epoch = np.ceil(len(trainset) / batch_size).astype(int)
File "\plato-home\plato\plato\datasources\feature_dataset.py", line 10, in __len__
return len(self.dataset)
TypeError: object of type 'DataSource' has no len()
OS environment:
Additional context
Possible Solution:
I resolved the error by first adding the code below to the class DataSource in file plato/datasources/feature.py:
def __len__(self):
    return len(self.trainset)
Then the program raised another TypeError saying that the 'DataSource' object is not subscriptable, as follows:
Process Process-4:
Traceback (most recent call last):
File "\Programs\Python\Python39\lib\multiprocessing\process.py", line 315, in _bootstrap
self.run()
File "\Programs\Python\Python39\lib\multiprocessing\process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "\plato-home\plato\trainers\basic.py", line 238, in train_process
raise training_exception
File "\plato-home\plato\trainers\basic.py", line 127, in train_process
self.train_model(config, trainset, sampler.get(), cut_layer)
File "\plato-home\examples\split_learning\split_learning_trainer.py", line 65, in train_model
for __, (examples, labels) in enumerate(train_loader):
File "\Users\venv\lib\site-packages\torch\utils\data\dataloader.py", line 521, in __next__
data = self._next_data()
File "\Users\venv\lib\site-packages\torch\utils\data\dataloader.py", line 561, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "\Users\venv\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "\Users\venv\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "\plato-home\plato\datasources\feature_dataset.py", line 13, in __getitem__
image, label = self.dataset[item]
TypeError: 'DataSource' object is not subscriptable
So I added another method (shown below) to the class DataSource in plato/datasources/feature.py:
def __getitem__(self, item):
    return self.trainset[item]
This resolves the error, but I'm not sure whether it is the right way to go.
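For readers following along, the workaround above amounts to making the data source behave like a sequence by delegating to its training set. A minimal, framework-free sketch follows. Both classes are hypothetical stand-ins (a plain list plays the role of the training set) and are not Plato's actual classes. Whether the wrapper should instead be handed the training set directly is left open, as in the report.

```python
class DataSource:
    """Hypothetical stand-in for a data source that owns a training set."""

    def __init__(self, trainset):
        self.trainset = trainset

    def __len__(self):
        # Delegate len() to the underlying training set.
        return len(self.trainset)

    def __getitem__(self, item):
        # Delegate indexing so that a loader can fetch individual samples.
        return self.trainset[item]


class FeatureDataset:
    """Wrapper that, like the one in the traceback, calls len() and [] on its input."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, item):
        image, label = self.dataset[item]
        return image, label


source = DataSource([("img0", 0), ("img1", 1), ("img2", 0)])
wrapped = FeatureDataset(source)
```

With both methods defined, len(wrapped) and wrapped[i] work because each call is forwarded down to the list.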
Development of an Android FL client to further enhance the simulation of FL on mobile devices
Update Split Learning so that it works with the current release of Plato.
The following lines of code in clients/base.py need to be refactored:
if 'fedrl' in data:
    # Update the number of local aggregation rounds
    Config().cross_silo = Config().cross_silo._replace(
        rounds=data['fedrl'])
    Config().training = Config().training._replace(
        rounds=data['fedrl'])
When training models in a separate process in the trainers, a model filename should be unique to the client ID.
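One way to read this request is sketched below, assuming the filename is built from the model name, the client ID, and a run identifier. The function name, its arguments, and the .pth suffix are illustrative assumptions, not Plato's existing naming scheme.

```python
import os


def model_filename(model_name, client_id, run_id=None):
    """Build a per-client checkpoint name; all arguments are illustrative."""
    # Fall back to the current process ID when no run identifier is given.
    run_id = os.getpid() if run_id is None else run_id
    return f"{model_name}_{client_id}_{run_id}.pth"
```

Including the client ID keeps two clients that train in separate processes from overwriting each other's file, and the run identifier does the same for two experiments started from the same directory.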
The default values in configuration files can be better tuned so that most configuration files can be shorter. These default values should assume that a single server, with one or multiple GPUs, is used for most runs.
Some ideas:
server.comm_simulation should default to true.
data.data_path, server.model_dir, server.checkpoint_dir, and results.result_dir should all be relative to the base path.
server.ping_timeout and server.ping_interval should have sufficiently large default values.
I occasionally get RuntimeError: unable to open shared memory object </torch_68252_3878038395> in read-write mode. Running the same command again may not reproduce the error.
I used this framework to reproduce several algorithms that are meant to accelerate convergence under IID and non-IID data distributions, such as FedAdp, FedAtt, and FedProx, all of which are in your ./examples directory. I ran these algorithms and plotted the accuracy on the test dataset, and found that the curves show almost no difference. I found no bugs or execution problems when I checked the code. Could you show me some examples and settings in which these algorithms converge noticeably faster than FedAvg? I would greatly appreciate a brief explanation and some advice on reproducing these algorithms.
Best Regards.
Describe the bug
plato/plato/algorithms/tensorflow/fedavg.py
Lines 22 to 23 in f9a8f77
Expected behavior
In federated learning, the server should not have direct access to the clients' local data.
This code assumes that the server can access all the data when it creates some objects. Maybe we can separate the dataset from its metadata (number of classes, labels, etc.).
Additional context
This is not a runtime bug.
We trained for 500 epochs and the test results were still zero. We then checked the code and found that NMS was not applied on line 310 of plato/trainers/yolo.py, and that the shape of 'pred' on line 312 differed from the shape of 'pred' in the original YOLOv5. After retraining with NMS added, the results for 100 and 300 epochs are as follows:
Describe the bug
I guess it is a mechanism, which detects some unused sockets and closes them.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
[INFO][05:12:15]: [Client # 4] Contacting the central server.
[INFO][05:12:15]: [Client # 4] Connecting to the server at http://172.18.0.2:30496.
[INFO][05:12:15]: [Client # 4] Connected to the server.
[INFO][05:12:15]: [Client # 4] Waiting to be selected.
packet queue is empty, aborting
[ERROR][06:12:20]: packet queue is empty, aborting
OS environment (please complete the following information):
Is your feature request related to a problem? Please describe.
In cross-device practice, clients will join and leave from time to time, instead of staying online all the time. Moreover, their behaviors differ from one another, which is referred to as state heterogeneity in the literature [1, 2, 3]. We are thus interested in enhancing Plato with the support for emulating/simulating clients' variability in the available status.
Describe the solution you'd like
From our perspective, clients' dropout needs to be implemented in a programmatic way. For example, we can introduce the use of a predefined trace, of which each entry indicates the transition time of a particular client's availability. For example, an entry may look like
[client_id]: [starting status], [transition time (in seconds) t1, t2, ...]
e.g., 1: on, [54, 890, 2042, ...]
Of course, the trace can also be generated in a programmatic way, e.g., with each client's arrival following a Poisson distribution, if appropriate. Thus, we anticipate that the generation of the availability traces can be added as a library in ~/packages.
Then, with compatible parsing logic, Plato's runtime can be informed of clients' availability dynamics, and act accordingly. Specifically,
For emulation mode, instead of performing start_clients() in one go at the very beginning, we may want to make the action of client starting one kind of event that is triggered by time, following the loaded trace. Similarly, client shutdown may also be regarded as a time-driven event.
For simulation mode, it should be as simple as varying the set of selectable clients according to the trace, i.e., the major change of code might take place at the select_clients() or choose_clients() part.
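To make the proposal concrete, here is a rough sketch of the simulation-mode idea: parse trace entries of the form shown above, decide which clients are online at a given time, and optionally generate transition times from a Poisson process (exponential gaps between events). All function names are hypothetical, and the sketch assumes that every transition time toggles the client between on and off.

```python
import ast
import random


def parse_entry(line):
    """Parse '<client_id>: <on|off>, [t1, t2, ...]' into (id, starts_on, times)."""
    client_part, rest = line.split(":", 1)
    status_part, times_part = rest.split(",", 1)
    times = ast.literal_eval(times_part.strip())
    return int(client_part), status_part.strip() == "on", sorted(times)


def is_available(starts_on, times, now):
    """Each transition time at or before `now` flips the availability once."""
    flips = sum(1 for t in times if t <= now)
    return starts_on if flips % 2 == 0 else not starts_on


def selectable_clients(trace, now):
    """Return the IDs of the clients that the trace marks as online at `now`."""
    return [cid for cid, (starts_on, times) in sorted(trace.items())
            if is_available(starts_on, times, now)]


def poisson_transitions(rate, horizon, rng):
    """Generate transition times up to `horizon` with exponential gaps."""
    now, times = 0.0, []
    while True:
        now += rng.expovariate(rate)
        if now > horizon:
            return times
        times.append(now)


trace = {}
for entry in ["1: on, [54, 890, 2042]", "2: off, [10, 100]"]:
    cid, starts_on, times = parse_entry(entry)
    trace[cid] = (starts_on, times)
```

A client selection routine could then sample only from selectable_clients(trace, now), where now is the wall-clock or simulated time on the server.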
Major concern: it looks like we have to make a groundbreaking change, e.g., adding a new module as we do not think that this logic is a new kind of server/client/algorithm... We would like to listen to the authors' opinions if any. Thanks!
Additional context
[1] C. Yang, Q. Wang, M. Xu, Z. Chen, K. Bian, Y. Liu, and X. Liu, Characterizing Impacts of Heterogeneity in Federated Learning upon Large-Scale Smartphone Data, in WWW, 2021.
[2] F. Lai, Y. Dai, X. Zhu, and M. Chowdhury, FedScale: Benchmarking Model and System Performance of Federated Learning, in arXiv:2105.11367, 2021.
[3] Z. Jiang and W. Wang, System Optimization in Synchronous Federated Training: A Survey, in arXiv:2109.03999, 2021.
When running experiments related to cross-silo FL, e.g., using fedavg_cross_silo or fedrl, one may occasionally get a ValueError:
Traceback (most recent call last):
File "/Users/user/opt/miniconda3/envs/federated/lib/python3.8/site-packages/websockets/server.py", line 191, in handler
await self.ws_handler(self, path)
File "/Users/user/Documents/GitHub/fl-plato/plato/servers/base.py", line 117, in serve
await self.process_reports()
File "/Users/user/Documents/GitHub/fl-plato/plato/servers/fedavg.py", line 150, in process_reports
self.accuracy = self.trainer.test(self.testset)
File "/Users/user/Documents/GitHub/fl-plato/plato/trainers/trainer.py", line 298, in test
self.pause_training()
File "/Users/user/Documents/GitHub/fl-plato/plato/trainers/base.py", line 38, in pause_training
trainer_count = int(file.read())
ValueError: invalid literal for int() with base 10: ''
It seems that ./running_trainers may sometimes be empty; could several processes reading it at the same time cause this problem?
Also, an ID may need to be added to this file name in case several experiments are running at the same time.
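One possible mitigation is sketched below, under two assumptions: that the empty read happens while another process is rewriting the file, and that a per-experiment identifier (the hypothetical run_id) is available. Writing to a temporary file and renaming it with os.replace means a reader never sees a half-written file. This does not make the read-increment-write sequence atomic, which would still need a lock.

```python
import os
import tempfile


def counter_path(run_id, directory="."):
    """Per-experiment file name; `run_id` is a hypothetical identifier."""
    return os.path.join(directory, f"running_trainers_{run_id}")


def write_count(path, count):
    """Write to a temporary file, then atomically replace the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        tmp.write(str(count))
    os.replace(tmp_path, path)


def read_count(path, default=0):
    """Return the stored count, or `default` if the file is missing or empty."""
    try:
        with open(path, "r") as file:
            content = file.read().strip()
    except FileNotFoundError:
        return default
    return int(content) if content else default
```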
Hi there,
The attribute per_round is set to 20 in examples/fedasync/fedasync_MNIST_lenet5.yml.
I thought there is no client selection mechanism in FedAsync. (paper: Asynchronous Federated Optimization)
Is my understanding wrong?
Thanks.
Describe the bug
If I run the program with differential privacy, then there is a bug.
To Reproduce
Uncomment the lines following whether to apply differential privacy in configs/EMNIST/fedavg_lenet5.yml
Execute ./run -c configs/EMNIST/fedavg_lenet5.yml
Expected behavior
The program should run smoothly without bug (I did not get any error before).
OS environment (please complete the following information):
Describe the bug
In __init__() of class Server in file plato/servers/fedavg_cs.py, the function uses the attribute self.do_edge_test at line 38, which has not been defined at that point.
To Reproduce
Steps to reproduce the behavior:
fedavg_cs.Server:
examples/axiothea/axiothea.py
examples/cs_maml/cs_maml.py
examples/rhythm/rhythm.py
examples/tempo/tempo.py
Expected behavior
No error should be raised when running these examples
Screenshots
[INFO][15:51:14]: [Server #18644] Started training on 1 clients with 1 per round.
Traceback (most recent call last):
File "\plato-project-home\examples\tempo\tempo.py", line 19, in <module>
main()
File "\plato-project-home\examples\tempo\tempo.py", line 11, in main
server = tempo_server.Server()
File "\plato-project-home\examples\tempo\tempo_server.py", line 20, in __init__
super().__init__()
File "\plato-project-home\plato\servers\fedavg_cs.py", line 38, in __init__
and not self.do_edge_test):
AttributeError: 'Server' object has no attribute 'do_edge_test'
OS environment:
Additional context
none
Describe the bug
A URLError (SSL: certificate verify failed) was raised when I tried to run the adaptive freezing example with CIFAR10 and ResNet18.
To Reproduce
os.environ['config_file'] = 'adaptive_freezing_CIFAR10_resnet18.yml'
Expected behavior
No error should be raised and the content should be downloaded successfully
OS environment (please complete the following information):
Additional context
Possible Solution:
I resolved the error by adding the code below to file plato/datasources/cifar10.py:
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
I think the code bypasses the error by disabling SSL certificate verification. I'm not sure whether that is a good solution.
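As a point of comparison, the standard library can also build a context that keeps verification on and, if needed, trusts an explicitly supplied CA bundle. The sketch below only shows that alternative. The helper name is made up, and whether the dataset download path in question accepts a custom context is not something this report establishes.

```python
import ssl


def make_verified_context(cafile=None):
    """Return an SSL context that still verifies certificates.

    `cafile` is an optional path to a CA bundle (for example, one exported
    from the local trust store); None uses the system defaults.
    """
    return ssl.create_default_context(cafile=cafile)


context = make_verified_context()
```

Such a context can be passed to urllib.request.urlopen(url, context=context), which keeps hostname and certificate checks in place.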
When using FashionMNIST and LeNet-5 as the model, the DataLoader would fail with numSamples = 0 when one client is used with all 60000 samples.
Describe the bug
The adaptive freezing example (examples/adaptive_freezing) always shows 0% stable parameters, even with an incredibly large stability_threshold value (e.g. 0.9) and random_freezing turned on.
I think that probably all parameters on the selected workers are misidentified as unstable, and are therefore sent to the server to join training.
To Reproduce
Steps to reproduce the behavior:
In adaptive_freezing_CIFAR10_resnet18.yml, set rounds: 50, stability_threshold: 0.9, random_freezing: true
python adaptive_freezing.py --config=./adaptive_freezing_CIFAR10_resnet18.yml
Expected behavior
The stable parameters should be correctly identified and excluded from transmission.
Screenshots
OS environment (please complete the following information):
Additional context
After tracing the code, I think that inside adaptive_freezing_algorithm.py, the function update_sync_mask() should be
self.wake_up_round[name][indices] = self.current_round + self.frozen_durations[name][indices]
self.sync_mask[name] = (self.wake_up_round[name] >= self.current_round)
since sync_mask[name] == 1 indicates active parameters, which is the opposite of M_is-frozen in your paper.
However, after this modification, the Algorithm object on the server side would get a different sync_mask[name] from the one on the client side, which would cause a RuntimeError.
In your version, sync_mask[name] stays true for all elements on both the server and the client, and therefore might not raise any error.
Describe the bug
If I run the program on ResNet-18 and VGG-16 with differential privacy, then there is a bug?
To Reproduce
Steps to reproduce the behavior:
Enable differential privacy in configs/CIFAR10/fedavg_vgg16.yml or configs/CIFAR10/fedavg_resnet18.yml.
Execute ./run -c configs/CIFAR10/fedavg_vgg16.yml or ./run -c configs/CIFAR10/fedavg_resnet18.yml
Expected behavior
The program is expected to run smoothly without errors.
Screenshots
OS environment (please complete the following information):
Describe the bug
Command: python fedasync.py -c fedasync_MNIST_lenet5.yml in folder /plato/examples/fedasync.
The following error messages appear after a while, and the training session gets stuck.
What is the problem?
Error Messages
[INFO][22:17:53]: [Client #95] Sent 0.25 MB of payload data to the server (simulated).
[INFO][22:17:53]: [Server #3351624] Received 0.25 MB of payload data from client #95 (simulated).
[INFO][22:17:53]: [Server #3351624] Adding client #3 to the list of clients for aggregation.
[INFO][22:17:53]: [Server #3351624] Aggregating 1 clients in total.
[ERROR][22:17:53]: Task exception was never retrieved
future: <Task finished name='Task-243' coro=<AsyncServer._handle_event_internal() done, defined at /home/xuan/anaconda3/envs/plato/lib/python3.9/site-packages/socketio/asyncio_server.py:521> exception=ValueError('too many values to unpack (expected 3)')>
Traceback (most recent call last):
File "/home/xuan/anaconda3/envs/plato/lib/python3.9/site-packages/socketio/asyncio_server.py", line 523, in _handle_event_internal
r = await server._trigger_event(data[0], namespace, sid, *data[1:])
File "/home/xuan/anaconda3/envs/plato/lib/python3.9/site-packages/socketio/asyncio_server.py", line 568, in _trigger_event
return await self.namespace_handlers[namespace].trigger_event(
File "/home/xuan/anaconda3/envs/plato/lib/python3.9/site-packages/socketio/asyncio_namespace.py", line 37, in trigger_event
ret = await handler(*args)
File "/home/xuan/anaconda3/envs/plato/lib/python3.9/site-packages/plato/servers/base.py", line 48, in on_client_report
await self.plato_server.client_report_arrived(sid, data['id'],
File "/home/xuan/anaconda3/envs/plato/lib/python3.9/site-packages/plato/servers/base.py", line 692, in client_report_arrived
await self.process_client_info(client_id, sid)
File "/home/xuan/anaconda3/envs/plato/lib/python3.9/site-packages/plato/servers/base.py", line 784, in process_client_info
await self.process_clients(client_info)
File "/home/xuan/anaconda3/envs/plato/lib/python3.9/site-packages/plato/servers/base.py", line 913, in process_clients
await self.process_reports()
File "/media/massstorage/xuan/plato/examples/fedasync/fedasync_server.py", line 57, in process_reports
__, __, client_staleness = self.updates[0]
ValueError: too many values to unpack (expected 3)
OS environment (please complete the following information):
Thanks.
Describe the bug
When using the asynchronous mode, the server will not be able to proceed in the initial round, waiting forever for clients that will never arrive.
To Reproduce
Use the following configuration file with the ./run -c command:
clients:
    # Type
    type: simple
    # The total number of clients
    total_clients: 500
    # The number of clients selected in each round
    per_round: 50
    # Should the clients compute test accuracy locally?
    do_test: false
    # Whether client heterogeneity should be simulated
    speed_simulation: true
    # The distribution of client speeds
    simulation_distribution:
        distribution: pareto
        alpha: 1
    # The maximum amount of time for clients to sleep after each epoch
    max_sleep_time: 30
    # Should clients really go to sleep, or should we just simulate the sleep times?
    sleep_simulation: false
    # If we are simulating client training times, what is the average training time?
    avg_training_time: 20
    random_seed: 1
server:
    address: 127.0.0.1
    port: 8000
    ping_timeout: 36000
    ping_interval: 36000
    # Should we operate in synchronous mode?
    synchronous: false
    # Should we simulate the wall-clock time on the server? Useful if max_concurrency is specified
    simulate_wall_time: true
    # What is the minimum number of clients that need to report before aggregation begins?
    minimum_clients_aggregated: 15
    # What is the staleness bound, beyond which the server should wait for stale clients?
    staleness_bound: 10
    # Should we send urgent notifications to stale clients beyond the staleness bound?
    request_update: false
    random_seed: 1
data:
    # The training and testing dataset
    datasource: MNIST
    # Number of samples in each partition
    partition_size: 600
    # IID or non-IID?
    sampler: noniid
    # The concentration parameter for the Dirichlet distribution
    concentration: 0.5
    # The random seed for sampling data
    random_seed: 1
trainer:
    # The type of the trainer
    type: basic
    # The maximum number of training rounds
    rounds: 5
    # The maximum number of clients running concurrently
    max_concurrency: 3
    # Number of epochs for local training in each communication round
    epochs: 1
    batch_size: 32
    optimizer: SGD
    learning_rate: 0.01
    momentum: 0.9
    weight_decay: 0.0
    # The machine learning model
    model_name: lenet5
algorithm:
    # Aggregation algorithm
    type: fedavg
Is your feature request related to a problem? Please describe.
The learning rate (lr) schedule does not work in the current implementation of Plato's trainer, i.e., plato/trainers/basic.py. Training effectively runs with a constant learning rate. The main reason is that the trainer creates a new lr schedule in each round, so the epoch count within a round always starts from 1. If the lr schedule is driven by this local epoch, the lr is never adjusted the way it would be in centralized training. For example, with an initial learning rate of 0.1 and 5 local epochs, the learning rate stays at 0.1 during the whole training process (100 communication rounds), no matter which lr schedule is used.
Anyone can reproduce this potential issue by tracking the learning rate in each round.
Describe the solution you'd like
However, in general machine learning practice, the learning rate is decreased as training progresses. I believe that this is why Plato supports a learning rate schedule in the trainer.
It would be better to run a training process in which the scheduler decays the client's learning rate when the training loss plateaus.
Describe alternatives you've considered
Therefore, to introduce this property (a learning rate that changes with the training status) to Plato's training framework, the simplest way is to define a variable computed as:
lr_schedule_base_epoch = (current_round - 1) * epochs
Then, the defined lr_schedule can be updated as:
lr_schedule.step(lr_schedule_base_epoch + epoch)
in the local training stage.
This lets Plato vary the learning rate as the training (communication round) progresses.
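The arithmetic can be checked with a small, framework-free sketch. The step-decay rule and its parameters (step_size, gamma) are assumptions chosen for illustration; only the lr_schedule_base_epoch formula comes from the proposal above.

```python
def step_decay_lr(initial_lr, global_epoch, step_size=30, gamma=0.1):
    """Step decay evaluated on a global epoch index (illustrative schedule)."""
    return initial_lr * (gamma ** (global_epoch // step_size))


def lr_for(current_round, epoch, epochs, initial_lr=0.1):
    """Learning rate for local `epoch` (1-based) of `current_round` (1-based)."""
    lr_schedule_base_epoch = (current_round - 1) * epochs
    return step_decay_lr(initial_lr, lr_schedule_base_epoch + epoch)
```

With 5 local epochs, a schedule driven only by the local epoch never gets past epoch 5 and stays at the initial rate, whereas lr_for(7, 1, 5) is evaluated at global epoch 31 and has already decayed once.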
Additional context
I have implemented this lr schedule behavior in my branch, as a changeable lr is important to my model training and to most machine learning methods. Without a changeable learning rate, it is hard to beat state-of-the-art methods equipped with a dynamic learning rate. I am not sure whether this is necessary for others, but it would be a valuable feature to add to Plato.
Describe the bug
The bug happens during the training process on the client. The error message is "expected scalar type Half but found Float".
To Reproduce
Steps to reproduce the behavior in CPU-only cluster:
./run -c configs/YOLO/fedavg_yolov5.yml
./run -c configs/YOLO/mistnet_yolov5.yml <- happens at this phase.
[INFO][09:27:57]: Training on client #0 failed.
Process Process-2:
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "plato/plato/trainers/basic.py", line 199, in train_process
raise training_exception
File "plato/plato/trainers/basic.py", line 109, in train_process
self.train_model(config, trainset, sampler.get(), cut_layer)
File "plato/plato/trainers/yolo.py", line 182, in train_model
pred = self.model.forward_from(imgs, cut_layer)
File "plato/plato/models/yolo.py", line 67, in forward_from
x = m(x) # run
File "lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "lib/python3.8/site-packages/yolov5/models/common.py", line 138, in forward
return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
File "lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "lib/python3.8/site-packages/yolov5/models/common.py", line 42, in forward
return self.act(self.bn(self.conv(x)))
File "lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
return self._conv_forward(input, self.weight, self.bias)
File "lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: expected scalar type Half but found Float
Expected behavior
OS environment (please complete the following information):
Additional context
For the non-IID Dirichlet partition, it seems that each client gets a WeightedRandomSampler with a Dirichlet distribution to sample data from the whole dataset, instead of a truly partitioned local dataset.
This may cause each client to see data samples that it has never seen before, which may contradict the non-IID setting, in which a client can never see other clients' samples.
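For contrast, a disjoint Dirichlet partition can be sketched as follows: for each class, the sample indices are split across clients in proportions drawn from a Dirichlet distribution, so that every index lands on exactly one client. This is an illustration of the alternative being described, not Plato's sampler, and the function names are hypothetical.

```python
import random
from collections import defaultdict


def dirichlet(alpha, size, rng):
    """Sample from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [draw / total for draw in draws]


def partition(labels, num_clients, concentration, seed=1):
    """Split sample indices into disjoint per-client lists, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    clients = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        shares = dirichlet(concentration, num_clients, rng)
        start, cumulative = 0, 0.0
        for client, share in enumerate(shares):
            cumulative += share
            if client == num_clients - 1:
                # The last client takes the remainder so no sample is dropped.
                end = len(indices)
            else:
                end = round(cumulative * len(indices))
            clients[client].extend(indices[start:end])
            start = end
    return clients


labels = [i % 10 for i in range(1000)]
parts = partition(labels, num_clients=5, concentration=0.5)
```

A smaller concentration value makes the per-class shares more skewed, which is the usual way to control how non-IID the clients are.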
Federated learning is widely used in several fields, e.g., CV and NLP.
In some cases, the number of model parameters could reach the billions. Plato should also support multiple servers and use model partitioning methods (sharding) to distribute the workload.
Describe the bug
When running examples/fei.py with config fei_FashionMNIST_lenet5.yml, an IndexError was raised in rl_server.py when the server tried to perform federated averaging in the first round.
To Reproduce
Steps to reproduce the behavior:
python examples/fei/fei.py -c fei_FashionMNIST_lenet5.yml
Expected behavior
No error should be raised
Screenshots
The following snippet is the Traceback of the error
[INFO][14:40:48]: [Server #5476] All 10 client report(s) received. Processing.
[INFO][14:40:48]: [RL Agent] Preparing action...
[INFO][14:40:48]: [RL Agent] Selecting action...
[ERROR][14:40:48]: Task exception was never retrieved
...
Traceback (most recent call last):
...
File "path-to-plato\plato\plato\servers\fedavg.py", line 127, in aggregate_weights
update = await self.federated_averaging(updates)
File "path-to-plato\plato\plato\utils\reinforcement_learning\rl_server.py", line 84, in federated_averaging
avg_update[name] += delta * self.smart_weighting[i][0]
IndexError: invalid index to scalar variable.
OS environment (please complete the following information):
Additional context
I tried removing the [0] at the end of line 84 in rl_server.py, and the program seems to proceed normally without errors. However, I checked the value of self.smart_weighting, and it is always a vector of ten 0.1 values in each round. I'm unsure whether that is the expected behavior.
Running MNIST/mistnet_resnet18.yaml (modified from mistnet_lenet5.yml) runs out of memory on my machine (32 GB).
servers/base.py Line 105
Describe the bug
The previous model weights are not loaded at the beginning of each round. In each round, training starts from a randomly initialized model rather than from the previous model (when differential privacy is enabled).
It seems that the problem is not in the train_model function in diff_privacy.py. I printed the weights of the first convolution layer before self.make_model_private() in train_model in diff_privacy.py; in every round, the weights are always
conv1.weight tensor([[[[ 0.0702, 0.0594, 0.1692, -0.0529, -0.1473],
[ 0.0173, -0.1123, -0.0269, -0.1801, -0.0909],......
To Reproduce
Steps to reproduce the behavior:
Set total_clients and per_round to 1
Execute ./run -c ./configs/EMNIST/fedavg_lenet5.yml
Expected behavior
At the beginning of each round, the model should load the model weights at the end of the previous round.
Screenshots
At the end of the first round (the loss decreases to around 3.2)
At the beginning of the second round (the loss is back to 3.85, which is very likely the result of a reinitialized model)
OS environment (please complete the following information):
Describe the bug
Running tests/config_tests.py, tests/data_tests.py, and tests/sampler_tests.py failed due to missing configuration files:
FileNotFoundError: [Errno 2] No such file or directory: '/home/bli/plato/configs/Kinetics/Models/kinetics_full_models.yml'
To Reproduce
python tests/config_tests.py
python tests/sampler_tests.py
python tests/data_tests.py
Expected behavior
This unit test is expected to pass.
OS environment (please complete the following information):
Additional context
PyLint errors also exist when statically checking this code, such as attributes defined outside __init__, variable names that do not conform to the snake-case naming convention in PEP 8, and missing method docstrings. These PyLint errors should also be fixed as far as possible.
Got this error when running on Compute Canada with more than 20 clients.
With the current configuration for the WideResNet model, training does not converge. Things that need to be added or fixed:
Describe the bug
If the users want to implement their own Algorithm class, Trainer class, and Server class, there is no way to instantiate these three classes following the old ways:
trainer = fedrep_trainer.Trainer
algorithm = fedrep_algorithm.Algorithm(trainer=trainer)
client = fedrep_client.Client(algorithm=algorithm, trainer=trainer)
server = fedrep_server.Server(algorithm=algorithm, trainer=trainer)
or
trainer = fedrep_trainer.Trainer()
algorithm = fedrep_algorithm.Algorithm(trainer=trainer)
client = fedrep_client.Client(algorithm=algorithm, trainer=trainer)
server = fedrep_server.Server(algorithm=algorithm, trainer=trainer)
To Reproduce
Steps to reproduce the behavior:
Three methods, which implement their own algorithms, trigger this bug: examples/adaptive_freezing or examples/split_learning in the main branch, or examples/fedrep in the personalizedFL branch.
python examples/fedrep/fedrep.py -c examples/fedrep/fedrep_MNIST_lenet5.yml
AttributeError: type object 'Trainer' has no attribute 'model'
Expected behavior
The code is expected to run smoothly without this error at the starting point.
Additional context
This bug is easy to locate: the Algorithm receives the instantiated trainer as a parameter and then extracts the trainer's model, whereas the Client class and Server class receive the Trainer class as a parameter and instantiate self.trainer internally, as shown by line 120 of servers/fedavg.py: self.trainer = self.custom_trainer(model=self.model)
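The mismatch can be reproduced in isolation. In the sketch below, all three classes are simplified stand-ins with made-up bodies; only the pattern (the Algorithm reads trainer.model, the Server calls the trainer class) is taken from the description above.

```python
class Trainer:
    """Hypothetical trainer; `model` only exists on instances."""

    def __init__(self, model=None):
        self.model = model if model is not None else "default-model"


class Algorithm:
    """Expects a trainer instance because it reads `trainer.model`."""

    def __init__(self, trainer):
        self.model = trainer.model


class Server:
    """Expects a trainer class because it instantiates the trainer itself."""

    def __init__(self, trainer=None):
        self.custom_trainer = trainer
        self.trainer = self.custom_trainer(model="server-model")
        # Build the algorithm only once a trainer instance exists.
        self.algorithm = Algorithm(trainer=self.trainer)


def fails_with_class():
    """Pass the class itself to Algorithm, as in the first snippet."""
    try:
        Algorithm(trainer=Trainer)
    except AttributeError as error:
        return str(error)
    return None
```

One consistent arrangement, shown in Server above, is to hand over classes and let the server or client build the instances in order. Whether Plato should adopt that convention is a design question for the maintainers.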
Is your feature request related to a problem? Please describe.
In cross-platform cases, it is quite possible for the client and the server to use different AI frameworks. In my case, the client side is an Atlas500, which does not support any AI framework other than NNRT. NNRT has no concept of a tensor and can only transfer representation features as numpy arrays. However, on the server side, other AI frameworks such as PyTorch expect the input to be a Tensor, so the data type has to be converted depending on the framework.
Describe the solution you'd like
It would be convenient to share a general data transfer protocol among TensorFlow, PyTorch, MindSpore, NNRT, and so on. For example, we could always transfer data as numpy arrays and perform the data type conversion when the server side receives the data, no matter which AI framework is used on the client side:
protocol Myprotocol {
data: extracted_features (numpy array),
func CheckDataIntergrity(),
func DataTypeTransfer() ... }
The DataTypeTransfer function will transform the data type from numpy array to the target AI platform.
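A dependency-free sketch of such an envelope is given below. It uses the standard library's array module in place of numpy so that it runs anywhere; with numpy, ndarray.tobytes and numpy.frombuffer would play the same role. The field names, the SHA-256 checksum, and the converter argument are all assumptions made for illustration, and native byte order is assumed on both ends.

```python
import array
import hashlib


def pack_features(values, shape):
    """Pack float features into a framework-neutral envelope (hypothetical format)."""
    raw = array.array("f", values).tobytes()
    return {
        "dtype": "float32",
        "shape": tuple(shape),
        "data": raw,
        "sha256": hashlib.sha256(raw).hexdigest(),
    }


def check_data_integrity(envelope):
    """Recompute the digest and compare it with the one that was sent."""
    return hashlib.sha256(envelope["data"]).hexdigest() == envelope["sha256"]


def data_type_transfer(envelope, converter=list):
    """Decode the bytes and hand them to a framework-specific converter.

    `converter` is a placeholder: a receiver would pass whatever builds its
    own tensor type from a flat sequence of floats.
    """
    if not check_data_integrity(envelope):
        raise ValueError("payload failed the integrity check")
    flat = array.array("f")
    flat.frombytes(envelope["data"])
    return converter(flat), envelope["shape"]
```

The sender only needs to produce bytes plus a shape, and each receiving framework supplies its own converter.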
Is your feature request related to a problem? Please describe.
The exchange of model parameters, features, and gradients incurs a heavy network load in some federated learning algorithms. We should provide optional model/feature compression for the transferred data.
Describe the solution you'd like
Optional compression algorithms such as distillation, quantisation, and deep compression can be implemented with the Processor interface. Servers and clients can choose to apply compression within the processor pipeline.
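As a rough illustration of what a compression step in such a pipeline might look like, the sketch below applies uniform 8-bit quantization to a flat list of floats and reverses it on the receiving side. The two classes are self-contained stand-ins with a process() method. They do not subclass Plato's Processor, whose exact interface is not shown in this request.

```python
class QuantizeProcessor:
    """Hypothetical processor: uniform 8-bit quantization of a list of floats."""

    def process(self, data):
        low, high = min(data), max(data)
        # Guard against a zero range when all values are equal.
        scale = (high - low) / 255 or 1.0
        codes = bytes(round((x - low) / scale) for x in data)
        return {"low": low, "scale": scale, "codes": codes}


class DequantizeProcessor:
    """Inverse step applied by the receiver."""

    def process(self, data):
        return [data["low"] + code * data["scale"] for code in data["codes"]]


def run_pipeline(processors, data):
    """Apply each processor in order, feeding the output of one into the next."""
    for processor in processors:
        data = processor.process(data)
    return data
```

Each value is sent as one byte instead of a 32-bit float, at the cost of a rounding error of at most half a quantization step.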
Additional context
A. Polino, R. Pascanu, and D. Alistarh. Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668, 2018.
Is your feature request related to a problem? Please describe.
Currently there is only one implementation of local differential privacy (LDP): RAPPOR [1], implemented in https://github.com/TL-System/plato/blob/main/plato/utils/unary_encoding.py, and it is not decoupled from the algorithm implementations:
plato/plato/algorithms/mistnet.py
Lines 52 to 64 in fac44a6
plato/plato/algorithms/mindspore/mistnet.py
Lines 44 to 48 in fac44a6
plato/examples/nnrt/nnrt_algorithms/mistnet.py
Lines 60 to 65 in fac44a6
This feature request calls for a modular LDP plugin interface and a number of other methods, e.g., [2][3].
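A plugin interface of this kind could be as small as one abstract method. The sketch below is hypothetical: the class names are invented, and the concrete mechanism is plain binary randomized response applied independently to each bit, not RAPPOR or either of the methods in [2][3].

```python
import math
import random
from abc import ABC, abstractmethod


class LDPMechanism(ABC):
    """Hypothetical plugin interface: perturb a bit vector on the client."""

    @abstractmethod
    def perturb(self, bits):
        """Return a privatized copy of `bits`."""


class RandomizedResponse(LDPMechanism):
    """Keep each bit with probability e^eps / (e^eps + 1), flip it otherwise."""

    def __init__(self, epsilon, seed=None):
        self.keep_probability = math.exp(epsilon) / (math.exp(epsilon) + 1)
        self.rng = random.Random(seed)

    def perturb(self, bits):
        return [bit if self.rng.random() < self.keep_probability else 1 - bit
                for bit in bits]
```

An algorithm would then hold a reference to an LDPMechanism and call perturb() on the encoded features, so that swapping mechanisms does not touch the algorithm code.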
Describe the solution you'd like
Describe alternatives you've considered
To be filled.
Additional context
[1] Ú. Erlingsson, V. Pihur, and A. Korolova. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security, pages 1054–1067. ACM, 2014.
[2] Differential Privacy Team, Apple. Learning with privacy at scale. 2017.
[3] B. Ding, J. Kulkarni, and S. Yekhanin. Collecting telemetry data privately. In Advances in Neural Information Processing Systems 30, December 2017.
Describe the bug
The adaptive freezing example (examples/adaptive_freezing) does not work correctly with the CIFAR10 dataset and the ResNet18 model.
To Reproduce
Run python examples/adaptive_freezing.py, but with CIFAR10 and ResNet18 set in the .yml config file (examples/adaptive_freezing_CIFAR10_resnet18.yml).
Execution terminates with error:
line 114, in update_sync_mask
self.moving_average_deltas[name][indices], deltas[indices])
IndexError: too many indices for tensor of dimension 0
Expected behavior
Run completed successfully without errors.
servers/mistnet.py, line 111: self.accuracy = self.trainer.test(feature_dataset, Config().algorithm.cut_layer)
Here self.accuracy is None, even though the value returned by return accuracy at line 256 of trainers/trainer.py is not None.
Even when most of the samples in the dataset are used for local training, there exist major discrepancies between local and global test accuracies. The issue exists with WideResNet as the model, but other models may exhibit similar behaviour as this issue may not be model-specific.
Describe the bug
Running with a custom dataset requires tensorflow_datasets to be importable.
Screenshots
It seems that there are several unnecessary dependencies when a user configures their own dataset and loads it from base.Datasource.
Traceback (most recent call last):
File "aggregate.py", line 35, in <module>
run_server()
File "aggregate.py", line 29, in run_server
chooser=simple_chooser)
File "/home/lib/sedna/service/server/aggregation.py", line 325, in __init__
from plato.servers import registry as server_registry
File "/home/plato/plato/servers/registry.py", line 10, in <module>
from plato.servers import (
File "/home/plato/plato/servers/fedavg.py", line 11, in <module>
from plato.algorithms import registry as algorithms_registry
File "/home/plato/plato/algorithms/registry.py", line 23, in <module>
from plato.algorithms.tensorflow import (
File "/home/plato/plato/algorithms/tensorflow/fedavg.py", line 7, in <module>
from plato.datasources import registry as datasources_registry
File "/home/plato/plato/datasources/registry.py", line 19, in <module>
from plato.datasources.tensorflow import (
File "/home/plato/plato/datasources/tensorflow/mnist.py", line 5, in <module>
import tensorflow_datasets as tfds
ModuleNotFoundError: No module named 'tensorflow_datasets'
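The traceback shows the registry importing every framework-specific module eagerly at import time. A common way to avoid this kind of hard dependency is to import a module only when its entry is requested. The sketch below illustrates the idea with a hypothetical LazyRegistry class; the names and module paths are stand-ins, not Plato's real registry.

```python
import importlib
from typing import Dict


class LazyRegistry:
    """Hypothetical registry that maps a name to a module path and imports on demand."""

    def __init__(self, modules: Dict[str, str]):
        self._modules = modules

    def get(self, name: str):
        if name not in self._modules:
            raise ValueError(f"No such data source: {name}")
        try:
            # The import happens here, not when the registry itself is imported.
            return importlib.import_module(self._modules[name])
        except ModuleNotFoundError as err:
            raise ModuleNotFoundError(
                f"Data source '{name}' needs the optional package '{err.name}'"
            ) from err
```

With this layout, a user who never asks for a TensorFlow data source never triggers the import of tensorflow_datasets, and a user who does gets an error that names the missing optional package.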
OS environment (please complete the following information):
Is your feature request related to a problem? Please describe.
I found that the latest installation package for Plato on PyPI is plato-learn==0.2.4. After installing it via pip, two things confused me.
When installing with pip install "git+https://github.com/TL-System/plato.git", pip installs a very large number of packages. As a torch user, why does tensorflow2 need to be installed?
Describe the solution you'd like
Describe the bug
For examples/base_siamese in the personalizedFL sub-branch, I cannot define a custom model to be used as the input for both the client and the server. If the user passes an instantiated module, such as siamese_mnist_net.SiameseBase(), line 57 of plato/clients/simple.py is obviously not correct. If the user passes the module without instantiating it, such as siamese_mnist_net.SiameseBase, line 13 of plato/algorithms/fedavg.py uses the model without instantiation.
To Reproduce
Steps to reproduce the behavior:
1. Go to examples/base_siamese in the personalizedFL sub-branch.
2. Run python examples/base_siamese/base_siamese.py -c examples/base_siamese/base_siamese_MNIST.yml.
Expected behavior
The code is expected to run smoothly.
Additional context
The error is clear as the Client and the Server use different ways to register the model.
In the client, line 55 of plato/clients/simple.py registers the model as:
if self.custom_model is not None:
    self.model = self.custom_model()
    self.custom_model = None
However, in the server, line 119 of plato/servers/fedavg.py registers the model as:
if self.model is None and self.custom_model is not None:
    self.model = self.custom_model
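One way to make the two code paths agree would be to normalise custom_model in a single helper that accepts either a class or an instance. The function resolve_model below is a hypothetical sketch, not existing Plato code. It checks for a class rather than for a callable, because instantiated modules are themselves callable.

```python
import inspect


def resolve_model(custom_model):
    """Return a model instance whether a class or an instance was supplied."""
    if custom_model is None:
        return None
    if inspect.isclass(custom_model):
        # A class was passed (e.g. SiameseBase): instantiate it.
        return custom_model()
    # An instance was passed (e.g. SiameseBase()): use it as is.
    return custom_model
```

Both the client and the server could then assign self.model = resolve_model(self.custom_model), so that the same user input behaves identically on both sides.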
This may be problematic with virtual client IDs, used in simulated clients introduced in the current Plato release.
Describe the bug
In server/mistnet.py, the sampler class should be initialized with a dataset object that has the method num_train_examples(). After checking the code, I found that the current version passes the list of features directly to the sampler, which causes the error.
To Reproduce
./run --config=configs/MNIST/mistnet_lenet5.yml
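A possible workaround, sketched under the assumption that the sampler only needs num_train_examples() from the object it receives, is to wrap the feature list in a small adapter. The class name FeatureSource is hypothetical.

```python
class FeatureSource:
    """Hypothetical adapter exposing a list of features through the
    num_train_examples() method that the sampler expects."""

    def __init__(self, features):
        self.features = list(features)

    def num_train_examples(self):
        return len(self.features)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, index):
        return self.features[index]
```

The server would then pass FeatureSource(features) to the sampler instead of the raw list.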
datasets and zstd are not currently supported. Either we trace all their requirements and build them (not practical) or we work around it by other means.
There can be other undiscovered roadblocks for mobile clients.
Not sure if these requirements call for redesign of the project structure. I'm not experienced in mobile development either so there can be other issues.
Is your feature request related to a problem? Please describe.
When a client fails for whatever reason (most likely CUDA out of memory errors or other CUDA errors, including the CUDA unknown error), the server currently pauses indefinitely, waiting for the failed client to arrive.
This is sometimes what we need when debugging, but in large-scale production runs this is certainly not desirable, and it should be considered incorrect behaviour.
Describe the solution you'd like
We need a new configuration setting in the general section, called debug. When debug is false, the server should gracefully recover from a failed client by launching the client process again, and retrying the local training session with the same client ID that has previously failed. When debug is true, the server should print an error message, terminate itself, and terminate all processes. A developer can then use the -r command-line option to resume the federated learning session when needed.
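The requested behaviour can be sketched as a small control loop. Everything below is illustrative: run_client, ClientFailed and the max_retries cap are assumptions that do not appear in the request, and raising an exception stands in for the server terminating itself and all processes.

```python
from typing import Callable


class ClientFailed(RuntimeError):
    """Raised when the session should stop because a client could not finish."""


def run_client(train: Callable[[int], object], client_id: int,
               debug: bool, max_retries: int = 3):
    """Run one client's local training, retrying with the same ID unless debug is set."""
    attempt = 0
    while True:
        try:
            return train(client_id)
        except Exception as err:
            if debug:
                # debug is true: report the error and stop the whole session.
                raise ClientFailed(f"client {client_id} failed: {err}") from err
            attempt += 1
            if attempt > max_retries:
                # Assumed safety cap so a permanently broken client cannot loop forever.
                raise ClientFailed(
                    f"client {client_id} failed {attempt} times: {err}"
                ) from err
            # debug is false: relaunch with the same client ID.
```

In the usage below, a client that fails twice with a CUDA error and then succeeds completes normally when debug is false, and stops the run on the first failure when debug is true.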
Describe the bug
Datasets that are not shipped by torch and thus have to be manually downloaded (e.g., cinic10, multimodal_base, pascal_voc, and tiny_imagenet) are currently downloaded as a whole (i.e., the whole training and testing datasets) in the constructors of the respective DataSource instances.
While this design may function well in a testing environment where the server and all the clients are colocated on one machine, it may run into severe concurrency issues in other situations, such as the one described in Deploying a Plato Federated Learning Server in a Production Environment, which Plato also aims to support.
To see that, consider the two cases separately:
configure() method, and only when the call returns does the server spawn clients on the same machine. In this way, when clients call their configure() independently, none of them needs to download the dataset again, as it has already been prepared as a whole during the initialization of the server.
To Reproduce
This bug should conceptually make sense. We may provide the steps for reproducing it later, if necessary.
Additional context
We spotted this bug during the development of a new feature FEMNIST. Since the solution looks like a non-trivial design problem, we prefer seeking the authors' help before working out any immature solution.
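For discussion only, one conventional way to keep several processes on the same machine from downloading the same dataset at once is a lock file plus a completion marker. The helper ensure_downloaded and the file names .lock and .done are hypothetical. The sketch does not address clients on different machines and does not clean up a stale lock left behind by a killed process.

```python
import os
import time


def ensure_downloaded(data_dir, download, poll=0.1):
    """Call download(data_dir) at most once across cooperating processes."""
    os.makedirs(data_dir, exist_ok=True)
    done = os.path.join(data_dir, ".done")
    lock = os.path.join(data_dir, ".lock")
    while not os.path.exists(done):
        try:
            # O_EXCL makes creation atomic: only one process gets the lock.
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            time.sleep(poll)  # Another process is downloading; wait and re-check.
            continue
        try:
            if not os.path.exists(done):
                download(data_dir)
                open(done, "w").close()  # Mark the dataset as complete.
        finally:
            os.close(fd)
            os.remove(lock)
```

A DataSource constructor could call this helper instead of downloading unconditionally, so that later callers find the marker and return immediately.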
Is your feature request related to a problem? Please describe.
It is rather strange that the ResNet implementation in Plato, i.e., plato/models/resnet.py, only supports CIFAR-related datasets with input size 32. The pooling layer before the fc layer in Plato is F.avg_pool2d, which is less effective.
Maybe the current implementation is meant to support the 'cut_layer'?
Still, it is unnecessary to reimplement a model ourselves, as torchvision already provides many models. If we want to remove some layers, we can simply replace them with nn.Identity() without any risk.
Describe the solution you'd like
The most effective way is to follow the implementation of ResNet in torchvision, which uses:
self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
to support any input size.
I know the current implementation of ResNet replaces the 7x7 kernel of conv1 with a 3x3 kernel and removes the first max pooling layer to preserve spatial information for small inputs, such as 32x32 for CIFAR10.
However, this can easily be achieved with torchvision's implementation by setting:
encoder.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=2, bias=False)
encoder.maxpool = nn.Identity()
Describe alternatives you've considered
Just reuse torchvision's implementations.
Additional context
In my own work, I use torchvision's implementations directly and modify the models slightly to fit my own requirements. This works well on many different datasets without error. See models/encoders_register.py in the contrastive_adaptation branch.