tl-system / plato

A federated learning framework to support scalable and reproducible research

License: Apache License 2.0

Python 98.74% Shell 1.13% Dockerfile 0.13%

plato's Issues

[BUG] TypeError related to 'DataSource' object when running the split learning example

Describe the bug
A TypeError saying that object 'DataSource' has no len() and is not subscriptable was raised when I tried to run the split learning example

To Reproduce

Go to examples/split_learning
Execute split_learning.py
Encounter TypeError: object of type "DataSource" has no len()

Expected behavior
No error should be raised

Screenshots

Process Process-4:
Traceback (most recent call last):
  File "\Programs\Python\Python39\lib\multiprocessing\process.py", line 315, in _bootstrap
    self.run()
  File "\Programs\Python\Python39\lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "\plato-home\plato\trainers\basic.py", line 238, in train_process
    raise training_exception
  File "\plato-home\plato\trainers\basic.py", line 127, in train_process
    self.train_model(config, trainset, sampler.get(), cut_layer)
  File "\plato-home\plato\examples\split_learning\split_learning_trainer.py", line 38, in train_model
    iterations_per_epoch = np.ceil(len(trainset) / batch_size).astype(int)
  File "\plato-home\plato\plato\datasources\feature_dataset.py", line 10, in __len__
    return len(self.dataset)
TypeError: object of type 'DataSource' has no len()

OS environment:

OS: Windows 10 64-bit
Using Python 3.9
Using IDE PyCharm Community Edition 2021.3

Additional context
Possible Solution:
I resolved the first error by adding the code below to the class DataSource in the file plato/datasources/feature.py:

def __len__(self):
    return len(self.trainset)

The program then raised another TypeError, saying that the 'DataSource' object is not subscriptable, as follows:

Process Process-4:
Traceback (most recent call last):
  File "\Programs\Python\Python39\lib\multiprocessing\process.py", line 315, in _bootstrap
    self.run()
  File "\Programs\Python\Python39\lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "\plato-home\plato\trainers\basic.py", line 238, in train_process
    raise training_exception
  File "\plato-home\plato\trainers\basic.py", line 127, in train_process
    self.train_model(config, trainset, sampler.get(), cut_layer)
  File "\plato-home\examples\split_learning\split_learning_trainer.py", line 65, in train_model
    for __, (examples, labels) in enumerate(train_loader):
  File "\Users\venv\lib\site-packages\torch\utils\data\dataloader.py", line 521, in __next__
    data = self._next_data()
  File "\Users\venv\lib\site-packages\torch\utils\data\dataloader.py", line 561, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "\Users\venv\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "\Users\venv\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "\plato-home\plato\datasources\feature_dataset.py", line 13, in __getitem__
    image, label = self.dataset[item]
TypeError: 'DataSource' object is not subscriptable

So I added another function (shown below) to the class DataSource in plato/datasources/feature.py:

def __getitem__(self, item):
    return self.trainset[item]

The solution resolves the error, but I'm not sure whether this is the right way to go.
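The two tracebacks above come from a wrapper calling len() and [] on the object it wraps. A minimal, framework-free sketch of that interaction follows; both classes are hypothetical stand-ins written for illustration and are not Plato's actual DataSource or feature dataset classes.

```python
class DataSource:
    """Hypothetical stand-in for the feature data source (not Plato's class)."""

    def __init__(self, trainset):
        self.trainset = trainset

    def __len__(self):
        # Lets len(datasource) work, which the wrapper below relies on.
        return len(self.trainset)

    def __getitem__(self, item):
        # Lets datasource[item] work, which a data loader needs to fetch samples.
        return self.trainset[item]


class FeatureDataset:
    """Mirrors the wrapper in the traceback: it delegates to the wrapped object."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, item):
        image, label = self.dataset[item]
        return image, label


# Toy (feature, label) pairs standing in for real training data.
source = DataSource([("img0", 0), ("img1", 1), ("img2", 0)])
wrapped = FeatureDataset(source)
```

Without the two methods on DataSource, len(wrapped) and wrapped[1] fail with the same two TypeErrors reported above. Whether the wrapper should instead be handed the underlying trainset directly is a design question this sketch does not settle.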

[RFC]Android Clients

Development of an Android FL client to further enhance the simulation of FL on mobile devices.

Approach

Use of Chaquo to adapt the current Python code base to Android.
- Chaquo is not open source, but it provides a free license for open source projects.
- Chaquo is the only Python-to-Android tool that has PyTorch packaged.
- Building PyTorch for Android with other tools requires a significant amount of work.
Use of Redroid to support multiple instances of Android devices.
- Redroid is Android in a container, using the same kernel as the host.
- The performance of Redroid is close to that of the host, making multiple Android instances possible.
A separate log server to receive log entries from Android clients.
- There is no good way to directly extract log contents from an Android app.
- Using an HTTP log server and modifying the logging handler in clients can handle the logs nicely.
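The last point could be sketched with the standard library's logging.handlers.HTTPHandler. The class name ClientLogHandler, the host, the URL path, and the payload fields are all assumptions made for illustration; the RFC does not specify them.

```python
import logging
import logging.handlers


class ClientLogHandler(logging.handlers.HTTPHandler):
    """Hypothetical handler that posts each log record to an HTTP log server."""

    def __init__(self, host, url, client_id):
        # No connection is opened here; HTTPHandler connects when a record is emitted.
        super().__init__(host, url, method="POST")
        self.client_id = client_id

    def mapLogRecord(self, record):
        # Send only the fields a log server is likely to need (an assumption).
        return {
            "client_id": self.client_id,
            "level": record.levelname,
            "message": record.getMessage(),
        }


# Hypothetical address of the separate log server.
handler = ClientLogHandler("127.0.0.1:8080", "/log", client_id=3)
record = logging.LogRecord(
    "client", logging.INFO, "client.py", 1, "round %d done", (5,), None
)
payload = handler.mapLogRecord(record)
```

Attaching the handler with logging.getLogger().addHandler(handler) would route client log entries to the server once one is listening at that address.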

The split learning example was based on a previous implementation of Plato and stopped working.

Update Split Learning so that it works with the current release of Plato.

clients/base.py needs to be refactored to remove references to RL

The following lines of code in clients/base.py need to be refactored:

if 'fedrl' in data:
    # Update the number of local aggregation rounds
    Config().cross_silo = Config().cross_silo._replace(
        rounds=data['fedrl'])
    Config().training = Config().training._replace(
        rounds=data['fedrl'])

Models trained should be saved with unique model filename

When training models in a separate process in the trainers, a model filename should be unique to the client ID.
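One way to do this, as a sketch only: include the client ID (and, as an extra assumption here, the process ID) in the filename. The function name and the naming pattern are hypothetical and are not taken from Plato's code.

```python
import os


def unique_model_filename(model_name, client_id, suffix=".pth"):
    """Build a filename that differs per client and per training process."""
    # The process ID is an added assumption, guarding against concurrent runs.
    return f"{model_name}_{client_id}_{os.getpid()}{suffix}"


name_a = unique_model_filename("lenet5", 7)
name_b = unique_model_filename("lenet5", 8)
```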

[FR] Adjusting default values for configurations

The default values in configuration files can be better tuned so that most configuration files can be shorter. These default values should assume that a single server, with one or multiple GPUs, is used for most runs.

Some ideas:

server.comm_simulation should default to true.
A base path should be a configuration setting. data.data_path, server.model_dir, server.checkpoint_dir, and results.result_dir should all be relative to the base path.
server.ping_timeout and server.ping_interval should have sufficiently large default values.
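The base path idea can be sketched as below. The default directory names and the function resolve_paths are assumptions for illustration; only the four setting names come from the list above.

```python
import os

# Hypothetical default locations, all relative to the base path.
DEFAULTS = {
    "data.data_path": "data",
    "server.model_dir": "models",
    "server.checkpoint_dir": "checkpoints",
    "results.result_dir": "results",
}


def resolve_paths(base_path, overrides=None):
    """Resolve every path setting relative to base_path."""
    settings = {**DEFAULTS, **(overrides or {})}
    # os.path.join keeps an absolute override as-is, so users can still opt out.
    return {
        key: os.path.normpath(os.path.join(base_path, value))
        for key, value in settings.items()
    }


paths = resolve_paths("runs/exp1", {"results.result_dir": "out"})
```

With this shape, a configuration file only needs to state the base path and whichever directories deviate from the defaults.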

Occasionally had RuntimeError: unable to open shared memory object in read-write mode.

I occasionally get RuntimeError: unable to open shared memory object </torch_68252_3878038395> in read-write mode. Rerunning the same command may not reproduce the error.

[RFC] Found no advantages in model fusion methods

I used this framework to reproduce some algorithms that are meant to accelerate convergence under IID and non-IID data distributions, such as FedAdp, Fedatt, and FedProx, all of which are in your ./examples directory. I ran these algorithms and plotted the accuracy on the test dataset, and found that the curves are almost identical. I checked the code and found no bugs or execution problems. Could you show me some examples and settings in which these algorithms converge noticeably faster than FedAvg? I would greatly appreciate a brief explanation and some advice on reproducing these algorithms.

Best Regards.

[Bug] Privacy issue: dataset and metadata of dataset

Describe the bug

plato/plato/algorithms/tensorflow/fedavg.py

Lines 22 to 23 in f9a8f77

 datasource = datasources_registry.get() 

 self.model.build_model(datasource.input_shape())

Expected behavior
In federated learning, the server should not directly access clients' local data.
This code assumes the server can access all the data when it creates some objects. Maybe we can separate the dataset from its metadata (number of classes, labels, etc.).

Additional context
This is not a runtime bug.
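The suggested separation could look like the sketch below, where the server builds its model from metadata alone. DatasetMetadata, KNOWN_METADATA, and metadata_for are hypothetical names; this is not how Plato's datasources registry is actually organized.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DatasetMetadata:
    """Describes a dataset without holding any of its samples."""

    input_shape: Tuple[int, ...]
    num_classes: int


# Hypothetical registry of metadata the server may consult.
KNOWN_METADATA = {
    "MNIST": DatasetMetadata(input_shape=(1, 28, 28), num_classes=10),
    "CIFAR10": DatasetMetadata(input_shape=(3, 32, 32), num_classes=10),
}


def metadata_for(datasource_name):
    """Return metadata only; no data files are opened or downloaded."""
    return KNOWN_METADATA[datasource_name]


server_view = metadata_for("MNIST")
```

A call such as build_model(server_view.input_shape) would then need no access to the training samples themselves.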

[BUG]

We trained for 500 epochs and the test results were still zero. We then checked the code and found that NMS was not applied on line 310 of plato/trainers/yolo.py, and that the shape of 'pred' on line 312 differed from the shape of 'pred' in the original YOLOv5. After we retrained with NMS added, the results for 100 and 300 epochs are as follows:

URLError related to VOCsegmentation dataset.

[BUG] packet queue is empty, aborting

Describe the bug
I guess it is a mechanism, which detects some unused sockets and closes them.

To Reproduce
Steps to reproduce the behavior:

Run the mistnet-yolov5 example with several clients (4 in my case).
When the server uses the CPU to train on the feature dataset, it usually takes a long time, and some clients print the error message, which looks like a timeout.

Expected behavior
[INFO][05:12:15]: [Client # 4] Contacting the central server.
[INFO][05:12:15]: [Client # 4] Connecting to the server at http://172.18.0.2:30496.
[INFO][05:12:15]: [Client # 4] Connected to the server.
[INFO][05:12:15]: [Client # 4] Waiting to be selected.
packet queue is empty, aborting
[ERROR][06:12:20]: packet queue is empty, aborting

OS environment (please complete the following information):

Ubuntu
20.04

Add Support for State Heterogeneity [FR]

Is your feature request related to a problem? Please describe.

In cross-device practice, clients will join and leave from time to time, instead of staying online all the time. Moreover, their behaviors differ from one another, which is referred to as state heterogeneity in the literature [1, 2, 3]. We are thus interested in enhancing Plato with the support for emulating/simulating clients' variability in the available status.

Describe the solution you'd like

From our perspective, clients' dropout needs to be implemented in a programmatic way. For example, we can introduce the use of a predefined trace, of which each entry indicates the transition time of a particular client's availability. For example, an entry may look like

[client_id]: [starting status], [transition time (in seconds) t1, t2, ...]
e.g., 1: on, [54, 890, 2042, ...]

Of course, the trace can also be generated in a programmatic way, e.g., with each client's arrival following a Poisson distribution, if appropriate. Thus, we anticipate that the generation of the availability traces can be added as a library in ~/packages.
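A sketch of such a generator follows, assuming transitions form a Poisson process, so the gaps between them are exponentially distributed. The function name, the parameters rate and horizon, and the random initial status are illustrative assumptions.

```python
import random


def generate_trace(num_clients, rate, horizon, seed=1):
    """Generate {client_id: (starting_status, [transition times in seconds])}."""
    rng = random.Random(seed)
    trace = {}
    for client_id in range(1, num_clients + 1):
        times = []
        now = 0.0
        while True:
            # Exponential gaps between events give a Poisson process.
            now += rng.expovariate(rate)
            if now >= horizon:
                break
            times.append(now)
        # Assumption: each client starts on or off with equal probability.
        status = rng.choice(["on", "off"])
        trace[client_id] = (status, times)
    return trace


# On average one transition every 100 seconds, over a 3000-second run.
trace = generate_trace(num_clients=5, rate=0.01, horizon=3000)
```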

Then, with compatible parsing logic, Plato's runtime can be informed of clients' availability dynamics, and act accordingly. Specifically,

For emulation mode, instead of performing start_clients() in one go at the very beginning, we may want to make the action of client starting one kind of event that is triggered by time, following the loaded trace. Similarly, client shutdown may also be regarded as a time-driven event.
For simulation mode, it should be as simple as varying the set of selectable clients according to the trace, i.e., the major change of code might take place at the select_clients() or choose_clients() part.
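For the simulation mode, the filtering step could be sketched as follows. It parses entries in the format shown earlier and assumes that every transition time flips a client between on and off; the function names are hypothetical and do not correspond to Plato's select_clients() or choose_clients().

```python
import ast


def parse_entry(line):
    """Parse '1: on, [54, 890, 2042]' into (client_id, status, times)."""
    client_part, rest = line.split(":", 1)
    status, times = rest.split(",", 1)
    return int(client_part), status.strip(), ast.literal_eval(times.strip())


def is_available(status, transitions, now):
    """Assume each transition time flips the client's availability."""
    flips = sum(1 for t in transitions if t <= now)
    online = status == "on"
    return online if flips % 2 == 0 else not online


def selectable_clients(trace_lines, now):
    """Return the IDs of clients that are online at time `now`."""
    selectable = []
    for line in trace_lines:
        client_id, status, transitions = parse_entry(line)
        if is_available(status, transitions, now):
            selectable.append(client_id)
    return selectable


lines = ["1: on, [54, 890, 2042]", "2: off, [10]", "3: on, []"]
online_at_60 = selectable_clients(lines, now=60)
```

The server would then sample its per-round clients from this list instead of from all clients.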

Major concern: it looks like we have to make a groundbreaking change, e.g., adding a new module as we do not think that this logic is a new kind of server/client/algorithm... We would like to listen to the authors' opinions if any. Thanks!

Additional context

[1] C. Yang, Q. Wang, M. Xu, Z. Chen, K. Bian, Y. Liu, and X. Liu, Characterizing Impacts of Heterogeneity in Federated Learning upon Large-Scale Smartphone Data, in WWW, 2021.
[2] F. Lai, Y. Dai, X. Zhu, and M. Chowdhury, FedScale: Benchmarking Model and System Performance of Federated Learning, in arXiv:2105.11367, 2021.
[3] Z. Jiang and W. Wang, System Optimization in Synchronous Federated Training: A Survey, in arXiv:2109.03999, 2021.

Issues related to ./running_trainers

When running experiments related to cross-silo FL, e.g., using fedavg_cross_silo or fedrl, I occasionally get a ValueError:
Traceback (most recent call last):
File "/Users/user/opt/miniconda3/envs/federated/lib/python3.8/site-packages/websockets/server.py", line 191, in handler
await self.ws_handler(self, path)
File "/Users/user/Documents/GitHub/fl-plato/plato/servers/base.py", line 117, in serve
await self.process_reports()
File "/Users/user/Documents/GitHub/fl-plato/plato/servers/fedavg.py", line 150, in process_reports
self.accuracy = self.trainer.test(self.testset)
File "/Users/user/Documents/GitHub/fl-plato/plato/trainers/trainer.py", line 298, in test
self.pause_training()
File "/Users/user/Documents/GitHub/fl-plato/plato/trainers/base.py", line 38, in pause_training
trainer_count = int(file.read())
ValueError: invalid literal for int() with base 10: ''

It seems that ./running_trainers may sometimes be empty, or that several processes reading it at the same time could cause this problem.

Also, an ID may need to be added to this file name in case several experiments are running at the same time.
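A sketch of one possible mitigation, under stated assumptions: write the counter to a temporary file and swap it into place with os.replace, so a reader never observes a half-written (empty) file, and put an experiment ID in the file name. All function names are hypothetical. This does not make the read-increment-write sequence atomic; that would still need a lock.

```python
import os
import tempfile


def counter_path(experiment_id, directory="."):
    """One counter file per experiment, so concurrent experiments do not clash."""
    return os.path.join(directory, f"running_trainers_{experiment_id}")


def write_count(path, count):
    """Write to a temporary file, then swap it in with os.replace."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp_file:
        tmp_file.write(str(count))
    os.replace(tmp_path, path)


def read_count(path):
    """Treat a missing file as zero running trainers."""
    try:
        with open(path) as file:
            return int(file.read())
    except FileNotFoundError:
        return 0
```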

[BUG]

The test results of mistnet_YoloV5 and fedavg_yolov5 are all zero

[RFC] Configuration of attribute in FedAsync

Hi there,

The attribute per_round is set to 20 in examples/fedasync/fedasync_MNIST_lenet5.yml.
I thought there is no client selection mechanism in FedAsync. (paper: Asynchronous Federated Optimization)
Is my understanding wrong?

Thanks.

[BUG] Opacus 1.1.2 ceased to work with the differential privacy support in Plato.

Describe the bug
If I run the program with differential privacy, then there is a bug.

To Reproduce

Uncomment the lines following whether to apply differential privacy in configs/EMNIST/fedavg_lenet5.yml
Execute ./run -c configs/EMNIST/fedavg_lenet5.yml

Expected behavior
The program should run smoothly without bug (I did not get any error before).

Screenshots

OS environment (please complete the following information):

OS: Ubuntu Linux

[BUG] AttributeError: fedavg_cs.Server has no 'do_edge_test'

Describe the bug
In __init__() of class Server in file plato/servers/fedavg_cs.py, the function uses attribute self.do_edge_test at line 38 which is not defined previously.

To Reproduce
Steps to reproduce the behavior:

Run any of the following examples that use the class fedavg_cs.Server:

examples/axiothea/axiothea.py
examples/cs_maml/cs_maml.py
examples/rhythm/rhythm.py
examples/tempo/tempo.py

See AttributeError

Expected behavior
No error should be raised when running these examples

Screenshots

[INFO][15:51:14]: [Server #18644] Started training on 1 clients with 1 per round.
Traceback (most recent call last):
  File "\plato-project-home\examples\tempo\tempo.py", line 19, in <module>
    main()
  File "\plato-project-home\examples\tempo\tempo.py", line 11, in main
    server = tempo_server.Server()
  File "\plato-project-home\examples\tempo\tempo_server.py", line 20, in __init__
    super().__init__()
  File "\plato-project-home\plato\servers\fedavg_cs.py", line 38, in __init__
    and not self.do_edge_test):
AttributeError: 'Server' object has no attribute 'do_edge_test'

OS environment:

OS: Windows 10 64-bit
Using Python 3.9
Using IDE PyCharm Community Edition 2021.3

Additional context
none

[BUG] URLError when running adaptive freezing example with CIFAR10 ResNet18

Describe the bug
A URLError (SSL: certificate verify failed) was raised when I tried to run the adaptive freezing example with CIFAR10 ResNet18.

To Reproduce

Change line 17 of the file examples/adaptive_freezing/adaptive_freezing.py to:
os.environ['config_file'] = 'adaptive_freezing_CIFAR10_resnet18.yml'
Ensure that this is the first time running this example; i.e., no content has been downloaded and the directories data/ and model/ have not yet been created under examples/adaptive_freezing/
Execute adaptive_freezing.py
Encounter URLError, SSL certificate verify failed, when the program attempts to download the dataset and model

Expected behavior
No error should be raised and the content should be downloaded successfully

OS environment (please complete the following information):

OS: Windows 10 64-bit
Using Python 3.9
Using IDE PyCharm Community Edition 2021.3

Additional context
Possible Solution:
I resolved the error by adding the code below to the file plato/datasources/cifar10.py:

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

I think the code bypasses the error by using the unverified SSL. I'm not sure whether that is a good solution.
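An alternative that keeps certificate verification enabled is sketched below. It assumes the download goes through urllib.request and that the failure comes from a missing or outdated CA bundle on the machine; neither is confirmed in this report. The function name and the cafile argument (a path to a CA bundle) are illustrative.

```python
import ssl
import urllib.request


def build_verified_opener(cafile=None):
    """Build a urllib opener whose HTTPS context still verifies certificates."""
    # With cafile=None the system's default CA certificates are loaded.
    context = ssl.create_default_context(cafile=cafile)
    opener = urllib.request.build_opener(
        urllib.request.HTTPSHandler(context=context)
    )
    return context, opener


context, opener = build_verified_opener()
```

Calling urllib.request.install_opener(opener) before the dataset download would make later urlopen calls use this context, with verification still on.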

DataLoader failed in trainer with one client and 60000 samples.

When using FashionMNIST and LeNet-5 as the model, the DataLoader would fail with numSamples = 0 when one client is used with all 60000 samples.

[BUG] Ratio of stable parameters is always 0 in Adaptive Freezing

Describe the bug
The adaptive freezing example (examples/adaptive_freezing) always shows 0% stable parameters,
even with an extremely large stability_threshold value (e.g., 0.9) and random_freezing turned on.
I think that all parameters on the selected workers are probably misidentified as unstable, and are therefore sent to the server to join training.

To Reproduce
Steps to reproduce the behavior:

In adaptive_freezing_CIFAR10_resnet18.yml, set rounds: 50, stability_threshold: 0.9, random_freezing: true
python adaptive_freezing.py --config=./adaptive_freezing_CIFAR10_resnet18.yml

Expected behavior
The stable parameters should be correctly identified and excluded from transmission.

OS environment (please complete the following information):

OS: Ubuntu 18.04
Environment:
- PyTorch 1.10.1
- Python 3.9
- CUDA 11.3

Additional context
After tracing the code, I think that inside adaptive_freezing_algorithm.py, the function update_sync_mask() should be:

self.wake_up_round[name][indices] = self.current_round + self.frozen_durations[name][indices]
self.sync_mask[name] = (self.wake_up_round[name] >= self.current_round)

This is because sync_mask[name] == 1 indicates active parameters, which is the opposite of M_is-frozen in your paper.

However, after this modification, the Algorithm object on the server side would get a different sync_mask[name] from the one on the client side, which would cause a RuntimeError.
In your version, sync_mask[name] stays true for all elements on both the server and the client, which may be why no error is raised.

[BUG] Opacus 1.1.2 could not work on ResNet-18 and VGG-16

Describe the bug
If I run the program on ResNet-18 and VGG-16 with differential privacy, then there is a bug?

To Reproduce
Steps to reproduce the behavior:

Enable differential privacy in configs/CIFAR10/fedavg_vgg16.yml or configs/CIFAR10/fedavg_resnet18.yml.
Execute ./run -c configs/CIFAR10/fedavg_vgg16.yml or ./run -c configs/CIFAR10/fedavg_resnet18.yml

Expected behavior
The program is expected to run smoothly without bug.

OS environment (please complete the following information):

OS: Ubuntu Linux

[BUG] Executing error in Fedasync

Describe the bug

Command: python fedasync.py -c fedasync_MNIST_lenet5.yml in folder /plato/examples/fedasync.
The following error messages appear after a while, and the training session is stuck.
What is the problem?

Error Messages

[INFO][22:17:53]: [Client #95] Sent 0.25 MB of payload data to the server (simulated).
[INFO][22:17:53]: [Server #3351624] Received 0.25 MB of payload data from client #95 (simulated).
[INFO][22:17:53]: [Server #3351624] Adding client #3 to the list of clients for aggregation.
[INFO][22:17:53]: [Server #3351624] Aggregating 1 clients in total.
[ERROR][22:17:53]: Task exception was never retrieved
future: <Task finished name='Task-243' coro=<AsyncServer._handle_event_internal() done, defined at /home/xuan/anaconda3/envs/plato/lib/python3.9/site-packages/socketio/asyncio_server.py:521> exception=ValueError('too many values to unpack (expected 3)')>
Traceback (most recent call last):
  File "/home/xuan/anaconda3/envs/plato/lib/python3.9/site-packages/socketio/asyncio_server.py", line 523, in _handle_event_internal
    r = await server._trigger_event(data[0], namespace, sid, *data[1:])
  File "/home/xuan/anaconda3/envs/plato/lib/python3.9/site-packages/socketio/asyncio_server.py", line 568, in _trigger_event
    return await self.namespace_handlers[namespace].trigger_event(
  File "/home/xuan/anaconda3/envs/plato/lib/python3.9/site-packages/socketio/asyncio_namespace.py", line 37, in trigger_event
    ret = await handler(*args)
  File "/home/xuan/anaconda3/envs/plato/lib/python3.9/site-packages/plato/servers/base.py", line 48, in on_client_report
    await self.plato_server.client_report_arrived(sid, data['id'],
  File "/home/xuan/anaconda3/envs/plato/lib/python3.9/site-packages/plato/servers/base.py", line 692, in client_report_arrived
    await self.process_client_info(client_id, sid)
  File "/home/xuan/anaconda3/envs/plato/lib/python3.9/site-packages/plato/servers/base.py", line 784, in process_client_info
    await self.process_clients(client_info)
  File "/home/xuan/anaconda3/envs/plato/lib/python3.9/site-packages/plato/servers/base.py", line 913, in process_clients
    await self.process_reports()
  File "/media/massstorage/xuan/plato/examples/fedasync/fedasync_server.py", line 57, in process_reports
    __, __, client_staleness = self.updates[0]
ValueError: too many values to unpack (expected 3)

OS environment (please complete the following information):

OS: Ubuntu Linux
Version : 20.04.4 LTS

Thanks.

[BUG] Asynchronous mode does not proceed in the initial round

Describe the bug
When using the asynchronous mode, the server will not be able to proceed in the initial round, waiting forever for clients that will never arrive.

To Reproduce

Use the following configuration file with the ./run -c command:

clients:
    # Type
    type: simple

    # The total number of clients
    total_clients: 500

    # The number of clients selected in each round
    per_round: 50

    # Should the clients compute test accuracy locally?
    do_test: false

    # Whether client heterogeneity should be simulated
    speed_simulation: true

    # The distribution of client speeds
    simulation_distribution:
        distribution: pareto
        alpha: 1

    # The maximum amount of time for clients to sleep after each epoch
    max_sleep_time: 30

    # Should clients really go to sleep, or should we just simulate the sleep times?
    sleep_simulation: false

    # If we are simulating client training times, what is the average training time?
    avg_training_time: 20

    random_seed: 1

server:
    address: 127.0.0.1
    port: 8000

    ping_timeout: 36000
    ping_interval: 36000

    # Should we operate in synchronous mode?
    synchronous: false

    # Should we simulate the wall-clock time on the server? Useful if max_concurrency is specified
    simulate_wall_time: true

    # What is the minimum number of clients that need to report before aggregation begins?
    minimum_clients_aggregated: 15

    # What is the staleness bound, beyond which the server should wait for stale clients?
    staleness_bound: 10

    # Should we send urgent notifications to stale clients beyond the staleness bound?
    request_update: false

    random_seed: 1

data:
    # The training and testing dataset
    datasource: MNIST 

    # Number of samples in each partition
    partition_size: 600 

    # IID or non-IID?
    sampler: noniid

    # The concentration parameter for the Dirichlet distribution
    concentration: 0.5

    # The random seed for sampling data
    random_seed: 1

trainer:
    # The type of the trainer
    type: basic 

    # The maximum number of training rounds
    rounds: 5

    # The maximum number of clients running concurrently
    max_concurrency: 3

    # Number of epochs for local training in each communication round
    epochs: 1
    batch_size: 32
    optimizer: SGD
    learning_rate: 0.01
    momentum: 0.9
    weight_decay: 0.0

    # The machine learning model
    model_name: lenet5 

algorithm:
    # Aggregation algorithm
    type: fedavg

[FR] A more effective support for the learning rate schedule

Is your feature request related to a problem? Please describe.
The learning rate (lr) schedule does not work in the current implementation of Plato's trainer, i.e., plato/trainers/basic.py, so training effectively runs with a constant learning rate. The main reason is that the trainer creates a new lr schedule in each round, so the epoch count within a round always starts from 1. If the lr schedule is driven by this local epoch, the lr is never adjusted the way it would be in centralized training. For example, with an initial learning rate of 0.1 and 5 local epochs, the learning rate stays at 0.1 throughout the whole training process (100 communication rounds), no matter which lr schedule is used.

Anyone can reproduce this potential issue by tracking the learning rate in each round.

Describe the solution you'd like

However, in general machine learning practice, the learning rate is decreased as the number of training epochs increases. I believe that this is why Plato supports a learning rate schedule in the trainer.

It would be better to run a training process in which the scheduler decays the client's learning rate when the training loss plateaus.

Describe alternatives you've considered

Therefore, to introduce this property (a learning rate that changes with the training status) to Plato's training framework, the simplest way is to define a variable computed as:

lr_schedule_base_epoch = (current_round - 1) * epochs

Then, the defined lr_schedule can be updated as:

lr_schedule.step(lr_schedule_base_epoch + epoch) in the local training stage.

This lets Plato vary the learning rate as the training (communication round) progresses.
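The effect of the offset can be checked with a small framework-free sketch. The step decay rule, step_size, and gamma are assumptions chosen for illustration; only the formula (current_round - 1) * epochs comes from the proposal above.

```python
def step_decay_lr(initial_lr, global_epoch, step_size, gamma):
    """Step decay: multiply the rate by gamma once every step_size epochs."""
    return initial_lr * gamma ** (global_epoch // step_size)


def learning_rates_for_round(current_round, epochs, initial_lr=0.1,
                             step_size=10, gamma=0.5):
    """Learning rate used at each local epoch of a communication round."""
    # Offset local epochs by the epochs already completed in earlier rounds.
    lr_schedule_base_epoch = (current_round - 1) * epochs
    return [
        step_decay_lr(initial_lr, lr_schedule_base_epoch + epoch,
                      step_size, gamma)
        for epoch in range(1, epochs + 1)
    ]


first_round = learning_rates_for_round(current_round=1, epochs=5)
third_round = learning_rates_for_round(current_round=3, epochs=5)
```

With 5 local epochs, round 1 covers global epochs 1 to 5 and stays at 0.1, while round 3 covers global epochs 11 to 15 and runs at 0.05. Without the offset, every round would see epochs 1 to 5 and never decay.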

Additional context

I have implemented this lr schedule behavior in my branch, as a changeable lr is important to my model training and to most machine learning methods. Without a changeable learning rate, it is hard to beat state-of-the-art methods that use a dynamic learning rate. I am not sure whether this is necessary for others, but it is a valuable feature that could be added to Plato.

[RuntimeError]: expected scalar type Half but found Float

Describe the bug
The bug happens during the training process on the client. The information is "expected scalar type Half but found Float".

To Reproduce
Steps to reproduce the behavior in CPU-only cluster:

./run -c configs/YOLO/fedavg_yolov5.yml
./run -c configs/YOLO/mistnet_yolov5.yml <- happens at this phase.
Output shows:

[INFO][09:27:57]: Training on client #0 failed.
Process Process-2:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "plato/plato/trainers/basic.py", line 199, in train_process
    raise training_exception
  File "plato/plato/trainers/basic.py", line 109, in train_process
    self.train_model(config, trainset, sampler.get(), cut_layer)
  File "plato/plato/trainers/yolo.py", line 182, in train_model
    pred = self.model.forward_from(imgs, cut_layer)
  File "plato/plato/models/yolo.py", line 67, in forward_from
    x = m(x)  # run
  File "lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "lib/python3.8/site-packages/yolov5/models/common.py", line 138, in forward
    return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
  File "lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "lib/python3.8/site-packages/yolov5/models/common.py", line 42, in forward
    return self.act(self.bn(self.conv(x)))
  File "lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: expected scalar type Half but found Float

OS environment (please complete the following information):

OS: Ubuntu
Version 20.04

[BUG] NonIID partition issues

For the non-IID Dirichlet partition, it seems that each client gets a WeightedRandomSampler with Dirichlet-distributed weights to sample data from the whole dataset, instead of a truly partitioned local dataset.

As a result, each client may see new data samples that it has never seen before. This may contradict the non-IID setting, in which a client can never see other clients' samples.
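For contrast, a disjoint Dirichlet partition can be sketched as below: the indices of each class are split among clients in Dirichlet proportions, so no sample is shared. This is a generic illustration using only the standard library, not Plato's sampler; the function names are hypothetical.

```python
import random
from collections import defaultdict


def dirichlet_proportions(num_clients, concentration, rng):
    """Sample from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(num_clients)]
    total = sum(draws)
    return [draw / total for draw in draws]


def partition_by_dirichlet(labels, num_clients, concentration, seed=1):
    """Split sample indices into disjoint per-client lists, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    partitions = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        proportions = dirichlet_proportions(num_clients, concentration, rng)
        start, cumulative = 0, 0.0
        for client, share in enumerate(proportions):
            cumulative += share
            if client == num_clients - 1:
                end = len(indices)  # the last client takes the remainder
            else:
                end = min(len(indices), round(cumulative * len(indices)))
            partitions[client].extend(indices[start:end])
            start = end
    return partitions


labels = [i % 10 for i in range(1000)]  # toy labels with 10 classes
parts = partition_by_dirichlet(labels, num_clients=5, concentration=0.5)
```

Every index lands in exactly one client's list, so a client can then sample only from its own partition.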

[RFC] Multiple servers for keeping a large model.

Federated learning is widely used in several fields, e.g., CV and NLP.
In some cases, the model can have billions of parameters. Plato should also support multiple servers and use model partitioning methods (sharding) to distribute the workload.
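As one illustration of sharding, the sketch below assigns named parameter tensors to servers greedily by size, largest first. The function name and the greedy rule are assumptions for illustration; the RFC does not prescribe a partitioning method.

```python
def shard_parameters(param_sizes, num_servers):
    """Assign each named parameter to the currently least-loaded server."""
    shards = [{"names": [], "size": 0} for _ in range(num_servers)]
    # Placing the largest tensors first keeps the shards roughly balanced.
    ordered = sorted(param_sizes.items(), key=lambda item: item[1], reverse=True)
    for name, size in ordered:
        target = min(shards, key=lambda shard: shard["size"])
        target["names"].append(name)
        target["size"] += size
    return shards


# Toy layer names with their parameter counts.
sizes = {"embed": 4, "block1": 3, "block2": 2, "head": 1}
shards = shard_parameters(sizes, num_servers=2)
```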

[BUG] IndexError when running examples/fei/fei.py

Describe the bug
When running examples/fei/fei.py with the config fei_FashionMNIST_lenet5.yml, an IndexError was raised in rl_server.py when the server tried to perform federated averaging in the first round.

To Reproduce
Steps to reproduce the behavior:

Run python examples/fei/fei.py -c fei_FashionMNIST_lenet5.yml
Wait until the server does the first round of federated averaging
Encounter IndexError

Expected behavior
No error should be raised

Screenshots
The following snippet is the Traceback of the error

[INFO][14:40:48]: [Server #5476] All 10 client report(s) received. Processing.
[INFO][14:40:48]: [RL Agent] Preparing action...
[INFO][14:40:48]: [RL Agent] Selecting action...
[ERROR][14:40:48]: Task exception was never retrieved
...
Traceback (most recent call last):
...
  File "path-to-plato\plato\plato\servers\fedavg.py", line 127, in aggregate_weights
    update = await self.federated_averaging(updates)
  File "path-to-plato\plato\plato\utils\reinforcement_learning\rl_server.py", line 84, in federated_averaging
    avg_update[name] += delta * self.smart_weighting[i][0]
IndexError: invalid index to scalar variable.

OS environment (please complete the following information):

OS: Windows
Version: Python 3.9
Using PyCharm IDE Community Edition 2021.3

Additional context
I tried to remove the [0] at the end of line 84 in rl_server.py and the program seems to proceed normally without errors, but I checked for the value of self.smart_weighting and it is always a vector of ten 0.1 for each round. I'm unsure whether that is the expected behavior.

MemoryError

Running MNIST/mistnet_resnet18.yaml (modified from mistnet_lenet5.yml) runs out of memory (32 GB).
servers/base.py Line 105

[BUG]The previous model weights are not loaded at the beginning of each round (for differential privacy).

Describe the bug
The previous model weights are not loaded at the beginning of each round. In each round, training starts from a randomly initialized model rather than from the previous model (with differential privacy enabled).

It seems that the problem is not in the train_model function in diff_privacy.py. I printed the weights of the first convolution layer before self.make_model_private() in train_model in diff_privacy.py for each round, and the weights are always
conv1.weight tensor([[[[ 0.0702, 0.0594, 0.1692, -0.0529, -0.1473],
[ 0.0173, -0.1123, -0.0269, -0.1801, -0.0909],......

To Reproduce
Steps to reproduce the behavior:

Enable differential privacy in fedavg_lenet5.yml
Set total_clients and per_round as 1
Execute ./run -c ./configs/EMNIST/fedavg_lenet5.yml

Expected behavior
At the beginning of each round, the model should load the model weights at the end of the previous round.

Screenshots
At the end of the first round (the loss becomes smaller around 3.2)

At the beginning of the second round (the loss is back to 3.85, which is very likely the result of a reinitialized model)

OS environment (please complete the following information):

OS: Ubuntu Linux

[BUG] `tests/config_tests.py` `tests/data_tests.py` and `tests/sampler_tests.py` failed due to missing configuration files

Describe the bug
Running tests/config_tests.py, tests/data_tests.py, and tests/sampler_tests.py failed due to missing configuration files:

FileNotFoundError: [Errno 2] No such file or directory: '/home/bli/plato/configs/Kinetics/Models/kinetics_full_models.yml'

To Reproduce

python tests/config_tests.py
python tests/sampler_tests.py
python tests/data_tests.py

Expected behavior

These unit tests are expected to pass.

OS environment (please complete the following information):

OS: Ubuntu Linux
Version: 20.04 LTS

Additional context

PyLint also reports errors when statically checking this code, such as attributes defined outside __init__, variable names that do not conform to the snake_case naming convention in PEP 8, and missing method docstrings. These PyLint errors should be fixed as far as possible.

sqlite3.OperationalError: database is locked

Got this error when running on Compute Canada with more than 20 clients.
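A common mitigation for this error is to give each connection a longer lock timeout and, where the filesystem allows it, to enable write-ahead logging. The sketch below uses only the standard sqlite3 module. The helper `open_shared_db` and the `results` table are hypothetical, and where Plato opens its database is not shown in this report. Note that SQLite's WAL mode is documented as not working over a network filesystem, which may matter on a cluster, so the timeout alone may be the safer change there.

```python
import os
import sqlite3
import tempfile


def open_shared_db(path, timeout_seconds=30.0):
    """Open a SQLite database that several client processes may write to."""
    # `timeout` is how long a connection waits for a lock to clear before
    # raising "database is locked" (the sqlite3 default is 5 seconds).
    conn = sqlite3.connect(path, timeout=timeout_seconds)
    # Write-ahead logging lets readers proceed while one writer is active.
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


# Demo on a throwaway database; the table layout is an assumption.
with tempfile.TemporaryDirectory() as tmp:
    conn = open_shared_db(os.path.join(tmp, "demo.db"))
    conn.execute("CREATE TABLE IF NOT EXISTS results (client_id INTEGER, accuracy REAL)")
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO results VALUES (?, ?)", (1, 0.9))
    row_count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
    conn.close()
```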

Convergence failed for the WideResNet model

With the current configuration for the WideResNet model, training does not converge. Things that need to be added or fixed:

Add a learning rate schedule in addition to the optimizer;
Review (and compare) the training procedure in https://github.com/xternalz/WideResNet-pytorch/.
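As an illustration of the first item, a step-decay schedule can be written as a pure function of the epoch. This is only a sketch: the base rate, milestones, and decay factor below are assumed values for illustration, not the settings of the referenced repository or of Plato.

```python
def step_decay_lr(epoch, base_lr=0.1, milestones=(60, 120, 160), gamma=0.2):
    """Learning rate for `epoch` under a step-decay schedule.

    The rate is multiplied by `gamma` once for every milestone already reached.
    All default values are assumptions chosen for illustration.
    """
    milestones_passed = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** milestones_passed


for epoch in (0, 59, 60, 120, 160):
    print(epoch, step_decay_lr(epoch))
```

In PyTorch, torch.optim.lr_scheduler.MultiStepLR provides the same behaviour when attached to an optimizer.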

[BUG] cannot assign the custom trainer to Algorithm, Client, and Server simultaneously.

Describe the bug
If users want to implement their own Algorithm, Trainer, and Server classes, there is no way to instantiate all three following the existing patterns, whether the Trainer is passed as a class or as an instance:

    trainer = fedrep_trainer.Trainer
    algorithm = fedrep_algorithm.Algorithm(trainer=trainer)
    client = fedrep_client.Client(algorithm=algorithm, trainer=trainer)
    server = fedrep_server.Server(algorithm=algorithm, trainer=trainer)

    trainer = fedrep_trainer.Trainer()
    algorithm = fedrep_algorithm.Algorithm(trainer=trainer)
    client = fedrep_client.Client(algorithm=algorithm, trainer=trainer)
    server = fedrep_server.Server(algorithm=algorithm, trainer=trainer)

To Reproduce
Steps to reproduce the behavior:
Three examples that implement their own algorithms trigger this bug.

See examples/adaptive_freezing or examples/split_learning in the main branch, or examples/fedrep in the personalizedFL branch
Run the code; for example, for the fedrep method: python examples/fedrep/fedrep.py -c examples/fedrep/fedrep_MNIST_lenet5.yml
See the error AttributeError: type object 'Trainer' has no attribute 'model'

Expected behavior
The code is expected to run smoothly without this error at the starting point.

Additional context
This bug is easy to locate: the Algorithm receives an instantiated trainer as its parameter and then extracts the trainer's model. The Client and Server classes, however, receive the Trainer class as the parameter and instantiate self.trainer internally, as shown by line 120 of servers/fedavg.py: self.trainer = self.custom_trainer(model=self.model).
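The mismatch can be reproduced outside Plato with stand-in classes. In this minimal sketch, `Trainer`, `Algorithm`, and `Server` are simplified stand-ins that only mirror the behaviour described in this report (an algorithm that reads trainer.model, and a server that calls the trainer it was given). They are not the real Plato classes.

```python
class Trainer:
    """Stand-in for a custom trainer; `model` exists only on instances."""

    def __init__(self, model=None):
        self.model = model if model is not None else "lenet5"


class Algorithm:
    """Stand-in for an algorithm that reads the model from a trainer instance."""

    def __init__(self, trainer):
        self.model = trainer.model


class Server:
    """Stand-in for a server that expects a trainer class and instantiates it."""

    def __init__(self, trainer):
        self.custom_trainer = trainer
        self.trainer = self.custom_trainer(model="lenet5")


# Passing the class satisfies the server but breaks the algorithm.
try:
    Algorithm(trainer=Trainer)
    class_error = None
except AttributeError as error:
    class_error = str(error)

# Passing an instance satisfies the algorithm but breaks the server.
try:
    Server(trainer=Trainer())
    instance_error = None
except TypeError as error:
    instance_error = str(error)

print(class_error)
print(instance_error)
```

Whichever form is chosen, one of the two consumers fails, which is why no single `trainer` value works for all of them.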

Add general data transfer protocol between client and server [FR]

Is your feature request related to a problem? Please describe.
In cross-platform deployments, the client and the server may well use different AI frameworks. In my case, the client runs on an Atlas500, which supports no AI framework other than NNRT. NNRT has no concept of a tensor and can only transfer extracted features as NumPy arrays. On the server side, however, frameworks such as PyTorch expect tensors as input, so the data type has to be converted for each framework.

Describe the solution you'd like
It would be convenient to share a general data transfer protocol among TensorFlow, PyTorch, MindSpore, NNRT, and so on. For example, we could always transfer NumPy arrays and convert the data type when the server receives the data, regardless of which AI framework the client uses:

protocol Myprotocol {
    data: extracted_features (numpy array),
    func CheckDataIntegrity(),
    func DataTypeTransfer() ... }

The DataTypeTransfer function converts the data from a NumPy array to the type used by the target AI platform.
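The proposed protocol could look roughly like the following. This is a sketch under stated assumptions: it uses only the standard library, so raw float32 bytes stand in for the NumPy array, and `FeaturePayload`, `pack_features`, `check_data_integrity`, `data_type_transfer`, and `to_rows` are all hypothetical names. On a real server the converter would build the framework's own type, for example from numpy.frombuffer(data, dtype="<f4").reshape(shape).

```python
import hashlib
import struct
from dataclasses import dataclass


@dataclass
class FeaturePayload:
    """Framework-neutral container for extracted features."""
    shape: tuple
    data: bytes       # little-endian float32 values
    checksum: str     # SHA-256 of `data`, used for the integrity check


def pack_features(values, shape):
    """Client side: serialize a flat list of floats into a payload."""
    data = struct.pack(f"<{len(values)}f", *values)
    return FeaturePayload(tuple(shape), data, hashlib.sha256(data).hexdigest())


def check_data_integrity(payload):
    """Server side: verify that the bytes were not corrupted in transit."""
    return hashlib.sha256(payload.data).hexdigest() == payload.checksum


def data_type_transfer(payload, converter):
    """Server side: decode the payload and hand it to a framework-specific converter."""
    if not check_data_integrity(payload):
        raise ValueError("feature payload failed its integrity check")
    count = len(payload.data) // struct.calcsize("<f")
    values = struct.unpack(f"<{count}f", payload.data)
    return converter(values, payload.shape)


def to_rows(values, shape):
    """Example converter for 2-D features: plain nested lists."""
    rows, cols = shape
    return [list(values[r * cols:(r + 1) * cols]) for r in range(rows)]


payload = pack_features([0.5, 1.0, 1.5, 2.0], shape=(2, 2))
features = data_type_transfer(payload, to_rows)
print(features)
```

Keeping the converter as a parameter means the wire format stays the same for every framework, and only the last step differs per server.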

[FR] Model/Feature Compression via Distillation, Quantization and Deep Compression

Is your feature request related to a problem? Please describe.

Some federated learning algorithms incur heavy network load when exchanging model parameters, features, and gradients. We should offer optional compression of the transferred model and feature data.

Describe the solution you'd like

Optional compression algorithms such as distillation, quantization, and deep compression can be implemented with the Processor interface. Servers and clients can choose to apply compression within the processor pipeline.
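To make the quantization option concrete, here is a minimal sketch of uniform 8-bit quantization on a flat list of floats. It is a generic illustration of the technique, not the method of the cited paper. The functions `quantize` and `dequantize` are hypothetical, and how they would be wrapped in Plato's Processor interface is left open.

```python
def quantize(values, num_bits=8):
    """Map floats to integer codes in [0, 2**num_bits - 1] with a uniform grid."""
    lowest, highest = min(values), max(values)
    levels = (1 << num_bits) - 1
    # A constant input has no range; any positive scale works in that case.
    scale = (highest - lowest) / levels if highest > lowest else 1.0
    codes = [round((value - lowest) / scale) for value in values]
    return codes, lowest, scale


def dequantize(codes, lowest, scale):
    """Reconstruct approximate floats from integer codes."""
    return [lowest + code * scale for code in codes]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes, lowest, scale = quantize(weights)
restored = dequantize(codes, lowest, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, max_error)
```

Each value is then sent as one byte plus two shared floats (`lowest` and `scale`) instead of four bytes, at the cost of a reconstruction error of at most half a grid step.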

Additional context

POLINO, Antonio; PASCANU, Razvan; ALISTARH, Dan. Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668, 2018.

[FR] Local Differential Privacy Methods

Is your feature request related to a problem? Please describe.
Currently there is only one implementation of local differential privacy (LDP): RAPPOR [1], implemented in https://github.com/TL-System/plato/blob/main/plato/utils/unary_encoding.py, and it is not decoupled from the algorithm implementations.

plato/plato/algorithms/mistnet.py

Lines 52 to 64 in fac44a6

    _randomize = getattr(self.trainer, "randomize", None)

    for inputs, targets, *__ in data_loader:
        with torch.no_grad():
            logits = self.model.forward_to(inputs, cut_layer)
        if epsilon is not None:
            logits = logits.detach().numpy()
            logits = unary_encoding.encode(logits)
            if callable(_randomize):
                logits = self.trainer.randomize(
                    logits, targets, epsilon)
            else:
                logits = unary_encoding.randomize(logits, epsilon)

plato/plato/algorithms/mindspore/mistnet.py

Lines 44 to 48 in fac44a6

    if epsilon is not None:
        logits = logits.asnumpy()
        logits = unary_encoding.encode(logits)
        logits = unary_encoding.randomize(logits, epsilon)
        logits = mindspore.Tensor(logits.astype('float32'))

plato/examples/nnrt/nnrt_algorithms/mistnet.py

Lines 60 to 65 in fac44a6

    if epsilon is not None:
        logits = unary_encoding.encode(logits)
        if callable(_randomize):
            logits = self.trainer.randomize(logits, targets, epsilon)
        else:
            logits = unary_encoding.randomize(logits, epsilon)

This feature request calls for a modular LDP plugin interface and a number of other methods, e.g., [2][3].

Describe the solution you'd like

~~Unified data exchange format between clients and server.~~
A modular interface for plugging in data processing modules into the server-client data exchange.
A config entry for enabling specific data processing modules.
LDP modules implementation.
Tests of the modules' theoretical properties, i.e., ε-LDP.
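As an example of what a pluggable LDP module and its property test could look like, the sketch below implements textbook symmetric unary encoding for one-hot inputs and checks the ε-LDP bound analytically. This is an assumption-laden illustration: the function names are hypothetical, and it is not necessarily what unary_encoding.randomize in Plato computes.

```python
import math
import random


def flip_probabilities(epsilon):
    """Return (p, q): p = P(a 1 stays 1), q = P(a 0 becomes 1)."""
    half = math.exp(epsilon / 2)
    return half / (half + 1), 1 / (half + 1)


def randomize_bits(bits, epsilon, rng=random):
    """Perturb each bit independently with the probabilities above."""
    p, q = flip_probabilities(epsilon)
    return [int(rng.random() < (p if bit else q)) for bit in bits]


def worst_case_ratio(epsilon):
    """Largest likelihood ratio of any output under two different one-hot inputs.

    Two one-hot vectors differ in exactly two positions, so the ratio is
    (p / q) * ((1 - q) / (1 - p)), which must not exceed exp(epsilon).
    """
    p, q = flip_probabilities(epsilon)
    return (p * (1 - q)) / ((1 - p) * q)


noisy = randomize_bits([0, 0, 1, 0], epsilon=1.0, rng=random.Random(0))
print(noisy, worst_case_ratio(1.0), math.exp(1.0))
```

A unit test for an LDP module can assert this kind of bound directly, in addition to any statistical checks on sampled outputs.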

Describe alternatives you've considered
To be filled.

Additional context
[1] Ú. Erlingsson, V. Pihur, and A. Korolova. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security, pages 1054–1067. ACM, 2014.
[2] Differential Privacy Team, Apple. Learning with privacy at scale. 2017.
[3] B. Ding, J. Kulkarni, and S. Yekhanin. Collecting telemetry data privately. In Advances in Neural Information Processing Systems 30, December 2017.

[BUG] Adaptive Freezing doesn't work correctly with CIFAR10 and ResNet18

Describe the bug

The adaptive freezing example (examples/adaptive_freezing) does not work correctly with the CIFAR10 dataset and the ResNet18 model.

To Reproduce

Run python examples/adaptive_freezing.py, but use CIFAR10 and ResNet18 in the .yml config file (examples/adaptive_freezing_CIFAR10_resnet18.yml).

Execution terminates with error:

line 114, in update_sync_mask
self.moving_average_deltas[name][indices], deltas[indices])
IndexError: too many indices for tensor of dimension 0

Expected behavior

Run completed successfully without errors.

TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'

servers/mistnet.py, line 111: self.accuracy = self.trainer.test(feature_dataset, Config().algorithm.cut_layer)
self.accuracy ends up as None, even though the value returned at line 256 of trainers/trainer.py (return accuracy) is not None.

Major discrepancies exist between local and global test accuracies.

Even when most of the samples in the dataset are used for local training, there exist major discrepancies between local and global test accuracies. The issue exists with WideResNet as the model, but other models may exhibit similar behaviour as this issue may not be model-specific.

[Bug] Running a custom dataset requires importing tensorflow_datasets

Describe the bug
Running a custom dataset requires importing tensorflow_datasets.

Screenshots
There seem to be several unnecessary dependencies when users configure their own dataset and load it from base.Datasource.

Traceback (most recent call last):
  File "aggregate.py", line 35, in <module>
    run_server()
  File "aggregate.py", line 29, in run_server
    chooser=simple_chooser)
  File "/home/lib/sedna/service/server/aggregation.py", line 325, in __init__
    from plato.servers import registry as server_registry
  File "/home/plato/plato/servers/registry.py", line 10, in <module>
    from plato.servers import (
  File "/home/plato/plato/servers/fedavg.py", line 11, in <module>
    from plato.algorithms import registry as algorithms_registry
  File "/home/plato/plato/algorithms/registry.py", line 23, in <module>
    from plato.algorithms.tensorflow import (
  File "/home/plato/plato/algorithms/tensorflow/fedavg.py", line 7, in <module>
    from plato.datasources import registry as datasources_registry
  File "/home/plato/plato/datasources/registry.py", line 19, in <module>
    from plato.datasources.tensorflow import (
  File "/home/plato/plato/datasources/tensorflow/mnist.py", line 5, in <module>
    import tensorflow_datasets as tfds
ModuleNotFoundError: No module named 'tensorflow_datasets'
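The traceback shows every registry importing all of its backends eagerly. One possible way around this is to import a backend only when it is requested. The sketch below is hypothetical: `_DATASOURCE_MODULES`, its entries, and `get_datasource_module` are illustrative names and not Plato's registry API, and a standard-library module stands in for a backend that is always available.

```python
import importlib

# Hypothetical mapping from data source names to the modules that implement them.
_DATASOURCE_MODULES = {
    "json_source": "json",                       # stand-in for an always-available backend
    "tf_mnist": "tensorflow_datasets",           # optional, only needed by TensorFlow users
    "missing_demo": "no_such_backend_for_demo",  # deliberately absent, for the demo
}


def get_datasource_module(name):
    """Import the backend for `name` on first use instead of at registry import time."""
    module_path = _DATASOURCE_MODULES[name]
    try:
        return importlib.import_module(module_path)
    except ModuleNotFoundError as error:
        raise ModuleNotFoundError(
            f"Data source '{name}' needs the optional package '{error.name}'"
        ) from error


backend = get_datasource_module("json_source")
print(backend.__name__)

try:
    get_datasource_module("missing_demo")
except ModuleNotFoundError as error:
    print(error)
```

With this arrangement, a user with a custom dataset never triggers the import of tensorflow_datasets, and a user who does need it gets a message naming the missing package.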

OS environment (please complete the following information):

OS: Ubuntu Linux
Version 20.04

[FR] Installation confusion with python3.6 and huge dependency.

Is your feature request related to a problem? Please describe.
I found the latest installation package for Plato, plato-learn==0.2.4, on PyPI. Installing it via pip raised two points of confusion:

pip could not find a version that satisfies the requirement plato-learn==0.2.4 with Python 3.6 and pip 21.2.4.
After installing via pip install "git+https://github.com/TL-System/plato.git", pip installs a very large number of packages. As a PyTorch user, why do I need TensorFlow 2 installed?

Describe the solution you'd like

Support Python 3.6 on PyPI
Eliminate or reduce Plato's dependencies in the requirements

[BUG] Unable to define the custom model for client and server

Describe the bug
For examples/base_siamese in the personalizedFL branch, I cannot define a custom model to be passed to both the client and the server. If the user passes an instantiated module, such as siamese_mnist_net.SiameseBase(), line 57 of plato/clients/simple.py is incorrect. If the user passes the class without instantiating it, such as siamese_mnist_net.SiameseBase, line 13 of plato/algorithms/fedavg.py uses the model without instantiation.

To Reproduce
Steps to reproduce the behavior:

Go to examples/base_siamese in the personalizedFL branch.
Run python examples/base_siamese/base_siamese.py -c examples/base_siamese/base_siamese_MNIST.yml.
You will see the two different errors mentioned above, depending on how the custom model is defined.

Expected behavior
The code is expected to run smoothly.

Additional context
The cause is clear: the Client and the Server register the model in different ways.
In the client, line 55 of plato/clients/simple.py registers the model as:

        if self.custom_model is not None:
            self.model = self.custom_model()
            self.custom_model = None

However, in the server, line 119 of plato/servers/fedavg.py registers the model as:

        if self.model is None and self.custom_model is not None:
            self.model = self.custom_model
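One possible way to make the two code paths agree is to normalize the argument in a single helper that accepts either a model class or an instance. This is a sketch of that idea, not a patch to Plato. `resolve_model` is a hypothetical helper and `SiameseBase` is a stand-in class. It uses inspect.isclass rather than callable() because model instances (for example torch.nn.Module objects) are themselves callable.

```python
import inspect


def resolve_model(custom_model):
    """Return a model instance whether a class or an instance was supplied."""
    if custom_model is None:
        return None
    if inspect.isclass(custom_model):
        return custom_model()  # a class was passed: instantiate it
    return custom_model        # already an instance: use it as is


class SiameseBase:
    """Stand-in for a user-defined model; callable like a model instance."""

    def __call__(self, inputs):
        return inputs


from_class = resolve_model(SiameseBase)
instance = SiameseBase()
from_instance = resolve_model(instance)
print(type(from_class).__name__, from_instance is instance)
```

If both the client and the server called such a helper, users could pass either form without hitting one of the two errors.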

The server in the mistnetplus example still uses socket.send() rather than send() in the base server class.

This may be problematic with virtual client IDs, used in simulated clients introduced in the current Plato release.

[BUG] Server's sampler cannot process received features from client side

Describe the bug
In server/mistnet.py, the sampler class should be initialized with a dataset that has the method num_train_examples(). After checking the code, I found that the current version passes the list of features directly to the sampler, which causes the error.
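A thin wrapper around the received feature list would give the sampler the interface it expects. The sketch below is hypothetical: `FeatureSource` is an illustrative name, and whether the sampler needs anything beyond num_train_examples() is an assumption.

```python
class FeatureSource:
    """Wrap a list of (feature, label) pairs received from clients."""

    def __init__(self, features):
        self.features = list(features)

    def num_train_examples(self):
        """Number of examples, as the sampler is described to require."""
        return len(self.features)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, index):
        return self.features[index]


received = [("feature_a", 0), ("feature_b", 1), ("feature_c", 0)]
source = FeatureSource(received)
print(source.num_train_examples(), source[1])
```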

To Reproduce
./run --config=configs/MNIST/mistnet_lenet5.yml

Screenshots

[RFC] In preparation for mobile clients

Isolation of project code and model data. A mobile app has to be packaged and is read-only, so an option to choose where the model and data are saved is recommended.
Synchronized configuration loading. A mobile app cannot load the config YAML from a terminal. A simple HTTP server providing the configuration is recommended.
Central logging facility. Logs from a mobile client cannot be extracted easily for now. Use of a log server is recommended.
Removal of unsupported requirements. datasets and zstd are not currently supported. Either we trace all their requirements and build them (not practical) or we work around them by other means.

There can be other undiscovered roadblocks for mobile clients.
Not sure if these requirements call for redesign of the project structure. I'm not experienced in mobile development either so there can be other issues.

[FR] More graceful handling of failed clients

Is your feature request related to a problem? Please describe.

When a client fails for whatever reason — most likely CUDA out of memory errors or other CUDA errors including the CUDA unknown error, currently the server will pause indefinitely waiting for the failed client to arrive.

This is sometimes what we need when debugging, but in large-scale production runs this is certainly not desirable, and it should be considered incorrect behaviour.

Describe the solution you'd like

We need a new configuration setting in the general section, called debug. When debug is false, the server should gracefully recover from a failed client by launching the client process again, and retrying the local training session with the same client ID that has previously failed. When debug is true, the server should print an error message, terminate itself, and terminate all processes. A developer can then use the -r command-line option to resume the federated learning session when needed.
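The requested behaviour could be sketched as a small retry wrapper. This is an illustration only: `run_client_with_recovery` and `flaky_training` are hypothetical names, RuntimeError stands in for a CUDA failure, and the `max_retries` bound is an added assumption, since the request itself does not say how often to retry.

```python
def run_client_with_recovery(train_fn, client_id, debug=False, max_retries=3):
    """Run one client's local training, retrying with the same client ID on failure."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return train_fn(client_id)
        except RuntimeError as error:
            # In debug mode, or once the retry budget is spent, surface the error
            # so the caller can terminate the session.
            if debug or attempts > max_retries:
                raise
            print(f"Client {client_id} failed ({error}); retrying, attempt {attempts + 1}")


# Demo: a training function that fails twice before succeeding.
failures_left = {"count": 2}
calls = []


def flaky_training(client_id):
    calls.append(client_id)
    if failures_left["count"] > 0:
        failures_left["count"] -= 1
        raise RuntimeError("CUDA out of memory")
    return f"report from client {client_id}"


report = run_client_with_recovery(flaky_training, client_id=7)
print(report, calls)
```

In a real server, the retry would relaunch the client process rather than call a function, but the control flow around the debug flag would be the same.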

Concurrency Issues on Data Downloading [BUG]

Describe the bug
Datasets that are not shipped with torch and thus have to be downloaded manually (e.g., cinic10, multimodal_base, pascal_voc, and tiny_imagenet) are currently downloaded as a whole (i.e., the whole training and testing datasets) in the constructors of the respective DataSource instances.

While this design may function well in a testing environment where the server and all clients are colocated on one machine, it may run into severe concurrency issues in situations such as the one described in Deploying a Plato Federated Learning Server in a Production Environment, which Plato also aims to support.

To see that, consider the two cases separately:

For the former case, it is always the server that first calls its configure() method, and only when the call returns does the server spawn clients on the same machine. In this way, when clients call their configure() independently, none of them needs to download the dataset again, as it was prepared as a whole during the initialization of the server.
For the latter case, however, the server may not be colocated with the clients. If a remote machine (where there is no server) hosts multiple clients and these clients are initialized concurrently, then the current design implies the possibility that these clients all (1) think that the desired data is not ready in local storage, and thus (2) download and preprocess (at least "unzip") the data concurrently. If this is the case,
1. network bandwidth, CPU cycles, and memory will be wasted on redundant work,
2. program runtime will be prolonged for the same reason, and more importantly,
3. unexpected stalls or faults may be caused by concurrent creation of the dataset on the file system.
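One conventional way to avoid the redundant downloads described above is to guard the download with a lock file and a completion marker. The following is a minimal sketch under stated assumptions: `ensure_dataset`, the `.lock` and `.ready` file names, and `fake_download` are hypothetical, threads stand in for client processes, and a crashed downloader would leave a stale lock that a real implementation has to handle.

```python
import os
import tempfile
import threading
import time


def ensure_dataset(data_dir, download_fn, poll_seconds=0.05, timeout_seconds=60.0):
    """Make sure the dataset in `data_dir` is downloaded exactly once."""
    os.makedirs(data_dir, exist_ok=True)
    done_marker = os.path.join(data_dir, ".ready")
    lock_path = os.path.join(data_dir, ".lock")
    deadline = time.monotonic() + timeout_seconds

    while not os.path.exists(done_marker):
        try:
            # O_EXCL makes creation atomic: only one caller can create the lock file.
            lock_fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"timed out waiting for {lock_path}")
            time.sleep(poll_seconds)  # another caller is downloading; wait for it
            continue
        try:
            if not os.path.exists(done_marker):
                download_fn(data_dir)
                open(done_marker, "w").close()  # written only after a full download
        finally:
            os.close(lock_fd)
            os.remove(lock_path)
    return data_dir


download_calls = []


def fake_download(target_dir):
    """Stand-in for downloading and unzipping a dataset."""
    download_calls.append(target_dir)
    time.sleep(0.1)
    with open(os.path.join(target_dir, "data.bin"), "wb") as handle:
        handle.write(b"demo")


with tempfile.TemporaryDirectory() as tmp:
    workers = [threading.Thread(target=ensure_dataset, args=(tmp, fake_download))
               for _ in range(4)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()

print(len(download_calls))
```

The same pattern works across processes on one machine because the lock lives in the file system rather than in memory.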

To Reproduce
This bug should conceptually make sense. We may provide the steps for reproducing it later, if necessary.

Additional context
We spotted this bug during the development of a new feature FEMNIST. Since the solution looks like a non-trivial design problem, we prefer seeking the authors' help before working out any immature solution.

[FR] The implementation of Resnet in Plato needs to be revised

Is your feature request related to a problem? Please describe.
It is strange that the ResNet implementation in Plato, plato/models/resnet.py, only supports CIFAR-like datasets with input size 32. The pooling layer before the fc layer in Plato is F.avg_pool2d, which is less effective.

Maybe the current implementation is to support the 'cut_layer'?
Still, it is unnecessary to reimplement a model ourselves, since torchvision already provides many models. If we want to remove some layers, we can safely replace them with nn.Identity().

Describe the solution you'd like
The most effective way is to follow the implementation of ResNet in torchvision, which uses:

    self.avgpool = nn.AdaptiveAvgPool2d((1, 1))

to support any input size.

I know the current implementation of ResNet replaces the 7x7 kernel of conv1 with a 3x3 kernel and removes the first max pooling layer to preserve spatial information for small inputs, such as 32x32 for CIFAR10.

However, this can be easily achieved by using the torchvision's implementation while setting:

    encoder.conv1 = nn.Conv2d(3,
                              64,
                              kernel_size=3,
                              stride=1,
                              padding=2,
                              bias=False)
    encoder.maxpool = nn.Identity()

Describe alternatives you've considered
Just reuse torchvision's implementations.

Additional context
In my own work, I use torchvision's implementations directly and revise the models slightly to suit my own requirements. This works well on many different datasets without errors. See models/encoders_register.py in the contrastive_adaptation branch.
