securefederatedai / openfl

An open framework for Federated Learning.

Home Page: https://openfl.readthedocs.io/en/latest/index.html

License: Apache License 2.0

Makefile 0.01% Shell 0.08% Jupyter Notebook 91.63% Python 8.28%
federated-learning secure-computation openfl distributed-computing deep-learning collaborative-learning privacy-preserving-machine-learning machine-learning fedprox fedavg

openfl's People

Contributors

aleksandr-mokrov, alexey-gruzdev, alexey-khorkin, bjklemme-intel, dependabot[bot], dl8, dmitryagapov, einse57, fstrr, github-actions[bot], igor-davidyuk, itrushkin, katerina-merkulova, kshannon, kta-intel, mansishr, maradionov, masterskepticista, operepel, porteratzo, psfoley, ptizzza, sarthakpati, snyk-bot, soda480, sun1lach, supriya-krishnamurthi, tonyreina, viktoriiaromanova, walteriviera

openfl's Issues

Same validation metric values across collaborators in Jupyter notebooks

Describe the bug
When launching federated training in the Jupyter notebooks of the openfl-tutorials folder, I noticed that different collaborators achieve the same validation metric values. That would be possible if the collaborators had the same data, but the data is randomly split. It looks like there is an issue with how the Data Loader is defined for each collaborator.

To Reproduce
Steps to reproduce the behavior:

  1. Open openfl-tutorials/Federated_PyTorch_UNET_Tutorial.ipynb in Jupyter.
  2. Execute all cells.

Expected behavior
Collaborators have different metrics due to random data split.
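
As a minimal sketch of the expected behavior, the snippet below deals shuffled sample indices into disjoint shards, one per collaborator. The function name split_indices and its parameters are illustrative assumptions and are not taken from the tutorial notebook; the point is only that each collaborator should end up with its own subset.

```python
import random

def split_indices(num_samples, num_collaborators, seed=0):
    """Shuffle sample indices once, then deal them out into disjoint shards."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    # Collaborator k takes every num_collaborators-th index, starting at k.
    return [indices[k::num_collaborators] for k in range(num_collaborators)]

shards = split_indices(num_samples=10, num_collaborators=2)
```

If two collaborators report identical metrics, checking that their index sets are disjoint (as above) is a quick way to rule out a shared data loader.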

Screenshot

image

Update gRPC for Tensorflow 2.4 compatibility

gRPC is currently pinned to version 1.30. TensorFlow 2.4+ requires a later version. The gRPC version was originally pinned because of sporadic network issues, but these are likely fixed by the change to short-lived gRPC client connections.

Redesigning Task Runner

  • Not require explicitly setting TensorKeys to track.
  • PyTorch/TensorFlow code can be imported directly into fx.
    Define easy to use python API:
    • Model Provider
    • Task registering
    • Data repository adapter
  • Implement example new task runner in Jupyter notebook
  • Build compatibility with new Data Loader interface
    Framework adapters for all currently supported frameworks
    • Tensorflow 2
    • PyTorch
  • Well defined Interface for user contributed plugins
  • Extensions to CoreTaskRunner
  • Custom task types (beyond ‘train’, ‘validate’)
  • Allow custom function arguments
  • Break tight coupling between model / dataloader to save native model to central node (aggregator) at end of experiment

Task Assigner Entity

Task Assigner:

  • the ability to assign different tasks to different envoys.
  • the ability to choose which envoys, out of all those connected to the Federation, run an experiment.
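
The two abilities above could look roughly like the following sketch. StaticTaskAssigner, its methods, and the envoy names are hypothetical and only illustrate the idea; this is not an existing OpenFL interface.

```python
from typing import Dict, List

class StaticTaskAssigner:
    """Hypothetical assigner: maps each selected envoy to its own task list."""

    def __init__(self, tasks_by_envoy: Dict[str, List[str]]):
        self.tasks_by_envoy = tasks_by_envoy

    def selected_envoys(self, connected: List[str]) -> List[str]:
        # Only envoys that are both connected and explicitly assigned take part.
        return [name for name in connected if name in self.tasks_by_envoy]

    def tasks_for(self, envoy: str) -> List[str]:
        # Envoys without an assignment get no tasks.
        return self.tasks_by_envoy.get(envoy, [])

assigner = StaticTaskAssigner({
    "envoy_one": ["train", "validate"],
    "envoy_two": ["validate"],
})
participants = assigner.selected_envoys(["envoy_one", "envoy_two", "envoy_three"])
```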

FX autocomplete feature

It would be great to have fx + TAB combination (like standard Linux autocomplete) feature for current CLI.

sdist distribution of openfl

Is your feature request related to a problem? Please describe.
Currently, the packaging is on wheel only, which is fine for pip but can introduce some issues when we run into mis-matched dependency versions when packaging openfl with other packages.

Describe the solution you'd like
Adding an sdist in addition to the wheel would make things much easier.

Describe alternatives you've considered
N.A.

Additional context
N.A.

Aggregator Tasks to replace standard end of round procedure

At the end of a round, the aggregator currently calls a set sequence of functions to compute the aggregation of the collaborator models and task metrics. This aggregation procedure is highly tuned for the specific set of tasks normally called in an experiment (aggregated_model_validation, train_batches, and local_model_validation). Adding new tasks with new TensorKey tags does not always behave as expected with this rigid aggregation procedure.

We should instead provide an interface where users can add their own aggregation tasks. This goes a step beyond the current AggregationFunctionInterface, because it would apply beyond TensorKeys marked with the ('trained',) tag and could be made more general. Aggregator Tasks could further be customized to be attached to collaborator tasks, run in sequence, or run one or several at the beginning / end of a round. The default set of Aggregator Tasks would execute at the end of the round. The first would compute the weighted average of metrics and report them, the second would run aggregation on the collaborator models with compression / decompression, and the decision logic for saving the best model could be a third (this would allow easy user customization for saving a model on a metric other than best accuracy).

The exact interface for the aggregator tasks is TBD, but the tasks should be given access to the TensorDB (read+write), the TensorCodec, and an interface to save models.
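
Since the interface is still TBD, the following is only a sketch of one possible shape: each aggregator task is a callable that reads from and writes to a shared store, and the tasks run in order at the end of a round. The plain dictionary standing in for the TensorDB, the key names, and the function names are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the TensorDB: a plain dict the tasks read and write.
TensorStore = Dict[str, object]
AggregatorTask = Callable[[TensorStore], None]

def weighted_metric_average(store: TensorStore) -> None:
    """Average each collaborator's metric, weighted by its sample count."""
    metrics = store["metrics"]   # {collaborator: metric value}
    weights = store["weights"]   # {collaborator: number of samples}
    total = sum(weights.values())
    store["aggregated_metric"] = sum(
        metrics[name] * weights[name] for name in metrics
    ) / total

def track_best_metric(store: TensorStore) -> None:
    """Remember the best aggregated metric; a real task would save the model here."""
    best = store.get("best_metric")
    if best is None or store["aggregated_metric"] > best:
        store["best_metric"] = store["aggregated_metric"]

def run_end_of_round(tasks: List[AggregatorTask], store: TensorStore) -> None:
    for task in tasks:  # tasks run in the order the user registered them
        task(store)

store = {"metrics": {"one": 0.5, "two": 1.0}, "weights": {"one": 1, "two": 3}}
run_end_of_round([weighted_metric_average, track_best_metric], store)
```

Swapping track_best_metric for a task that compares a different metric is the kind of customization the proposal describes.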

Connection issue between collaborator and aggregator

On some machines, when I run a federation (even if both the collaborators and the aggregator are on the same machine), they fail to establish a connection.
I started facing this problem specifically in the interactive API: I have been unable to run a federation because the collaborator cannot connect to the port where the aggregator has started the gRPC server.

Reproducing the error:
I did not do anything differently than what is already mentioned in the tutorial.
I created a fresh conda environment, installed the openfl library and finally tried to replicate the experiment.

We tried to debug this error and found that the aggregator's gRPC server runs exclusively on IPv6, whereas the collaborator tries to use IPv4. We even tried to hardcode the server and the port numbers, but we could not make it work. We suspect that the error has something to do with the way the gRPC servers are started in https://github.com/intel/openfl/blob/c3c0886aefeb09f426fc3726be0f65de2b344e22/openfl/transport/grpc/server.py and https://github.com/intel/openfl/blob/c3c0886aefeb09f426fc3726be0f65de2b344e22/openfl/transport/grpc/client.py
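
One way to narrow down an IPv4/IPv6 mismatch like this is to test which address families actually accept a TCP connection on the server's port. The helper below is a standard-library sketch and does not use gRPC or OpenFL; the function name reachable_families is an assumption.

```python
import socket

def reachable_families(host, port, timeout=1.0):
    """Return the address family names (e.g. 'AF_INET') that accept a TCP connection."""
    reachable = []
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            reachable.append(family.name)
        except OSError:
            pass  # nothing is accepting connections for this address family
    return reachable
```

Running it once against the host name the collaborator uses and once against "::1" or "127.0.0.1" directly would show whether the server is reachable over only one of the two families.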

I think this error can pose a potentially big problem in the future. Therefore, please look into it.

Thanks

Add option to log tensor_db contents on exit

OpenFL should provide an option to save a log of the TensorDB after a failure or when a user hits Ctrl-C. This will make debugging aggregated values significantly easier and allow users to submit more informative issues.
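
A minimal sketch of such an option, assuming the TensorDB contents can be represented as a JSON-serializable dictionary (the real TensorDB is not a plain dict, so this is an illustration only). It relies on atexit, whose handlers run on normal interpreter exit, including after an unhandled exception or KeyboardInterrupt, but not after os._exit() or a fatal signal.

```python
import atexit
import json

def install_dump_on_exit(tensor_db, path):
    """Register a handler that writes `tensor_db` to `path` as JSON on exit."""
    def dump():
        with open(path, "w") as handle:
            # default=str keeps the dump going for values JSON cannot encode.
            json.dump(tensor_db, handle, indent=2, default=str)

    atexit.register(dump)
    return dump  # returned so callers can also trigger or unregister it

# Usage (hypothetical names): install_dump_on_exit(db_contents, "tensor_db_dump.json")
```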

General configuration mechanism

I propose creating a general mechanism for receiving configurable data: default values would be removed from the code and set in a default configuration file instead. For example, the director would take parameters from the CLI if they are passed, otherwise from director.yaml in the director workspace, otherwise from openfl-workspace/default/director.yaml.
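
The lookup order described above (CLI, then workspace file, then default file) can be sketched with collections.ChainMap. The three dictionaries stand in for parsed CLI options and parsed YAML files, and the setting names are made up for the example.

```python
from collections import ChainMap

def resolve_settings(cli_args, workspace_config, default_config):
    """Look up each setting in the CLI first, then the workspace file, then defaults."""
    # Drop CLI options that were not passed so they do not mask the lower layers.
    passed = {key: value for key, value in cli_args.items() if value is not None}
    return dict(ChainMap(passed, workspace_config, default_config))

settings = resolve_settings(
    cli_args={"listen_port": 50051, "listen_host": None},
    workspace_config={"listen_host": "director.example"},
    default_config={"listen_host": "localhost", "listen_port": 50050, "tls": True},
)
```

Here listen_port comes from the CLI, listen_host from the workspace file, and tls from the defaults, so no default value needs to live in the code itself.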

Set default path for step-ca/step CLI binary and certificates

Set default path for step-ca/step CLI binary and certificates (i.e. ~/.local/workspace/) with standard naming convention ('director.crt', 'envoy_one.crt', 'envoy_two.crt', etc.) so that long-living entities can start without always providing paths for root_cert, cert, and private_key (defaults can still be overridden).
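
A sketch of how such defaults might be resolved, with explicit arguments still taking precedence. The directory and the '<name>.crt' convention come from the proposal; the file names root_ca.crt and '<name>.key' are assumptions added for the example.

```python
from pathlib import Path

DEFAULT_CERT_DIR = Path.home() / ".local" / "workspace"  # default proposed in the text

def cert_paths(name, root_cert=None, cert=None, private_key=None,
               cert_dir=DEFAULT_CERT_DIR):
    """Return certificate paths for an entity, falling back to the default layout."""
    return {
        # 'root_ca.crt' and '<name>.key' are assumed names, not an OpenFL convention.
        "root_cert": Path(root_cert) if root_cert else cert_dir / "root_ca.crt",
        "cert": Path(cert) if cert else cert_dir / f"{name}.crt",
        "private_key": Path(private_key) if private_key else cert_dir / f"{name}.key",
    }

director_paths = cert_paths("director")
```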

Slack Invite

Hello!

I hope whoever is reading this is safe and doing great. It seems that people outside Intel are unable to join the Slack. Could anyone provide an invite for people outside Intel?

Requirements for ShardDescriptor are installed separately from openfl requirements.

Describe the bug
To run an envoy we need to install extra requirements that are not installed with openfl.

To Reproduce
Steps to reproduce the behavior:

  1. clone openfl
  2. cd openfl
  3. pip install -e .
  4. cd openfl-tutorials/interactive_api/Director_Pytorch_Kvasir_UNET/envoy_folder
  5. source ./start_envoy.sh
  6. The envoy returns an error

Expected behavior
The envoy works correctly.

Model DB

As a user I would like to keep track of my Federated Learning experiments and plot statistics of the model performance.
One implementation may be using a Model Database, such as ModelDB https://github.com/VertaAI/modeldb
This could simply plug in to our current code via the Python API. There are some nice features such as model and data versioning (Git-like) and dashboards.

FX command - make it more independent

Is your feature request related to a problem? Please describe.
During the initialization of the federation (federated environment), the fx command is very useful. However, it performs some "additional tasks", that are typically not required (or, in the future may be problematic), and need to be 'rolled-back' manually.

A list of non-necessary actions:

  1. Do not call pip install -r requirements.txt inside fx workspace create

    • The command fx workspace create creates a workspace from a template, but it also calls pip install -r requirements.txt internally.
      For example, for the template tf_2dunet, it installs TensorFlow 2.3.1, which is not the current version as of 05/2021 (the current version is 2.4.1). There is therefore a good chance that it will roll back an already installed (and working) TensorFlow version in the user's Python environment to an earlier version, which may not work for them. This is also a problem if the user wants to modify the model and/or supply their own pretrained one.
  2. Do not check data folders inside fx plan initialize

    • The command fx plan initialize also takes into consideration the data paths set in the <workspace>\plan\data.yaml file. But since fx plan initialize is called on the aggregator, and the data folders for individual clients are on completely different computers, it must not be assumed that they are accessible from the aggregator.

Describe the solution you'd like

  1. Do not call pip ... from fx tool
  2. Do not check the existence of data paths, since the setup is performed on the Aggregator (which, by definition, does not have access to any data)

Parallel execution for Python API

The Python native API currently executes collaborators sequentially; however, this could be done in parallel, since their execution is independent.

for round_num in range(rounds_to_train):
    for col in plan.authorized_cols:

        collaborator = collaborators[col]
        model.set_data_loader(collaborator_dict[col].data_loader)

        if round_num != 0:
            model.rebuild_model(round_num, model_states[col])

        collaborator.run_simulation()

        model_states[col] = model.get_tensor_dict(with_opt_vars=True)
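
As a rough sketch of the parallel variant, the per-collaborator work of one round could be submitted to a thread pool. The names run_round and fake_training are placeholders, not OpenFL functions. Note that the loop above shares a single model object across collaborators, so a real change would presumably need one model instance per collaborator; the sketch assumes the work items are independent.

```python
from concurrent.futures import ThreadPoolExecutor

def run_round(collaborator_names, run_one):
    """Run `run_one(name)` for every collaborator concurrently and collect the results."""
    with ThreadPoolExecutor(max_workers=len(collaborator_names)) as pool:
        results = pool.map(run_one, collaborator_names)  # preserves input order
        return dict(zip(collaborator_names, results))

def fake_training(name):
    # Placeholder for the per-collaborator work done in the loop above.
    return f"{name}-state"

model_states = run_round(["one", "two"], fake_training)
```

Threads only help when the training code releases the GIL (as most native framework kernels do); a process pool would be the alternative for pure-Python workloads.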

Docker image - Manifest not found

Describe the bug
Trying to install docker image gives the following error:
$ sudo docker pull intel/openfl
Using default tag: latest
Error response from daemon: manifest for intel/openfl:latest not found: manifest unknown: manifest unknown

To Reproduce

  1. have a "clean" linux machine with docker installed
  2. sudo docker pull intel/openfl
  3. See error

Expected behavior
The intel/openfl image will be successfully installed

Desktop (please complete the following information):

  • OS: Linux Ubuntu 20.04
  • Docker version 20.10.6, build 370c289

Experiment order in Director

Is your feature request related to a problem? Please describe.
When an experiment is set on the director, the director instantly creates and runs an aggregator. It would be better if only one aggregator ran at a time.

Describe the solution you'd like
Create structure for experiment.
Create dict, list or queue for experiments.
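
A queue-based version of this idea might look like the sketch below, where at most one experiment runs at a time and the next one starts when the current one finishes. ExperimentQueue and its methods are hypothetical; starting an experiment is reduced to recording its name.

```python
from collections import deque

class ExperimentQueue:
    """Hypothetical director-side queue that runs one experiment at a time."""

    def __init__(self):
        self._pending = deque()
        self.running = None

    def submit(self, name):
        self._pending.append(name)
        self._start_next()

    def finish(self):
        # Called when the running experiment (and its aggregator) has ended.
        self.running = None
        self._start_next()

    def _start_next(self):
        if self.running is None and self._pending:
            self.running = self._pending.popleft()

queue = ExperimentQueue()
queue.submit("experiment_a")
queue.submit("experiment_b")  # waits until experiment_a finishes
```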

Director/Envoy imperfections

Docker image - enhance description on hub.docker.com

Is your feature request related to a problem? Please describe.
Currently, the image name "openfl" is ambiguous on Docker Hub, since there is a product called "Open Flash Library".
The "docker pull openfl" command (as stated in the documentation) will point to openfl/openfl, which is the other product:
https://hub.docker.com/r/openfl/openfl

Finding the correct docker image, intel/openfl, does not inspire much confidence, since there is no associated description.

Describe the solution you'd like

  1. Please add some more informative description to the Docker-Hub page of the project:
    https://hub.docker.com/r/intel/openfl
  2. Please update the documentation instruction to read:
    docker pull intel/openfl

Add `port` option for fx tutorial start

Is your feature request related to a problem? Please describe.
This is more of an enhancement. Some environments don't allow you to use specific ports (for example 8888, which seems to be the default port used by the Jupyter server included in fx). It would be good to allow users to specify the port they want.

Describe the solution you'd like
It would be good to allow users to specify the port they want. There is already an --ip option in fx tutorial start. Adding --port would solve the problem.
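
The requested option could be sketched as follows. This uses argparse to stay within the standard library, whereas fx itself is built on click (as the tracebacks elsewhere on this page show), so it illustrates the idea rather than the actual change. The default IP address and the helper names are assumptions.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="fx tutorial start")
    # The default address here is an assumption made for the example.
    parser.add_argument("--ip", default="127.0.0.1", help="address to bind to")
    # Hypothetical new option requested in this issue.
    parser.add_argument("--port", type=int, default=8888, help="port to serve on")
    return parser

def jupyter_command(args):
    """Build the notebook launch command from the parsed options."""
    return ["jupyter", "notebook", f"--ip={args.ip}", f"--port={args.port}"]

command = jupyter_command(build_parser().parse_args(["--port", "9999"]))
```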

Describe alternatives you've considered

Additional context
Logs from starting the notebook

[I 11:27:36.940 NotebookApp] Serving notebooks from local directory: /home/ubuntu/anaconda3/envs/test/lib/python3.6/site-packages/openfl-tutorials
[I 11:27:36.940 NotebookApp] Jupyter Notebook 6.4.0 is running at:
[I 11:27:36.940 NotebookApp] http://aggregator:8888/?token=f78c90d7649f1166bf83df9f0f6f69c6e605494b9b4a3a23
[I 11:27:36.940 NotebookApp]  or http://127.0.0.1:8888/?token=f78c90d7649f1166bf83df9f0f6f69c6e605494b9b4a3a23
[I 11:27:36.940 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 11:27:36.945 NotebookApp] No web browser found: could not locate runnable browser.

Conda recipe

Is your feature request related to a problem? Please describe.
Currently, OpenFL can only be installed through pip, which prevents it from being added to packages that require C/C++ libraries.

Describe the solution you'd like
A conda recipe would be very useful to mitigate this, which I am happy to work on. 😄

Describe alternatives you've considered
N.A.

Additional context
Needs #44

Dead code in run_experiment (native.py)?

Describe the bug
The "model_states" dictionary in the run_experiment function appears to be unused. Perhaps a hold-over from graph sharing?

I am able to comment out all lines involving "model_states" without any impact to the experiment that I can find.

Long-living entities: Director/Envoy

Long-Living entities

The idea behind introducing Long-Living entities is to explicitly separate the stage of setting up a Federation (which is a set of connected nodes) from the stage of running an experiment. This allows users to set up a Federation once, including the PKI exchange and the correct network settings, and then run multiple experiments within that one Federation.

To accomplish this goal we need to implement a few more logical entities:

  • Envoy. Envoy is a long-living entity that would be run on nodes with the dataset shards.
  • Director. Director is a long-living entity that would keep and manage the registry with Envoys. Director could receive a request from the frontend API layer to start an experiment and then will start an Aggregator and send a request to Envoys to start collaborators.
  • Frontend API layer - The API layer is not a long-living entity; it is a component that allows users to define FL experiments in a Python script or a Jupyter notebook. It is equipped with a Director gRPC client to register the FL task, model and hyper-parameters, set up an experiment, and trigger the start of its execution. The frontend API layer could be run on a less performant machine such as a typical laptop, since all computations would happen on Director/Envoy nodes.

For a simplified version of the proposed workflow, please refer to a picture:

image

Setup of 'fx' failed on conda 4.6.11, worked on 4.10.1

Describe the bug
When following the setup instructions using conda 4.6, the resulting conda environment failed to install openfl in the environment. While not an openfl bug, we may want to determine how to make these instructions work on older conda versions.
NOTE: I am using the 'develop' branch.

To Reproduce
Steps to reproduce the behavior:

  1. Install conda 4.6.11 on Ubuntu 18.04
  2. Follow the initial setup instructions for openfl
  3. Note that openfl has not been installed in the conda environment, thus 'fx' doesn't work. "which pip" and "which python" also don't point to the environment pip/python, either.
  4. Deactivate and remove the conda environment you created as part of openfl setup
  5. Update conda to 4.10.1
  6. Follow openfl setup instructions
  7. fx now works and is installed in the conda environment

Bug while creating local CA with https server

Bug while creating CA:

fx pki install -p </path/to/ca/dir> --ca-url <host:port>

Got an exception:

Password: 
Repeat for confirmation: 
[16:23:46] INFO     Creating CA                                                                                                                        ca.py:157
CA binaries from github will be downloaded now [Y/n]: y
[16:23:51] INFO     Downloading step-ca_linux_0.17.2_amd64.tar.gz.sig                                                                                   ca.py:57
EXCEPTION : Unknown archive format './step-ca_linux_0.17.2_amd64.tar.gz.sig'
Traceback (most recent call last):
...
  File "/home/akhorkin/.virtualenvs/openfl/bin/fx", line 8, in <module>
    sys.exit(entry())
  File "/home/akhorkin/.virtualenvs/openfl/lib/python3.8/site-packages/openfl/interface/cli.py", line 214, in entry
    error_handler(e)
  File "/home/akhorkin/.virtualenvs/openfl/lib/python3.8/site-packages/openfl/interface/cli.py", line 173, in error_handler
    raise error
  File "/home/akhorkin/.virtualenvs/openfl/lib/python3.8/site-packages/openfl/interface/cli.py", line 212, in entry
    cli()
  File "/home/akhorkin/.virtualenvs/openfl/lib/python3.8/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/home/akhorkin/.virtualenvs/openfl/lib/python3.8/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/home/akhorkin/.virtualenvs/openfl/lib/python3.8/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/akhorkin/.virtualenvs/openfl/lib/python3.8/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/akhorkin/.virtualenvs/openfl/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/akhorkin/.virtualenvs/openfl/lib/python3.8/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/home/akhorkin/.virtualenvs/openfl/lib/python3.8/site-packages/openfl/interface/pki.py", line 64, in install_
    install(ca_path, ca_url, password)
  File "/home/akhorkin/.virtualenvs/openfl/lib/python3.8/site-packages/openfl/component/ca/ca.py", line 168, in install
    download_step_bin(url, 'step-ca_linux', 'amd', prefix=ca_path, confirmation=False)
  File "/home/akhorkin/.virtualenvs/openfl/lib/python3.8/site-packages/openfl/component/ca/ca.py", line 59, in download_step_bin
    shutil.unpack_archive(f'{prefix}/{name}', f'{prefix}/step')
  File "/usr/local/lib/python3.8/shutil.py", line 1223, in unpack_archive
    raise ReadError("Unknown archive format '{0}'".format(filename))
shutil.ReadError: Unknown archive format './step-ca_linux_0.17.2_amd64.tar.gz.sig'

Desktop:

  • OS: Ubuntu 18.04.5 LTS
  • Python: Python 3.8.5

Expected behavior:

  • Need to download and unpack .zip archive
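
The traceback suggests that a .sig signature file was picked up instead of the archive itself. A sketch of a stricter selection step is shown below: it only accepts names that end in an archive suffix. The function pick_archive is hypothetical and is not the code in ca.py, and the second asset name is inferred by dropping the .sig suffix from the name in the log.

```python
def pick_archive(asset_names, prefix, arch):
    """Return the first release asset that is an archive, ignoring .sig signature files."""
    archive_suffixes = (".tar.gz", ".zip")
    for name in asset_names:
        if name.startswith(prefix) and arch in name and name.endswith(archive_suffixes):
            return name
    raise LookupError(f"no archive found for {prefix} / {arch}")

assets = [
    "step-ca_linux_0.17.2_amd64.tar.gz.sig",  # name taken from the log above
    "step-ca_linux_0.17.2_amd64.tar.gz",      # assumed name of the matching archive
]
chosen = pick_archive(assets, prefix="step-ca_linux", arch="amd")
```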

For 3D Image data

Hi there,

I am trying to process some 3D medical images (some .nii.gz files) with openFL but I am having some trouble doing so. My data loader is as follows: (data loader from 3D_unet model)

def get_dataset(self):

    self.num_train = int(self.numFiles * self.train_test_split)
    numValTest = self.numFiles - self.num_train
    ds = tf.data.Dataset.range(self.numFiles).shuffle(
        self.numFiles, self.random_seed)  # Shuffle the dataset
    ds_train = ds.take(self.num_train).shuffle(
        self.num_train, self.shard)  # Reshuffle based on shard
    ds_val_test = ds.skip(self.num_train)
    self.num_val = int(numValTest * self.validate_test_split)
    self.num_test = numValTest - self.num_val  # test set is what remains after validation
    ds_val = ds_val_test.take(self.num_val)
    ds_test = ds_val_test.skip(self.num_val)

    ds_train = ds_train.map(lambda x: tf.py_function(self.read_nifti_file,
                                                     [x, True], [tf.float32, tf.float32]),
                            num_parallel_calls=tf.data.experimental.AUTOTUNE)
    ds_val = ds_val.map(lambda x: tf.py_function(self.read_nifti_file,
                                                 [x, False], [tf.float32, tf.float32]),
                        num_parallel_calls=tf.data.experimental.AUTOTUNE)
    ds_test = ds_test.map(lambda x: tf.py_function(self.read_nifti_file,
                                                   [x, False], [tf.float32, tf.float32]),
                          num_parallel_calls=tf.data.experimental.AUTOTUNE)

    ds_train = ds_train.repeat()
    ds_train = ds_train.batch(self.batch_size)
    ds_train = ds_train.prefetch(tf.data.experimental.AUTOTUNE)

    batch_size_val = 4
    ds_val = ds_val.batch(batch_size_val)
    ds_val = ds_val.prefetch(tf.data.experimental.AUTOTUNE)

    batch_size_test = 1
    ds_test = ds_test.batch(batch_size_test)
    ds_test = ds_test.prefetch(tf.data.experimental.AUTOTUNE)

    return ds_train, ds_val, ds_test

This outputs the PrefetchObjects ds_train, ds_val and ds_test. However, according to the data loader file, I believe OpenFL expects data loaders to output X_train, y_train, X_valid and y_valid, and to support some follow-up operations on them (e.g., get batch). Personally, I would find it easier to have an option to use the PrefetchObjects directly instead of converting them to X_train, y_train, etc.


So I was wondering if OpenFL can have some ways to enable data loaders for the nii.gz files?

Thank you so much for your attention!

Regarding aggregation in openfl

Hi,

If we want to use privacy preserving technologies such as differential privacy in securing aggregation in openfl, are there any tutorials or class interfaces which we can override to include the added security?

Are there any tutorials or class interfaces in openfl through which custom aggregation algorithms other than federated averaging can be included? Edit: I just realized that new documentation on custom averaging has been added at https://openfl.readthedocs.io/en/latest/overriding_agg_fn.html
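
For illustration only, a custom aggregation step in the spirit of differential privacy often clips each collaborator's update and adds Gaussian noise to the average. The sketch below works on plain Python lists and is not tied to OpenFL's aggregation interface; the function name and parameters are assumptions, and choosing noise_std to meet a specific privacy budget is outside its scope.

```python
import math
import random

def clipped_noisy_average(updates, clip_norm, noise_std, rng=None):
    """Clip each update to `clip_norm` (L2 norm), average, then add Gaussian noise."""
    rng = rng or random.Random()
    clipped = []
    for update in updates:
        norm = math.sqrt(sum(value * value for value in update))
        # Scale down only updates whose norm exceeds the clipping threshold.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        clipped.append([value * scale for value in update])
    count = len(clipped)
    return [
        sum(column) / count + rng.gauss(0.0, noise_std)
        for column in zip(*clipped)
    ]
```

With noise_std set to 0 the function reduces to a plain average of clipped updates, which makes the clipping step easy to check in isolation.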

Thanks.

Client training performance of a container always better than others

I'm a student trying to understand this FL framework. I started with the notebook tutorials, such as "new_python_api_Tensorflow_MNIST.ipynb". I ran the scenario with 2 collaborators, each in its own container. I noticed that one collaborator always (over more than 10 runs) has better accuracy metrics than the other, even though:

  • both collaborators have the same resources (RAM, CPUs, etc.);
  • the MNIST dataset elements are shuffled, and the do_sharding method assigns different portions of the whole dataset to the clients;
  • both collaborators use the same model.

Could this be a feature of this FL framework that I don't know about, or have I perhaps not understood how the learning phases work? Thanks for your patience and support.

[Keras_MNIST_Tutorial]: ValueError: Aggregator does not have an aggregated tensor for TensorKey

Hi all, I built the openfl docker image (with current master), and I'm trying the Keras MNIST tutorial in a docker container. However, I currently get the following error:

final_fl_model = fx.run_experiment(collaborators,override_config={'aggregator.settings.rounds_to_train':5})
  File "/usr/local/lib/python3.8/dist-packages/openfl/native/native.py", line 297, in run_experiment
    collaborator.run_simulation()
  File "/usr/local/lib/python3.8/dist-packages/openfl/component/collaborator/collaborator.py", line 147, in run_simulation
    self.do_task(task, round_number)
  File "/usr/local/lib/python3.8/dist-packages/openfl/component/collaborator/collaborator.py", line 192, in do_task
    input_tensor_dict = self.get_numpy_dict_for_tensorkeys(
  File "/usr/local/lib/python3.8/dist-packages/openfl/component/collaborator/collaborator.py", line 214, in get_numpy_dict_for_tensorkeys
    return {k.tensor_name: self.get_data_for_tensorkey(k) for k in tensor_keys}
  File "/usr/local/lib/python3.8/dist-packages/openfl/component/collaborator/collaborator.py", line 214, in <dictcomp>
    return {k.tensor_name: self.get_data_for_tensorkey(k) for k in tensor_keys}
  File "/usr/local/lib/python3.8/dist-packages/openfl/component/collaborator/collaborator.py", line 290, in get_data_for_tensorkey
    nparray = self.get_aggregated_tensor_from_aggregator(
  File "/usr/local/lib/python3.8/dist-packages/openfl/component/collaborator/collaborator.py", line 328, in get_aggregated_tensor_from_aggregator
    tensor = self.client.get_aggregated_tensor(
  File "/usr/local/lib/python3.8/dist-packages/openfl/component/aggregator/aggregator.py", line 334, in get_aggregated_tensor
    raise ValueError("Aggregator does not have an aggregated tensor"
ValueError: Aggregator does not have an aggregated tensor for TensorKey(tensor_name='dense_3/kernel:0', origin='aggregator_plan.yaml_a379411e', round_number=0, report=False, tags=('model',))

Working environment in docker container

  • Python 3.8.0
  • intel-tensorflow 2.3.0

To Reproduce

  • clone the openfl repos, and build image: ./scripts/build_base_docker_image.sh
  • Run docker: docker run -p 8888 -it --rm openfl:latest bash
  • Try out the tutorial: openfl-tutorials/Federated_Keras_MNIST_Tutorial.ipynb

I think there is an issue here: https://github.com/intel/openfl/blob/1aa2b16509a1a9a97983760a45aa1e5f133e9e30/openfl/native/native.py#L288
since the first collaborator's model is not initialized in the same way as the last collaborator's: https://github.com/intel/openfl/blob/1aa2b16509a1a9a97983760a45aa1e5f133e9e30/openfl/native/native.py#L259
In addition, there is a shallow copy of the plan for each collaborator: https://github.com/intel/openfl/blob/develop/openfl/native/native.py#L206
Could you please help to check?
Thank you!
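
The shallow-copy concern raised in this issue can be illustrated in isolation. In the sketch below, a nested dictionary stands in for a plan (the keys are made up); a shallow copy shares the nested settings with the original, while a deep copy does not.

```python
import copy

# Hypothetical stand-in for a plan with nested settings.
plan = {"data_loader": {"settings": {"data_path": "1"}}}

shallow = copy.copy(plan)
deep = copy.deepcopy(plan)

# Changing a nested value through the shallow copy also changes the original...
shallow["data_loader"]["settings"]["data_path"] = "2"
# ...while the deep copy keeps its own independent nested dictionaries.
```

If per-collaborator plans are shallow copies, a setting changed for one collaborator would leak into the others in the same way.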

DefaultCPUAllocator: can't allocate memory: you tried to allocate 451477504 bytes.

Describe the bug
The collaborator gives the following error: EXCEPTION : [enforce fail at CPUAllocator.cpp:65] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 451477504 bytes. Error code 12 (Cannot allocate memory). This happens while running the 'New Interactive Python API (experimental) notebook' (PyTorch using the Kvasir dataset).
After successfully running round 0, the collaborators receive the new weights from the aggregator and one of the collaborators crashes with the above error. We checked our RAM and disk space, and both were sufficient before this exception.

To Reproduce
Steps to reproduce the behaviour:

  1. We are running Aggregator and 2 Collaborators on distributed machines.
  2. On Aggregator we run the python code interactively
  3. On Collaborator1 we run this command fx collaborator start -d data.yaml -n one
  4. On Collaborator2 we run this command fx collaborator start -d data.yaml -n two

Expected behavior
The model should run more rounds.

Screenshots

Screen1

Desktop (please complete the following information):

  • OS: [ubuntu]
  • Browser [18.04]
  • Version [Python 3.7.6]
  • OpenFL Version [1.1]

Setting-up the federation -- fx plan init -- does not work

Describe the bug
I am trying to setup a federation based on the '' following the documentation written here

The problem is, that the command fx plan initialize (as mentioned in the point 7) fails due to the checks for non-existing data folders. In default setup, it looks for path (which seems to be some 'leftovers' from your development environment), and even after specifying the local paths, it tries to look for them somewhere else.

To Reproduce

Steps to reproduce the behavior:

  1. Fresh install (windows or Linux machine), fresh conda environment (named 'open-fl'), installed pip openfl package
    • on Windows installed pip package from source (branch develop, commit: 0412c82)
    • fx command is running
  2. chosen template tf_2dunet
  3. Setting some custom configuration:
    export WORKSPACE_TEMPLATE=tf_2dunet
    export WORKSPACE_PATH=${HOME}/projects/my-work/openfl-federations/federation_0.2
  4. changing directory to cd ${WORKSPACE_PATH}
  5. Running:
    fx workspace create --prefix ${WORKSPACE_PATH} --template ${WORKSPACE_TEMPLATE}
    • the command finishes successfully - the workspace is created, the requirements from requirements.txt are installed via pip.
  6. running pip install -r requirements.txt manually, as mentioned in the point 6, of the tutorial is not necessary
    => I would suggest that fx command will not update pip requirements.
  7. running command fx plan initialize ends with the error:
    EXCEPTION : [Errno 2] No such file or directory: "'/raid/datasets/BraTS17/by_institution_NIfTY/1'"
    • please see screenshot 1 below - screenshot from Linux machine
    • error log:

EXCEPTION : [Errno 2] No such file or directory: "'/raid/datasets/BraTS17/by_institution_NIfTY/1'"
Traceback (most recent call last):
File "/home/rstoklas/miniconda3/envs/open-fl/bin/fx", line 8, in
sys.exit(entry())
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/openfl/interface/cli.py", line 194, in entry
error_handler(e)
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/openfl/interface/cli.py", line 155, in error_handler
raise error
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/openfl/interface/cli.py", line 192, in entry
cli()
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/click/core.py", line 829, in call
return self.main(*args, **kwargs)
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/click/decorators.py", line 21, in new_func
return f(get_current_context(), *args, **kwargs)
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/openfl/interface/plan.py", line 78, in initialize
task_runner = plan.get_task_runner(collaborator_cname)
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/openfl/federated/plan/plan.py", line 298, in get_task_runner
defaults[SETTINGS]['data_loader'] = self.get_data_loader(
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/openfl/federated/plan/plan.py", line 286, in get_data_loader
self.loader_ = Plan.Build(**defaults)
File "/home/rstoklas/miniconda3/envs/open-fl/lib/python3.8/site-packages/openfl/federated/plan/plan.py", line 173, in Build
instance = getattr(module, class_name)(**settings)
File "/home/rstoklas/projects/my-work/openfl-federations/federation_0.2/code/tfbrats_inmemory.py", line 29, in __init__
X_train, y_train, X_valid, y_valid = load_from_NIfTI(parent_dir=data_path,
File "/home/rstoklas/projects/my-work/openfl-federations/federation_0.2/code/brats_utils.py", line 93, in load_from_NIfTI
subdirs = os.listdir(path)
FileNotFoundError: [Errno 2] No such file or directory: "'/raid/datasets/BraTS17/by_institution_NIfTY/1'"

  8. Even when I change the paths in plan/data.yaml to point to existing directories, it fails:
    • Exception:
    • Please see screenshot #2
    • Error message:

{'01-win': 'data/client-01', '02-pegas': 'data/client-02', '03-pegas': 'data/client-03'}
INFO Building 🡆 Object TensorFlowBratsInMemory from code.tfbrats_inmemory Module. plan.py:168
INFO Settings 🡆 {'batch_size': 64, 'percent_train': 0.8, 'collaborator_count': 2, 'data_group_name': 'brats', 'data_path': plan.py:171
'data/client-01'}
INFO Override 🡆 {'defaults': 'plan/defaults/data_loader.yaml'} plan.py:173
EXCEPTION : need at least one array to concatenate
Traceback (most recent call last):
File "c:\anaconda3\envs\open-fl\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "c:\anaconda3\envs\open-fl\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Anaconda3\envs\open-fl\Scripts\fx.exe\__main__.py", line 7, in <module>
File "c:\anaconda3\envs\open-fl\lib\site-packages\openfl\interface\cli.py", line 194, in entry
error_handler(e)
File "c:\anaconda3\envs\open-fl\lib\site-packages\openfl\interface\cli.py", line 155, in error_handler
raise error
File "c:\anaconda3\envs\open-fl\lib\site-packages\openfl\interface\cli.py", line 192, in entry
cli()
File "c:\anaconda3\envs\open-fl\lib\site-packages\click\core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "c:\anaconda3\envs\open-fl\lib\site-packages\click\core.py", line 782, in main
rv = self.invoke(ctx)
File "c:\anaconda3\envs\open-fl\lib\site-packages\click\core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "c:\anaconda3\envs\open-fl\lib\site-packages\click\core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "c:\anaconda3\envs\open-fl\lib\site-packages\click\core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "c:\anaconda3\envs\open-fl\lib\site-packages\click\core.py", line 610, in invoke
return callback(*args, **kwargs)
File "c:\anaconda3\envs\open-fl\lib\site-packages\click\decorators.py", line 21, in new_func
return f(get_current_context(), *args, **kwargs)
File "C:\Anaconda3\envs\open-fl\Lib\site-packages\openfl\interface\plan.py", line 77, in initialize
data_loader = plan.get_data_loader(collaborator_cname)
File "c:\anaconda3\envs\open-fl\lib\site-packages\openfl\federated\plan\plan.py", line 293, in get_data_loader
self.loader_ = Plan.Build(**defaults)
File "c:\anaconda3\envs\open-fl\lib\site-packages\openfl\federated\plan\plan.py", line 179, in Build
instance = getattr(module, class_name)(**settings)
File "C:\Users\rstoklas\cernbox\work\my-projects\FL-phase-3_network\federation-0.1\code\tfbrats_inmemory.py", line 29, in __init__
X_train, y_train, X_valid, y_valid = load_from_NIfTI(parent_dir=data_path,
File "C:\Users\rstoklas\cernbox\work\my-projects\FL-phase-3_network\federation-0.1\code\brats_utils.py", line 125, in load_from_NIfTI
imgs_train = np.concatenate(imgs_all_train, axis=0)
File "<array_function internals>", line 5, in concatenate
ValueError: need at least one array to concatenate
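
For context, NumPy's `np.concatenate` raises this ValueError when it is handed an empty sequence, so the traceback suggests the loader found no usable image data under `data/client-01` (a relative path, which resolves against the current working directory). Below is a minimal pre-flight sketch. It assumes the loader expects at least one subdirectory per data path, which is only inferred from the `os.listdir` call in the first traceback. `check_data_paths` is a hypothetical helper, not part of OpenFL.

```python
import tempfile
from pathlib import Path


def check_data_paths(paths_by_collaborator):
    """Map each collaborator with an unusable data path to a short reason.

    Hypothetical helper: it only checks that the directory exists and has at
    least one subdirectory, which is an assumption about the expected layout.
    """
    problems = {}
    for name, raw_path in paths_by_collaborator.items():
        path = Path(raw_path)
        if not path.is_dir():
            problems[name] = f"{path} is not an existing directory"
        elif not any(child.is_dir() for child in path.iterdir()):
            problems[name] = f"{path} has no subdirectories to load"
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # One populated directory, one empty directory, one missing directory.
        (Path(root) / "client-01" / "subject-a").mkdir(parents=True)
        (Path(root) / "client-02").mkdir()
        mapping = {
            "01-win": str(Path(root) / "client-01"),
            "02-pegas": str(Path(root) / "client-02"),
            "03-pegas": str(Path(root) / "client-03"),
        }
        for name, reason in check_data_paths(mapping).items():
            print(f"{name}: {reason}")
```

Running a check like this before `fx plan initialize` would at least separate "the path is wrong" from "the path exists but is empty".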

Expected behavior

  1. I would expect that all steps in the tutorial will succeed
  2. I would expect that at the end of the tutorial, I will end up with a working federated environment
  3. I would expect that the setup tools will not require access to the data (since the setup is performed on the aggregator, while the data are on the nodes, to which the aggregator does not have access)

Screenshots
If applicable, add screenshots to help explain your problem.

Error with defaults paths:
2021-05-07 - OpenFL plan init failed

Error with modified and correct paths:
2021-05-07 - OpenFL plan init failed-2

Desktop (please complete the following information):

  • OS: Ubuntu 20.04, Windows 10

Scalable PKIs

To avoid manually copying PKI certificates between the Aggregator and the Collaborators, we need an automated system for certificate exchange.

File not found issue when doing fx plan initialize

Hi there,

I was trying to run fx plan initialize but encountered an error from the data loader, as follows:

fx workspace create --prefix ${HOME}/2dunet --template tf_2dunet
pip install -r requirements.txt
fx plan initialize
FileNotFoundError: [Errno 2] No such file or directory: "'/raid/datasets/BraTS17/by_institution_NIfTY/1'"

We also tried paths that are valid and certainly exist, such as '/home/', but the error message is still there, saying No such file or directory: "'/home/'"
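
One detail in the message may be relevant: Python prints the path in the error as a quoted string, so `"'/home/'"` means the string handed to the OS itself begins and ends with a single quote. A path like that does not exist even when `/home/` does. The sketch below reproduces the effect with a temporary directory. `normalize_path` is a hypothetical helper for illustration, and where the extra quotes come from in the workspace is not established here.

```python
import os
import tempfile


def normalize_path(raw):
    """Strip whitespace and one pair of matching surrounding quotes."""
    path = raw.strip()
    if len(path) >= 2 and path[0] == path[-1] and path[0] in "'\"":
        path = path[1:-1]
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as real_dir:
        quoted = f"'{real_dir}'"  # same shape as the path in the error message
        try:
            os.listdir(quoted)
        except OSError as err:
            # The directory exists, but the quoted string does not name it.
            print("quoted path failed:", err)
        # After stripping the literal quotes the listing succeeds.
        print("normalized path entries:", os.listdir(normalize_path(quoted)))
```

If the quotes come from the path entry in the plan's data file, writing the path there without quotes would be the first thing to try.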

openfl package not found

Describe the bug
I am following the OpenFL documentation instructions here, but I get the following error when running pip install openfl:

ERROR: Could not find a version that satisfies the requirement openfl
ERROR: No matching distribution found for openfl

  • OS: macOS Big Sur

Create tests for tutorials' notebooks

There are no tests for openfl-tutorials/interactive_api.
These notebooks cover the core OpenFL scenarios, and if a change breaks this functionality it should be fixed as soon as possible, because they are the entry point for new users. If something is broken here, a user may decide that the whole library does not work.
It would be great to create an environment for these notebooks and run them on CI.
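
As a starting point, here is a rough sketch of a notebook smoke test built on `jupyter nbconvert --to notebook --execute`, which exits with a non-zero status when a cell raises. The helper names, the timeout, and the output directory are illustrative assumptions. Any services the notebooks depend on would have to be started separately.

```python
import subprocess
import tempfile
from pathlib import Path


def find_notebooks(root):
    """Return all notebooks under root, skipping Jupyter checkpoint copies."""
    return sorted(
        p for p in Path(root).rglob("*.ipynb")
        if ".ipynb_checkpoints" not in p.parts
    )


def nbconvert_command(notebook, output_dir, timeout_s=600):
    """Build the command that executes one notebook from top to bottom."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        f"--ExecutePreprocessor.timeout={timeout_s}",
        "--output-dir", str(output_dir),
        str(notebook),
    ]


def run_all(root, output_dir):
    """Execute every notebook; check=True raises on the first failure."""
    for notebook in find_notebooks(root):
        subprocess.run(nbconvert_command(notebook, output_dir), check=True)


if __name__ == "__main__":
    # Demonstrate discovery only, using empty placeholder files.
    with tempfile.TemporaryDirectory() as root:
        (Path(root) / "a").mkdir()
        (Path(root) / "a" / "demo.ipynb").touch()
        (Path(root) / "a" / ".ipynb_checkpoints").mkdir()
        (Path(root) / "a" / ".ipynb_checkpoints" / "demo-checkpoint.ipynb").touch()
        for nb in find_notebooks(root):
            print(" ".join(nbconvert_command(nb, Path(root) / "out")))
```

A CI job could call `run_all("openfl-tutorials/interactive_api", "executed")` inside a dedicated environment and fail the build on any exception.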
