nvidia / nvflare

NVIDIA Federated Learning Application Runtime Environment

Home Page: https://nvflare.readthedocs.io/en/main/index.html

License: Apache License 2.0

Shell 0.37% Python 91.29% HTML 0.27% Dockerfile 0.01% Jupyter Notebook 7.74% JavaScript 0.01% CMake 0.03% C++ 0.28%

nvflare's Introduction

NVIDIA FLARE

NVIDIA FLARE (NVIDIA Federated Learning Application Runtime Environment) is a domain-agnostic, open-source, extensible SDK that allows researchers and data scientists to adapt existing ML/DL workflows to a federated paradigm. It enables platform developers to build a secure, privacy-preserving offering for a distributed multi-party collaboration.

Features

FLARE is built on a componentized architecture that allows you to take federated learning workloads from research and simulation to real-world production deployment.

Application Features

Support both deep learning and traditional machine learning algorithms (e.g., PyTorch, TensorFlow, Scikit-learn, XGBoost, etc.)
Support horizontal and vertical federated learning
Built-in Federated Learning algorithms (e.g., FedAvg, FedProx, FedOpt, Scaffold, Ditto, etc.)
Support multiple server and client-controlled training workflows (e.g., scatter & gather, cyclic) and validation workflows (global model evaluation, cross-site validation)
Support both data analytics (federated statistics) and machine learning lifecycle management
Privacy preservation with differential privacy, homomorphic encryption, private set intersection (PSI)

From Simulation to Real-World

FLARE Client API to transition seamlessly from ML/DL to FL with minimal code changes
Simulator and POC mode for rapid development and prototyping
Fully customizable and extensible components with modular design
Deployment on cloud and on-premise
Dashboard for project management and deployment
Security enforcement through federated authorization and privacy policy
Built-in support for system resiliency and fault tolerance

Take a look at NVIDIA FLARE Overview for a complete overview, and What's New for the latest changes.

Installation

To install the current release:

$ python3 -m pip install nvflare

Getting Started

You can quickly get started using the FL simulator. A detailed getting started guide is available in the documentation.

Examples and notebook tutorials are located at NVFlare/examples.

Community

We welcome community contributions! Please refer to the contributing guidelines for more details.

Ask and answer questions, share ideas, and engage with other community members at NVFlare Discussions.

Related Talks and Publications

Take a look at our growing list of talks, blogs, and publications related to NVIDIA FLARE.

License

NVIDIA FLARE is released under an Apache 2.0 license.

nvflare's People

isaacyangsla madil90 yhwen drbeh yuantinghsieh jamindy frenz86 holgerroth yanchengnv chomolungma stjordanis ziyuexu77 npann xxlya nloppi devhliu taleinat nvkevlu syangster pxli frankfanslc vkullu nintorac nvidianz warrentseng jooycelinn asus-ocis arnovanhilten js-ts wyli python-repository-hub fisayoadeyemi shism2 chesterxgchen byeonggichae coolbigdataguy yanxuanliu takeshineshiro cloudrainstar parkeraddison edsun3941 answerdigital world-worst-detector whuhxb mirsci yiheng-wang-nv lsnyd rmadamson snapbuy eordentlich mayi140611 c00cjz00 siomvas pnunes14 mgechols rongou can-zhao 5l1v3r1 josephtlucas jeffwan phoenixdigitalfx wonjoon-yun hshantala ajulyav kuihao kumoliu yodck kkersten aistudio-ml jjalko a-parida12 garybrims calvin89029 aymanjeb guopengf john-cipherome dzanidr sertugf chesterguan sourcegraph-ce joolstorrentecalo xahram spebern oliversaldanha25 araram96 amyseoj1 harishnference falibabaei elicharlese dalmouiee wangxiaoyunnv vananle jeremy313 asbaharoon java-cds-club williamlindskog tmu2023 bpdanek xander-aphe-hatschi steffessonatype

nvflare's Issues

Update streaming widgets docstrings and clean up

The docstrings of the streaming widgets need to be updated. This is a follow-up to #117.

multi_gpu option in client sub_start.sh

This option in sub_start.sh is deprecated. We have to remove that from the template file.

enhance the Learner_Executor to check the Learner return code

When the client training task is aborted or runs into an error, it may return a Shareable whose ReturnCode is not OK.
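The check described above can be sketched as follows. This is a minimal illustration only: `ReturnCode` and `Result` here are hypothetical stand-ins defined in the snippet, not the NVFlare classes, and `check_learner_result` is an invented helper name.

```python
from enum import Enum


class ReturnCode(str, Enum):
    """Hypothetical stand-in for a return-code enum (not the NVFlare class)."""
    OK = "OK"
    TASK_ABORTED = "TASK_ABORTED"
    EXECUTION_EXCEPTION = "EXECUTION_EXCEPTION"


class Result:
    """Hypothetical stand-in for the Shareable that a Learner returns."""

    def __init__(self, return_code=ReturnCode.OK, payload=None):
        self.return_code = return_code
        self.payload = payload


def check_learner_result(result):
    """Return the payload when the code is OK; otherwise raise with the code."""
    if result.return_code != ReturnCode.OK:
        raise RuntimeError(f"learner returned {result.return_code.value}")
    return result.payload
```

An executor built this way would stop treating an aborted or failed task as a valid training result, because the non-OK code surfaces before the payload is used.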

Update TensorBoard streaming test_apps

Since fed_event is not guaranteed to be sent, we should not require that the server side has the client TensorBoard files.

Errors in streaming.py

@yanchengnv noticed some issues in nvflare/app_common/widgets/streaming.py:

Line 47 to Line 52: the checking of the args and the error messages are wrong.
All the write_xxx() methods should check the tag and data args and make sure they are what we expect (str, dict, …).
Line 257: in the call to self.log_xxx(), we should set send_event=False; otherwise it may cause recursive events.
Since fed events are handled by a separate thread, there is a potential race condition in which a fed event is fired after the END_RUN event. In the Receiver code, we need to make sure to discard any events that arrive after END_RUN (and hence finalize) is done.
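Two of the points above (argument checking and discarding events after END_RUN) can be sketched together. This is an assumption-laden illustration: `MetricsReceiver`, `write_scalar`, and `finalize` are hypothetical names defined here, not the actual code in streaming.py.

```python
import threading


class MetricsReceiver:
    """Hypothetical receiver used only to illustrate the two checks."""

    def __init__(self):
        self._lock = threading.Lock()
        self._finalized = False
        self.records = []

    def write_scalar(self, tag, value):
        # Validate argument types before doing anything else.
        if not isinstance(tag, str):
            raise TypeError(f"expect tag to be str but got {type(tag).__name__}")
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"expect value to be a number but got {type(value).__name__}")
        with self._lock:
            # Events that arrive after finalize() are silently discarded.
            if self._finalized:
                return False
            self.records.append((tag, float(value)))
            return True

    def finalize(self):
        """Called when END_RUN is handled; later events are ignored."""
        with self._lock:
            self._finalized = True
```

Holding the same lock in both `write_scalar` and `finalize` means a late event from another thread is either recorded before finalization or dropped after it, never written halfway through.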

Create workflow to automatically build docs and deploy them to the pages site

Deploying a new version of the documentation to the pages site is now a manual process requiring a rebuild. Using a workflow, the docs can be rebuilt and deployed automatically.

shell scripts missing x permission in poc

The shell script files generated from poc command do not have original permission settings, especially the execute permission, after switching to shutil.unpack_archive.
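One possible workaround, sketched under the assumption that the scripts can be recognized by a `.sh` suffix, is to restore the execute bits after unpacking. The helper name `unpack_and_mark_scripts` is hypothetical and is not part of NVFlare.

```python
import shutil
import stat
from pathlib import Path


def unpack_and_mark_scripts(archive_path, dest_dir):
    """Unpack an archive, then add execute permission to every *.sh file."""
    shutil.unpack_archive(str(archive_path), str(dest_dir))
    fixed = []
    for script in Path(dest_dir).rglob("*.sh"):
        mode = script.stat().st_mode
        # Add the execute bit for user, group, and others.
        script.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
        fixed.append(script)
    return fixed
```

This keeps `shutil.unpack_archive` in place and repairs the permissions in a second pass, rather than relying on the archive format to carry them.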

Deploying FL on multiple computers.

I am trying to run NVFlare in a realistic setup with multiple computers. After the provisioning steps, I started the server, clients, and admin from their startup packages. The server started, but the client and admin computers reported a communication error.

2022-01-05 21:37:08,624 - Communicator - ERROR - Action: client_registration grpc communication error. retry: 1500, First start till now: 0.0013239383697509766 seconds.
2022-01-05 21:37:08,624 - Communicator - ERROR - Could not connect to server: imtl-85545-3:8765 Setting flag for stopping training. failed to connect to all addresses

I listed the listening ports on the server with nmap, and it showed 127.0.1.1:8002, which suggests the server is listening only on localhost and not to other computers. This makes me wonder whether NVFlare currently supports running a realistic scenario or only POC (proof of concept). Please help me solve this problem, thank you.
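A quick way to separate a network problem from an NVFlare problem is to test plain TCP reachability from the client machine. The sketch below uses only the Python standard library; `can_connect` is a hypothetical helper name, and it checks only that a TCP connection can be opened, not that gRPC or TLS works.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example call from a client machine, using the address from the log above:
# can_connect("imtl-85545-3", 8765)
```

If this returns False from the client but True on the server itself, the cause is more likely name resolution, the bind address, or a firewall than the federated learning code.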

NVFlare python version not compatible with Google colab or Google Vertex AI Notebooks

NVFlare requires python 3.8.10 or higher per the pypi page, and both Google Colab and Google Vertex AI Notebooks currently run python 3.7.12 and 3.7.10 respectively. Upgrading these environments is relatively undocumented and complex.

For reference PyTorch 1.10 works with python 3.5 or greater.

Can the dependencies on python 3.8.10 be reduced so that python 3.7 will suffice?

Tenseal dependency for HE is not available on ARM aarch64

The tenseal dependency is not available for the ARM aarch64 platform, causing installation to fail. This has been reported for local development on Mac M1 and will affect other non-x86 architectures (Jetson, Clara AGX, IBM POWER, etc.).

The tenseal dependency is only required when using the HEBuilder module, and it looks like all other functionality could be used without this dependency. Can tenseal be made optional, with the caveat that HE is not available without tenseal?

One option would be providing an alternate install, a requirements-no-tenseal.txt that includes everything but tenseal. For example, I generated this file in a clean venv on my linux machine using:

pip download nvflare -d /tmp -v \
    | grep Collecting \
    | awk '{print $2}' \
    | tr '[:upper:]' '[:lower:]' \
    | grep -v tenseal \
    | tee requirements-no-tenseal.txt

and verified that I can install nvflare and all deps except tenseal by copying to an aarch64 system (in this case a Jetson TX2) with:

python3 -m pip install --no-deps -r requirements-no-tenseal.txt

This is a pretty awkward solution. It would be much cleaner to remove the tenseal dependency in the default packaging, since HE is optional, and note in the docs that tenseal must be installed when using HE.
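The "optional dependency" idea proposed above is commonly implemented with a guarded import. The following is a generic sketch of that pattern, not NVFlare's packaging; `tenseal_available` and `require_tenseal` are hypothetical helper names.

```python
import importlib.util


def tenseal_available():
    """Return True if the tenseal package can be found in this environment."""
    return importlib.util.find_spec("tenseal") is not None


def require_tenseal():
    """Import tenseal on demand, with a clear error when it is missing."""
    if not tenseal_available():
        raise ImportError(
            "tenseal is required for homomorphic encryption but is not installed"
        )
    import tenseal  # imported lazily so other features work without it

    return tenseal
```

With this pattern only the HE code path calls `require_tenseal()`, so everything else can be imported on platforms where tenseal has no wheel.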

Prostate example

Add multi-site prostate example to show monai usage, fedprox algorithm, and non-iid FL scenario

Setup Contributing guide & codeowners

License header in source files is outdated.

The copyright year in the license header needs to include this year, 2022. Therefore, the first line of the header should be updated to include the year 2022.

Setup pre-merge & post-merge CI

Add license headers into .py files inside test folder

The Python code inside the test folder is missing license headers; we need to add them.

Create Learner API & add LearnerExecutor support

Create Learner API & add LearnerExecutor to support "train", "submit_model" and "validate" executor task.

Improve documentation in app_common

The args.log_config is not set up properly

The args.log_config is not set up properly in the worker_process.py

Multiple FL servers on the same machine

When running multiple FL servers on the same machine, even with their individual ports for admin and client communications, the secure grpc communication encounters issues:

E0120 10:25:20.267690287 12242 ssl_transport_security.cc:1468] Handshake failed with fatal error SSL_ERROR_SSL: error:1000007d:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED.

It seems that /tmp/fl_server contains the configuration of only one of the FL servers.

Workflow for automatically building documentation is not working for apidocs

The apidocs are being omitted from being checked in because of the following line in .gitignore:

docs/apidocs/nvflare.*

Since the workflow is automatically using what is checked out from the main branch to run the docs build, the .gitignore is being used and the generated apidocs html files are not checked into the docs branch and thus they do not make it to pages.

Add a TensorBoard metric streaming example

Show how to stream metrics during training to the server and create central tensorboard event files.

Fixed the hardcoded port when running MultiProcessExecutor

Example READMEs

Add example readmes.

Brats example for Monai usage

Add brats example to show Monai usage

Add logging streaming support

Add support for streaming logging data from the client to the server.

Add SCAFFOLD example

Add CIFAR-10 example of "SCAFFOLD: Stochastic Controlled Averaging for Federated Learning" (https://arxiv.org/abs/1910.06378)

Using EventRecorder and log streaming together caused the logging loop

Improve examples

The hello examples are not as refined as the cifar10 example. Improve all examples so they are of the same quality.

Change NVFlare in the text / code to NVIDIA Flare

We need to change NVFlare to NVIDIA Flare ("NVFlare") to be more precise.

poc command is not found

Hi,

When I follow the instructions at https://nvidia.github.io/NVFlare/quickstart.html, after installing nvflare through pip, no poc command is found in the virtualenv. Is there anything wrong or missing?

Deploy command

Hi there,

When I tried to deploy hello-monai with deploy_app hello-monai as stated in the README, I got an error.
It works if I append either server or client to the command:

deploy_app hello-monai server
or
deploy_app hello-monai client

Remove the LocalLogger dependency on LogSender

CIFAR10 run_fl.py misses license header

NVFlare/examples/cifar10/run_fl.py

Line 1 in d784e7b

import argparse

Missing the END_RUN logging in log streaming

When using the LogAnalyticsSender to stream logging data to the server, the END_RUN logging data is not sent to the server.

Document page states NVFlare only compatible with one single Python version

Previously, NVFlare 1.X was compatible with (and ran on) Python 3.8.10 only, because the pip package was released with pyc files only. Those pyc files were compiled by the Python 3.8.10 interpreter and thus had to run in a Python 3.8.10 environment.
In NVFlare 2.x, the pip packages contain source code instead of pyc files. Therefore, the original statement may cause confusion.

Time lag on fed events

The server side fed event runner can handle 10 events per sec. When lots of fed events are coming, it could take too long to process all of them.
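To get a feel for the lag, a back-of-the-envelope estimate can be sketched as below. It assumes a constant processing rate (10 events per second, from the description above) and a constant arrival rate; the backlog sizes in the usage note are hypothetical, and `drain_seconds` is an invented helper name.

```python
import math


def drain_seconds(pending_events, service_rate=10.0, arrival_rate=0.0):
    """Estimate seconds needed to clear a backlog at constant rates."""
    if pending_events < 0 or service_rate <= 0 or arrival_rate < 0:
        raise ValueError("rates and backlog must be non-negative, service_rate > 0")
    if pending_events == 0:
        return 0.0
    if arrival_rate >= service_rate:
        # Events arrive at least as fast as they are handled: never drains.
        return math.inf
    return pending_events / (service_rate - arrival_rate)
```

For example, a hypothetical backlog of 600 events takes `drain_seconds(600)` = 60 seconds with no new arrivals, and `drain_seconds(600, arrival_rate=4.0)` = 100 seconds when 4 new events keep arriving each second.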

Coding style in files in nvflare/private folder

isort and black show issues on some files in nvflare/private.

admin command "sys_info client" error

The admin command "sys_info client" results in an error stack trace:

File "/opt/conda/lib/python3.8/site-packages/nvflare/fuel/hci/server/reg.py", line 104, in process_command
self._do_command(conn, command)
File "/opt/conda/lib/python3.8/site-packages/nvflare/fuel/hci/server/reg.py", line 92, in _do_command
handler(conn, args)
File "/opt/conda/lib/python3.8/site-packages/nvflare/private/fed/server/sys_cmd.py", line 66, in sys_info
self._process_replies(conn, replies)
File "/opt/conda/lib/python3.8/site-packages/nvflare/private/fed/server/sys_cmd.py", line 77, in _process_replies
conn.append_string("Client " + r.client_name)
AttributeError: 'ClientReply' object has no attribute 'client_name'

nvflare does not have version attribute

It's common to set a version attribute on a package to its version information. nvflare currently does not have such an attribute. Requested by users.
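Until a package exposes such an attribute, the installed version can usually be read from the package metadata. This sketch uses the standard library `importlib.metadata` module (Python 3.8 or later); `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Example: installed_version("nvflare") returns a string such as the
# release number when nvflare is installed, and None otherwise.
```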

Use Learner API for examples

NVFLARE now defines a Learner class and a built-in executor that can work with a Learner implementation. Federated deep learning apps should be written as Learners instead of Executors.

Currently all examples use Executors, please change to use Learner API.

Remove the multi_gpu setting for the POC command

No module named 'pt'

Hi there,

I am trying to get the cifar10 example running with Federated Learning.

I followed all the steps mentioned here https://nvidia.github.io/NVFlare/quickstart.html and then uploaded and deployed the app from the admin terminal. When I am trying to start the app, I am getting the following error:

./run_1/app_server/config/config_fed_server.json in JSON element components.#5: No module named 'pt'

However, the run_1 folder of the clients does contain a folder called pt with the specified learners. Do I have to configure the path to the custom folders somewhere?
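As general background, Python reports "No module named 'pt'" when no directory on `sys.path` contains a `pt` package. The sketch below shows that mechanism in isolation; whether and how NVFlare adds an app's custom folder to the path is not stated here, so treat this as an assumption-free illustration of Python imports only. `import_from_custom_dir` is a hypothetical helper.

```python
import importlib
import sys
from pathlib import Path


def import_from_custom_dir(custom_dir, module_name):
    """Make custom_dir importable, then import module_name from it."""
    custom_dir = str(Path(custom_dir).resolve())
    if custom_dir not in sys.path:
        # The parent directory of the package must be on sys.path,
        # not the package directory itself.
        sys.path.insert(0, custom_dir)
    importlib.invalidate_caches()
    return importlib.import_module(module_name)
```

In this picture, the folder that contains `pt/` (not `pt/` itself) has to be on the path of the process that loads the configuration, which in the error above is the server process.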

Fix the looping streaming logging error

Error in pt_file_model_persistor.py

I am using NVFlare version 2.0.6.
However, when I start the app on my system (which includes 4 clients), the server reports an error like this:

2022-01-27 04:48:10,374 - ServerRunner - ERROR - [run=1]: Aborting current RUN due to FATAL_SYSTEM_ERROR received: expect model to be torch.nn.Module but got <class 'dict'>
2022-01-27 04:48:10,374 - ServerRunner - INFO - [run=1]: asked to abort - triggered abort_signal to stop the RUN
2022-01-27 04:48:10,374 - ServerRunner - INFO - [run=1]: starting workflow scatter_gather_ctl (<class 'nvflare.app_common.workflows.scatter_and_gather.ScatterAndGather'>) ...
2022-01-27 04:48:10,374 - ScatterAndGather - INFO - [run=1]: Initializing ScatterAndGather workflow.
2022-01-27 04:48:10,374 - PTFileModelPersistor - ERROR - [run=1]: error getting state_dict from model object
Traceback (most recent call last):
  File "/home/jupyter-test/.conda/envs/fl/lib/python3.8/site-packages/nvflare/app_common/pt/pt_file_model_persistor.py", line 202, in load_model
    data = self.model.state_dict() if self.model is not None else OrderedDict()
AttributeError: 'dict' object has no attribute 'state_dict'
2022-01-27 04:48:10,374 - ServerRunner - ERROR - [run=1]: Aborting current RUN due to FATAL_SYSTEM_ERROR received: cannot create state_dict from model object
2022-01-27 04:48:10,374 - ServerRunner - INFO - [run=1]: asked to abort - triggered abort_signal to stop the RUN
2022-01-27 04:48:10,375 - ServerRunner - INFO - [run=1]: Workflow scatter_gather_ctl (<class 'nvflare.app_common.workflows.scatter_and_gather.ScatterAndGather'>) started
2022-01-27 04:48:10,375 - ScatterAndGather - INFO - [run=1, wf=scatter_gather_ctl]: Beginning ScatterAndGather training phase.
2022-01-27 04:48:10,375 - ScatterAndGather - INFO - [run=1, wf=scatter_gather_ctl]: Abort signal received. Exiting at round 0.
2022-01-27 04:48:10,375 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl]: Workflow: scatter_gather_ctl finalizing ...
2022-01-27 04:48:12,877 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl]: ABOUT_TO_END_RUN fired
2022-01-27 04:48:12,877 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl]: END_RUN fired
2022-01-27 04:48:12,878 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl]: Server runner finished.
2022-01-27 04:48:13,376 - FederatedServer - INFO - Server app stopped.

Please help me resolve this problem, thank you.
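The log shows that the persistor expected an object with a `state_dict()` method but received a plain dict. A duck-typed guard for that situation can be sketched as follows. It is an illustration of the failure mode, not the NVFlare fix, and `to_state_dict` is a hypothetical helper that avoids importing torch so it can run anywhere.

```python
from collections import OrderedDict


def to_state_dict(model):
    """Return weights as an OrderedDict from a module-like object or a dict."""
    if model is None:
        return OrderedDict()
    if isinstance(model, dict):
        # Already a mapping of weights: use it directly.
        return OrderedDict(model)
    state_dict = getattr(model, "state_dict", None)
    if callable(state_dict):
        # Module-like object (for example a torch.nn.Module).
        return OrderedDict(state_dict())
    raise TypeError(
        f"expect a module-like object or dict but got {type(model).__name__}"
    )
```

Independently of any such guard, the message "expect model to be torch.nn.Module but got <class 'dict'>" suggests checking how the model is specified in the server configuration, since what the persistor received was a dict rather than an instantiated model.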