
Healthcare-Researcher-Connector Package: Federated Learning tool for bridging the gap between Healthcare providers and researchers

License: MIT License

Languages: Jupyter Notebook 89.74%, Python 10.26%
Topics: federated-learning, bioinformatics, gsoc-2020, pygrid, pysyft, openmined, differential-privacy, multiparty-computation, gtex, covid-19

srijan-gsoc-2020's Introduction

Healthcare-Researcher-Connector (HRC) Package:

A Federated Learning repository for simulating decentralized training for common biomedical use-cases



About

  • Valuable data exists as islands on devices such as mobile phones and personal computers across the globe, protected by strict privacy-preserving laws.
  • Federated Learning provides a clever means of connecting machine learning models to these disjoint pools of data, regardless of their location and, more importantly, without breaching privacy laws.
  • In biomedical research, sharing and use of human biomedical data is also heavily restricted and regulated by multiple laws. Such data-sharing restrictions protect patient privacy, but at the same time they impede the pace of biomedical research, slow down the development of treatments for various diseases and often cost human lives.
  • The COVID-19 pandemic is, unfortunately, a good illustration of how inaccessibility of clinical training data leads to casualties that could otherwise be avoided.
  • This repository is devoted to addressing this issue for the most common biomedical use-cases, like gene expression data.

Intent

  • This is an introductory project for simulating easy-to-deploy Federated Learning on decentralized biomedical datasets.
    • A user can either simulate FL training locally (using localhost), or remotely (on several machines).
    • A user can also compare centralized vs decentralized train metrics.
  • Technology Stack used: PySyft, PyGrid, DVC, Docker and Conda (as used throughout this README).
  • Example Dataset used:
    • GTEx: The Common Fund's Genotype-Tissue Expression (GTEx) Program established a data resource and tissue bank to study the relationship between genetic variants (inherited changes in DNA sequence) and gene expression (how genes are turned on and off) in multiple human tissues and across individuals.

GSoC Blog Post

Installation and Initialization

  • NOTE: All testing has been done on macOS / Linux-based systems
  • Step 1: Install Docker & Docker-Compose, and pull required images from DockerHub
    1. To install Docker, just follow the docker documentation.
    2. To install Docker-Compose, just follow the docker-compose documentation.
    3. Start your docker daemon
    4. Pull grid-node image : docker pull srijanverma44/grid-node:v028
    5. Pull grid-network image : docker pull srijanverma44/grid-network:v028
    • Note that these images are large: grid-node is ~2 GB and grid-network is ~300 MB.
    • NOTE: These images have been taken from OpenMined Stack. Refer PySyft & PyGrid repositories for more details!
  • Step 2: Install dependencies via conda
    1. Install Miniconda, for your operating system, from https://conda.io/miniconda.html
    2. git clone https://github.com/vermasrijan/srijan-gsoc-2020.git
    3. cd srijan-gsoc-2020
    4. conda env create -f environment.yml
    5. conda activate pysyft_v028 (or source activate pysyft_v028 for older versions of conda)
  • Step 3: Download the GTEx V8 dataset
    • Pull samples and expressions data using the following command:
dvc pull
  • The above command will download the GTEx samples + expressions data into the data/gtex directory from a Google Drive remote repository.
  • Initially, you may be prompted to enter a verification code, i.e. you will have to grant DVC access to your Google Drive API.
  • To do so, go to the URL displayed on your CLI, copy the code, paste it into the CLI and press Enter. (For more info, refer 1 & 2)
  • A quick check that the downloaded files load correctly is sketched below.
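The following is a minimal sanity check (not part of the repository's scripts) that the pulled parquet files load; it assumes pandas and a parquet engine such as pyarrow are available in the pysyft_v028 environment:

import pandas as pd

# Paths match those used in the notebooks (see the Notebooks section below).
samples_path = 'data/gtex/v8_samples.parquet'
expressions_path = 'data/gtex/v8_expressions.parquet'

samples = pd.read_parquet(samples_path)          # sample / donor metadata
expressions = pd.read_parquet(expressions_path)  # gene expression matrix
print(samples.shape, expressions.shape)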

Local execution

Usage

  • src/initializer.py is a Python script for initializing either centralized or decentralized training.
  • The script creates a compose YAML file, starts the client/network containers, runs FL or centralized training, and finally stops the running network/node containers. A simplified sketch of this flow is given after the usage listing below.
  1. Make sure your docker daemon is running
  2. Run the following command -
    • python src/initializer.py
Usage: initializer.py [OPTIONS]

Options:
  --samples_path TEXT       Input path for samples
  --expressions_path TEXT   Input for expressions
  --train_type TEXT         Either centralized or decentralized fashion
  --dataset_size INTEGER    Size of data for training
  --split_type TEXT         balanced / unbalanced / iid / non_iid
  --split_size FLOAT        Train / Test Split
  --n_epochs INTEGER        No. of Epochs / Rounds
  --metrics_path TEXT       Path to save metrics
  --model_save_path TEXT    Path to save trained models
  --metrics_file_name TEXT  Custom name for metrics file
  --no_of_clients INTEGER   Clients / Nodes for decentralized training
  --swarm TEXT              Option for switching between docker compose vs docker stack
  --no_cuda TEXT            no_cuda = True means not to use CUDA. Default --> use CPU
  --tags TEXT               Give tags for the data, which is to be sent to the nodes
  --node_start_port TEXT    Start port No. for a node
  --grid_address TEXT       grid address for network
  --grid_port TEXT          grid port for network
  --help                    Show this message and exit.
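The snippet below is a simplified, hypothetical sketch of the container lifecycle that src/initializer.py manages (compose up, train, compose down); the compose file name and the placeholder training step are illustrative and do not reproduce the actual script:

import subprocess

COMPOSE_FILE = 'generated-compose.yml'  # hypothetical name; the real script generates its own file

def up():
    # Start the grid-network and grid-node containers in the background.
    subprocess.run(['docker-compose', '-f', COMPOSE_FILE, 'up', '-d'], check=True)

def down():
    # Stop and remove the network/node containers once training finishes.
    subprocess.run(['docker-compose', '-f', COMPOSE_FILE, 'down'], check=True)

if __name__ == '__main__':
    up()
    try:
        print('containers are up -- federated training would run here')
    finally:
        down()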

Centralized Training

  • Example command:
python src/initializer.py --train_type centralized --dataset_size 17000 --n_epochs 50        
  • Centralized training example output, using 50 epochs:
============================================================
----<DATA PREPROCESSING STARTED..>----
----<STARTED TRAINING IN A centralized FASHION..>----
DATASET SIZE: 17000
Epoch: 0 Training loss: 0.00010540  | Training Accuracy: 0.1666
Epoch: 1 Training loss: 0.00010540  | Training Accuracy: 0.1669
.
.
Epoch: 48 Training loss: 9.3619e-05  | Training Accuracy: 0.4356
Epoch: 49 Training loss: 9.3567e-05  | Training Accuracy: 0.4359
---<SAVING METRICS.....>----
============================================================
OVERALL RUNTIME: 43.217 seconds

DVC Centralized Stage

dvc repro centralized_train

Decentralized Training

  • Example command:
python src/initializer.py --train_type decentralized --dataset_size 17000 --n_epochs 50 --no_of_clients 2     
  • Decentralized training example output, using 50 epochs:
  • Distribution information, such as the total number of samples held by each client, is displayed first.
============================================================
----<DATA PREPROCESSING STARTED..>----
----<STARTED TRAINING IN A decentralized FASHION..>----
DATASET SIZE: 17000
TOTAL CLIENTS: 2
DATAPOINTS WITH EACH CLIENT:
client_h1: 8499 ; Label Count: {0: 1445, 1: 1438, 2: 1429, 3: 1432, 4: 1394, 5: 1361}
client_h2: 8499 ; Label Count: {0: 1388, 1: 1395, 2: 1404, 3: 1401, 4: 1439, 5: 1472}
---<STARTING DOCKER IMAGE>----
====DOCKER STARTED!=======
Go to the following addresses: ['http://0.0.0.0:5000', 'http://0.0.0.0:5000/connected-nodes', 'http://0.0.0.0:5000/search-available-tags', 'http://0.0.0.0:3000', 'http://0.0.0.0:3001']
Press Enter to continue...
-------<USING CPU FOR TRAINING>-------
WORKERS:  ['h1', 'h2']
Train Epoch: 0 | With h2 data |: [8499/16998 (50%)]	    Train Loss: 0.000211 | Train Acc: 0.164
Train Epoch: 0 | With h1 data |: [16998/16998 (100%)]	Train Loss: 0.000211 | Train Acc: 0.192
Train Epoch: 1 | With h2 data |: [8499/16998 (50%)]	    Train Loss: 0.000211 | Train Acc: 0.172
Train Epoch: 1 | With h1 data |: [16998/16998 (100%)]	Train Loss: 0.000211 | Train Acc: 0.229
.
.
Train Epoch: 49 | With h2 data |: [8499/16998 (50%)]	Train Loss: 0.000187 | Train Acc: 0.384
Train Epoch: 49 | With h1 data |: [16998/16998 (100%)]	Train Loss: 0.000187 | Train Acc: 0.389
---<STOPPING DOCKER NODE/NETWORK CONTAINERS>----
381c4f79fb5c
c203c2f6fd62
1d3ccce7f732
---<SAVING METRICS.....>----
============================================================
OVERALL RUNTIME: 380.418 seconds

DVC Decentralized Stage

dvc repro decentralized_train

Metrics

  • NOTE: By default, metrics will be saved in data/metrics directory.
  • You can pass in the --metrics_path <path> flag to change the default directory.

Localhost Example Screenshots

  1. Following is what you may see at http://0.0.0.0:5000
  2. Following is what you may see at http://0.0.0.0:5000/connected-nodes
  3. Following is what you may see at http://0.0.0.0:5000/search-available-tags
  4. Following is what you may see at http://0.0.0.0:3000

Remote Execution

  • Make sure all firewalls are disabled on both the client and server side.
  • Docker-compose will be required in this section.

Server Side

  • docker-compose -f gridnetwork-compose.yml up

Client Side

  • STEP 1: Set the environment variable called NETWORK to <SERVER_IP_ADDRESS>
  • STEP 2: docker-compose -f gridnode-compose.yml up. You can edit this compose file to add more clients, if you'd like.
  • NOTE: Remote execution has not yet been tested properly.
  • In Progress...

Running DVC stages

  • DVC stages are in dvc.yaml file, to run dvc stage just use dvc repro <stage_name>

Notebooks

  • The notebooks in this repository simulate decentralized training using 2 clients.
  • Docker-compose will be required in this section as well!
  • STEP 1: docker-compose -f notebook-docker-compose.yml up
  • STEP 2: conda activate pysyft_v028 (or source activate pysyft_v028 for older versions of conda)
  • STEP 3: Go to the following addresses:
['http://0.0.0.0:5000', 'http://0.0.0.0:5000/connected-nodes', 'http://0.0.0.0:5000/search-available-tags', 'http://0.0.0.0:3000', 'http://0.0.0.0:3001']
  • STEP 4: Initialize jupyter lab
  • STEP 5: Run the data owner notebook: notebooks/data-owner_GTEx.ipynb (a minimal, hypothetical sketch of the data-owner flow is given below)
  • STEP 6: Run the model owner notebook: notebooks/model-owner_GTEx.ipynb
  • STEP 7: STOP Node/Network running containers:
docker rm $(docker stop $(docker ps -a -q --filter ancestor=srijanverma44/grid-network:v028 --format="{{.ID}}"))
docker rm $(docker stop $(docker ps -a -q --filter ancestor=srijanverma44/grid-node:v028 --format="{{.ID}}"))
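For orientation, the snippet below is a heavily simplified sketch of the data-owner pattern (tag a tensor and send it to a running node). The exact grid-client class and import path vary across PySyft 0.2.x releases, so the NodeClient import and the ws:// address are assumptions; notebooks/data-owner_GTEx.ipynb is the authoritative reference for the version pinned in environment.yml:

import torch
import syft as sy
from syft.workers.node_client import NodeClient  # assumed import path; may differ in v0.2.8

hook = sy.TorchHook(torch)

# Connect to one of the grid nodes started by the compose file (assumed address/port).
node = NodeClient(hook, 'ws://0.0.0.0:3000')

# Tag and describe a toy tensor so the model owner can discover it via the
# network's /search-available-tags endpoint, then send it to the node.
toy_expressions = torch.rand(8, 100)
ptr = toy_expressions.tag('#gtex', '#expressions').describe('toy GTEx shard').send(node)
print(ptr)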

NOTE:

  • Notebooks given in this repository have been taken from this branch and have been modified.

Hyperparameter Optimization

python src/tune.py --help
  • In Progress...

Testing

  • Test Centralized training:
dvc repro centralized_test
  • Test Decentralized training:
dvc repro decentralized_test

Known Issues

  1. While creating an environment:
    • While creating an env. on a linux machine, you may get the following error: No space left on device. (refer here)
    • Solution:
      • export TMPDIR=$HOME/tmp (i.e. change /tmp directory location)
      • mkdir -p $TMPDIR
      • source ~/.bashrc , and then run the following command -
      • conda env create -f environment.yml
  2. While training:
    • Some errors while training in a decentralized way:
      • ImportError: sys.meta_path is None, Python is likely shutting down
      • Solution - NOT YET RESOLVED!
  3. Notebooks:
    • Data transmission rate (i.e., sending large tensors to the nodes) may be slow. (refer this)

Tutorials / References

  1. OpenMined Welcome Page, high level organization and projects
  2. OpenMined full stack, well explained
  3. Understanding PyGrid and the use of data-centric FL
  4. OpenMined RoadMap
  5. What is PyGrid demo
  6. Iterative, DVC: Data Version Control - Git for Data & Models (2020) DOI:10.5281/zenodo.012345.
  7. iterative.ai
  8. DVC Tutorials

Project Status

Under Development: Please note that the project is in its early development stage and not all features have been tested yet.

Acknowledgements

  1. I would like to thank all my mentors for taking the time to mentor me and for their invaluable suggestions throughout. I truly appreciate their constant trust and encouragement!

  2. Open Bioinformatics Foundation admins, helpdesk and the whole community

  3. OpenMined Community, for putting together such a beautiful tech stack and for their constant help throughout!

  4. Systems Biology of Aging Group, for providing me with useful resources, for trusting me throughout and for their constant feedback!

  5. Iterative.ai and DVC, for making all of our lives so much easier now :)

  6. GSoC organizers, managers and Google.

srijan-gsoc-2020's People

Contributors

antonkulaga, vermasrijan


srijan-gsoc-2020's Issues

drop app-nope

appnope is macOS-only, so including it in the environment makes the environment unreproducible on other platforms.

dependency hell

You have issues with your dependency configuration: you use outdated versions and put packages into pip that are available on conda-forge. I created a PR to fix it (#6); could you test that it does not break anything?

export with --no-build

Please, do:

conda env export --no-builds > environment.yaml

If you do not specify --no-builds, the export includes machine-specific build information that makes it non-reproducible on many other machines; for instance I had:

ResolvePackageNotFound: 
  - ncurses==6.1=h0a44026_1002
  - sqlite==3.30.1=h93121df_0
  - libsodium==1.0.17=h01d97ff_0
  - appnope==0.1.0=py37hc8dfbb8_1001
  - readline==8.0=hcfe32e1_0
  - zlib==1.2.11=h0b31af3_1006
  - libcxx==10.0.0=h1af66ff_2
  - pyzmq==19.0.1=py37haec44b1_0
  - tk==8.6.10=hbbe82c9_0
  - zeromq==4.3.2=h6de7cb9_2
  - python==3.7.6=cpython_h1fd5dd1_6
  - openssl==1.1.1g=h0b31af3_0
  - libffi==3.2.1=h4a8c4bd_1007
  - xz==5.2.5=h0b31af3_0

ugly 300

  
    def delete_particular_age_examples(self):
        df_series = pd.DataFrame(self.labels["Age"])
        indexes_of_50 = np.where(df_series["Age"] == '50-59')[0].tolist()[300:]
        indexes_of_60 = np.where(df_series["Age"] == '60-69')[0].tolist()[300:]
        indexes_of_20 = np.where(df_series["Age"] == '20-29')[0].tolist()[300:]
        indexes_of_30 = np.where(df_series["Age"] == '30-39')[0].tolist()[300:]
        indexes_of_40 = np.where(df_series["Age"] == '40-49')[0].tolist()[300:]
        indexes_to_delete = indexes_of_50 + indexes_of_60 + indexes_of_20 + indexes_of_30 + indexes_of_40
        
        return indexes_to_delete

What does it mean? You hardcoded 300 there, and I have no idea why this constant has this value or why you want to delete those indexes. (A possible refactor is sketched below.)
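One possible refactor (a sketch, not a patch from this repository; the constant name and function signature are illustrative) names the cap and loops over the age groups, so the intent is explicit and the value is configurable:

import numpy as np
import pandas as pd

MAX_SAMPLES_PER_AGE_GROUP = 300  # hypothetical named constant replacing the magic number
AGE_GROUPS = ['20-29', '30-39', '40-49', '50-59', '60-69']

def indexes_to_downsample(labels, cap=MAX_SAMPLES_PER_AGE_GROUP):
    """Return indexes of samples beyond `cap` in each age group, i.e. the rows
    to drop so every group keeps at most `cap` samples."""
    ages = pd.DataFrame(labels['Age'])
    to_delete = []
    for group in AGE_GROUPS:
        to_delete += np.where(ages['Age'] == group)[0].tolist()[cap:]
    return to_delete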

ugly zip

You currently use a zip with GTEx (i.e. dvc get-url https://www.dropbox.com/s/cmxruuqi26zweeq/gtex.zip?dl=1 data/ -v ); however, if I just blindly follow the instructions things will not work, because it implies the zip file should be unzipped.
Why not use a dvc remote and just dvc repro to do the preprocessing?

crazy fetch bash-script

I see you have a really weird bash script that does:

# Cloning additional dependencies
git clone https://github.com/OpenMined/PyGrid.git
git clone https://github.com/OpenMined/PyGridNode.git
git clone https://github.com/OpenMined/PyGridNetwork.git

Why do you git-clone inside a bash script? If you want to use other git repos, you should use Git Submodules. However, I do not see why it is even needed when you can use the official docker containers.

README.md lacks explanations

In the README you explain how to set things up, but you do not explain what is inside the project (for instance, what GTEx is and what the code does) or how it is intended to be used. A random person who finds this git repo will get lost.

accuracy needs improvements

Accuracy of the centralized model should be similar to the accuracy that Vlada gave to you. If the user wants to compare centralized and decentralized accuracies, there should be clear instructions on how to run each of these two scenarios.

TODO Next (for openmined branch)

  • For older CPUs, tensorflow=1.15.3 may not work. Use tensorflow=1.15.0 OR python 3.6.5 OR install tf from conda package

  • Merge openmined & pysyft yaml to one yaml file

do base path resolution in your notebooks

The way a user runs your notebooks can differ. For instance, I commonly run:

jupyter lab notebooks

At the same time, your notebooks hardcode paths and do not check which base folder they are run from:

samples_path = 'data/gtex/v8_samples.parquet'
expressions_path = 'data/gtex/v8_expressions.parquet'

The problem is that the user may start with notebooks as the base folder, in which case your code will crash miserably.
For this reason, at the beginning of your notebook you should Path(".").resolve() the folder you are in and then use it as the base for your paths. If, for instance, the user used notebooks as the working folder, you can just set base = Path("..").
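A minimal sketch of the suggested fix, placed in the first cell of each notebook (assuming the repository layout described above):

from pathlib import Path

# Resolve the repository root whether the notebook is launched from the repo
# root (e.g. via `jupyter lab notebooks`) or from inside the notebooks/ folder.
here = Path('.').resolve()
base = here.parent if here.name == 'notebooks' else here

samples_path = base / 'data' / 'gtex' / 'v8_samples.parquet'
expressions_path = base / 'data' / 'gtex' / 'v8_expressions.parquet'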

To Do Next:

For Simulations:

  1. Add dataset example
  2. Add Featurize function and use dvc, and split full data into small shards
  3. Update Dockerfile

At Client Side:

  1. Add differential privacy function, at client side, for encrypting the model. (Use PySyft)
  2. Decay needs to be changed, because there is no 'comms_round' now

At Coordinator Side:

  1. 'g1' to be mapped with 'l1' where 'g1' is global model directory and 'l1' has local model trained for 'g1' and 'l1' lies in <c1,c2,c3...>
  2. 'g2' to be mapped with 'l2'
  3. take input as 'g1', and if input = 'g1', then local mod training happens for 'l1', for all clients
  4. Take glob_mod_name as input argument
  5. sampleNo/Total sample - write code below, using metadata of each client
  6. scaling factor is hard coded at the moment
    scaling_factor = weight_scalling_factor(clients_batched, client)
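Item 5 above ('sampleNo / total samples') corresponds to standard FedAvg weighting; below is a minimal sketch of such a scaling-factor computation (the function and argument names are illustrative and do not reproduce the repository's weight_scalling_factor signature):

def scaling_factor(client_sample_counts, client):
    """Weight for a client's local update: its sample count divided by the
    total number of samples across all clients (FedAvg-style weighting)."""
    total = sum(client_sample_counts.values())
    return client_sample_counts[client] / total

# Example with the per-client counts reported during decentralized training:
counts = {'client_h1': 8499, 'client_h2': 8499}
print(scaling_factor(counts, 'client_h1'))  # 0.5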

not explained what python initializer.py is doing

When I run:

python initializer.py

I get some metrics printed, but it is not explained (neither in the README nor in stdout) what exactly is happening.
Also, if you have metrics you should save them, e.g. by making a dvc stage with the metrics as one of the outputs.
