strawberry_hack's Introduction

TorchOk

The toolkit for fast Deep Learning experiments in Computer Vision

What is it?

The toolkit consists of:

  • Implementations of popular neural network models and custom modules used in our company
  • Metrics used in CV, such as mIoU, mAP, etc.
  • Commonly used datasets and data loaders

The framework is based on PyTorch and utilizes PyTorch Lightning for training pipeline routines.

Installation

Docker

One of the ways to install TorchOk is to use Docker:

docker build -t torchok --build-arg SSH_PUBLIC_KEY="<public key>" .
docker run -d --name <username>_torchok --gpus=all -v <path/to/workdir>:/workdir -p <ssh_port>:22 -p <jupyter_port>:8888 -p <tensorboard_port>:6006 torchok

Conda

To remove a previous installation of the TorchOk environment, run:

conda remove --name torchok --all

To install TorchOk locally, run:

conda env create -f environment.yml

This will create a new conda environment torchok with all dependencies.

Getting started

Training is configured by YAML configuration files, which each forked project should store inside the configs folder (see configs/cifar10.yml for an example). The configuration supports environment variable substitution, so you can change base directory paths without editing the config file for each environment. The most common environment variables are:

  • SM_CHANNEL_TRAINING: directory with all training data
  • SM_OUTPUT_DATA_DIR: directory where logs for all runs will be stored
  • SM_NUM_CPUS: number of CPUs used by the data loader
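
To illustrate what environment variable substitution means for a config, here is a minimal sketch that expands ${NAME} placeholders in config text before it is parsed. The function name substitute_env and the choice to raise an error on an unset variable are assumptions made for this example; TorchOk's own loader may behave differently.

```python
import os
import re

# Matches ${NAME} placeholders, the form used in the config examples.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_env(text, env=None):
    """Replace every ${NAME} in text with its value from env.

    Assumption: an unset variable is treated as an error in this sketch;
    the toolkit itself may handle that case differently.
    """
    env = os.environ if env is None else env

    def _lookup(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(_lookup, text)


# Hypothetical config fragment using the variables listed above.
config_text = (
    'data_folder: "${SM_CHANNEL_TRAINING}"\n'
    'log_dir: "${SM_OUTPUT_DATA_DIR}"\n'
)
resolved = substitute_env(
    config_text,
    env={"SM_CHANNEL_TRAINING": "./data/cifar10", "SM_OUTPUT_DATA_DIR": "/tmp"},
)
print(resolved)
```

Passing env explicitly keeps the example independent of the current shell; omit it to read from the real environment.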

Start training locally

Download the CIFAR10 dataset by running all cells in notebooks/Cifar10.ipynb; the dataset will appear in the data/cifar10 folder.

docker exec -it torchok bash
cd torchok
SM_NUM_CPUS=8 SM_CHANNEL_TRAINING=./data/cifar10 SM_OUTPUT_DATA_DIR=/tmp python train.py --config config/classification_resnet_example.yml

Start SageMaker Training Jobs

Start the job using one of the AWS SageMaker instances. There are two ways to provide data to your training container:

  • S3 bucket (slow, the data is downloaded): s3://<bucket-name>/<dirpath>. The volume size must be set when you use an S3 bucket; in other cases it can be omitted.
  • FSx (fast access): fsx://<file-system-id>/<mount-name>/<directory>. To create an FSx filesystem, follow these instructions
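
The two input path forms can be told apart by their URL scheme. The sketch below splits an input path into the components named in the list above; describe_input_path and the returned dictionary keys are hypothetical and are not taken from run_sagemaker.py.

```python
from urllib.parse import urlparse


def describe_input_path(input_path):
    """Split an s3:// or fsx:// input path into its named components.

    Hypothetical helper for illustration only; the real launcher script
    may parse its --input_path argument differently.
    """
    parsed = urlparse(input_path)
    parts = [p for p in parsed.path.split("/") if p]

    if parsed.scheme == "s3":
        return {
            "source": "s3",
            "bucket": parsed.netloc,
            "dirpath": "/".join(parts),
            # The volume size must be set when data comes from S3.
            "needs_volume_size": True,
        }

    if parsed.scheme == "fsx":
        if len(parts) < 2:
            raise ValueError(
                "expected fsx://<file-system-id>/<mount-name>/<directory>"
            )
        return {
            "source": "fsx",
            "file_system_id": parsed.netloc,
            "mount_name": parts[0],
            "directory": "/".join(parts[1:]),
            "needs_volume_size": False,
        }

    raise ValueError(f"unsupported input path scheme: {parsed.scheme!r}")


print(describe_input_path("s3://sagemaker-mlflow-main/cifar10"))
print(describe_input_path("fsx://fs-0f79df302dcbd29bd/z6duzbmv/tz_jpg"))
```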

Example with S3:

python run_sagemaker.py --config configs/cifar10.yml --input_path s3://sagemaker-mlflow-main/cifar10 --instance_type ml.g4dn.xlarge --volume_size 5

Example with FSx:

python run_sagemaker.py --input_path fsx://fs-0f79df302dcbd29bd/z6duzbmv/tz_jpg --config configs/siloiz_pairwise_xbm_resnet50_512d.yml --instance_type ml.g4dn.xlarge

If something isn't working inside the SageMaker container, you can debug your model locally. Specify the local_gpu instance type when starting the job:

python run_sagemaker.py --config configs/cifar10.yml --instance_type local_gpu --volume_size 5 --input_path file://../data/cifar10

Run tests

docker exec -it torchok bash
cd torchok
python -m unittest discover -s tests/ -p "test_*.py"

Differences in configs: SageMaker vs local machine

1. Path to data folder

sagemaker

data:
  dataset_name: ExampleDataset
  common_params:
    data_folder: "${SM_CHANNEL_TRAINING}"

local machine

data:
  dataset_name: ExampleDataset
  common_params:
    data_folder: "/path/to/data"

2. Path to artifacts dir

sagemaker

log_dir: '/opt/ml/checkpoints'

local machine

log_dir: '/tmp/logs'

3. Restore path

do_restore is a special flag designed for training on SageMaker spot instances. With this flag you can debug your model locally while leaving restore_path pointing to a common directory such as /opt/ml/checkpoints, where TorchOk will search for checkpoints.

sagemaker

restore_path: '/opt/ml/checkpoints'
do_restore: '${SM_USER_ENTRY_POINT}'

local machine

restore_path: '/opt/ml/checkpoints'
do_restore: '${SM_USER_ENTRY_POINT}'
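
One way to read the identical settings above: restoring is switched on only where SM_USER_ENTRY_POINT is set, which SageMaker does inside its training containers. The sketch below follows that reading. It assumes that a non-empty do_restore value enables restoring and that checkpoints are files matching *.ckpt (the PyTorch Lightning default extension). The helper names are hypothetical and TorchOk's actual logic may differ.

```python
import os
from pathlib import Path


def should_restore(do_restore):
    """Assumption: any non-empty resolved value means 'restore is enabled'."""
    return bool(do_restore and do_restore.strip())


def find_latest_checkpoint(restore_path, pattern="*.ckpt"):
    """Return the most recently modified checkpoint in restore_path, or None."""
    directory = Path(restore_path)
    if not directory.is_dir():
        return None
    candidates = sorted(directory.glob(pattern), key=lambda p: p.stat().st_mtime)
    return candidates[-1] if candidates else None


def resolve_checkpoint(restore_path, do_restore):
    """Search restore_path only when the do_restore flag is set."""
    if not should_restore(do_restore):
        return None
    return find_latest_checkpoint(restore_path)


# Inside a SageMaker container the variable is set; on a local machine it
# is usually absent, so no checkpoint is searched for.
do_restore = os.environ.get("SM_USER_ENTRY_POINT", "")
print(resolve_checkpoint("/opt/ml/checkpoints", do_restore))
```

Under these assumptions the same config file can be used unchanged in both environments, and a restarted spot instance picks up the newest checkpoint.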

Mlflow

For more convenient logs, it is recommended to name your experiment project_name-developer_name, so that all your experiments related to this project appear under one tag in MLflow:

experiment_name: &experiment_name fips-roman

State all the model parameters in mlflow.runName in the logger params:

logger:
  logger: mlflow
  experiment_name: *experiment_name
  tags:
    mlflow.runName: "siloiz_contrastive_xbm_resnet50_512d"
  save_dir: "s3://sagemaker-mlflow-main/mlruns"
  secrets_manager:
    region: "eu-west-1"
    mlflow_secret: "acme/mlflow"
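
The naming conventions above can be kept consistent with two small helpers. This is only a sketch: the function names are hypothetical, and treating the example run name as a list of parameters joined by underscores is a guess from the example, not a documented format.

```python
def experiment_name(project_name, developer_name):
    """Build an experiment name in the project_name-developer_name form."""
    return f"{project_name}-{developer_name}"


def run_name(*parameters):
    """Join model parameters into a single mlflow.runName value.

    Assumption: parameters are joined with underscores, as in the
    example run name shown in the logger config.
    """
    return "_".join(str(p) for p in parameters if str(p))


print(experiment_name("fips", "roman"))
print(run_name("siloiz", "contrastive", "xbm", "resnet50", "512d"))
```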
