The toolkit consists of:
- Implementations of popular neural network models and of the custom modules used in our company
- Metrics commonly used in CV, such as mIoU, mAP, etc.
- Commonly used datasets and data loaders
The framework is based on PyTorch and utilizes PyTorch Lightning for training pipeline routines.
One of the ways to install TorchOk is to use Docker:
```bash
docker build -t torchok --build-arg SSH_PUBLIC_KEY="<public key>" .
docker run -d --name <username>_torchok --gpus=all -v <path/to/workdir>:/workdir -p <ssh_port>:22 -p <jupyter_port>:8888 -p <tensorboard_port>:6006 torchok
```
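The run command above exposes the container's SSH, Jupyter and TensorBoard ports. As a quick optional check (the container name matches the --name argument above), you can verify that the GPUs are visible inside the container:
```bash
# Check that the container sees the host GPUs
docker exec -it <username>_torchok nvidia-smi
```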
To remove a previous installation of the TorchOk environment, run:
```bash
conda remove --name torchok --all
```
To install TorchOk locally, run:
```bash
conda env create -f environment.yml
```
This will create a new conda environment named torchok with all the dependencies installed.
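After it is created, you can activate the environment and, as a quick optional check, confirm that PyTorch is importable and sees your GPU:
```bash
conda activate torchok
# Optional sanity check: print the PyTorch version and CUDA availability
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```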
Training is configured via YAML configuration files, which each forked project should store inside the configs folder (see configs/cifar10.yml for an example). The configuration supports environment variable substitution, so you can change base directory paths without editing the config file for each environment.
The most common environment variables are listed below (an example of setting them locally follows the list):
- SM_CHANNEL_TRAINING - directory containing all the training data
- SM_OUTPUT_DATA_DIR - directory where the logs for all runs will be stored
- SM_NUM_CPUS - number of CPUs used by the data loaders
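For a local run you can simply export these variables before launching training; the paths below are only illustrative:
```bash
# Illustrative values - point these at your own data and log locations
export SM_CHANNEL_TRAINING=./data/cifar10
export SM_OUTPUT_DATA_DIR=/tmp/logs
export SM_NUM_CPUS=8
```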
Download the CIFAR-10 dataset by running all cells in notebooks/Cifar10.ipynb; the dataset will appear in the data/cifar10 folder.
Then start the training inside the Docker container:
```bash
docker exec -it torchok bash
cd torchok
SM_NUM_CPUS=8 SM_CHANNEL_TRAINING=./data/cifar10 SM_OUTPUT_DATA_DIR=/tmp python train.py --config config/classification_resnet_example.yml
```
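Since the docker run command maps TensorBoard's port 6006, you can also watch the run by pointing TensorBoard at whatever you passed as SM_OUTPUT_DATA_DIR (here /tmp), assuming TensorBoard is available in the environment:
```bash
# Serve TensorBoard on the port mapped by the docker run command (6006)
tensorboard --logdir /tmp --host 0.0.0.0 --port 6006
```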
Start the job using one of the AWS SageMaker instance types. There are two ways to provide data inside your training container:
- Slow: download from an S3 bucket, s3://<bucket-name>/<dirpath>. The volume size needs to be set when you use an S3 bucket; in other cases it can be omitted.
- Fast: FSx access, fsx://<file-system-id>/<mount-name>/<directory>. To create an FSx filesystem, follow these instructions.
Example with S3:
```bash
python run_sagemaker.py --config configs/cifar10.yml --input_path s3://sagemaker-mlflow-main/cifar10 --instance_type ml.g4dn.xlarge --volume_size 5
```
Example with FSx:
```bash
python run_sagemaker.py --input_path fsx://fs-0f79df302dcbd29bd/z6duzbmv/tz_jpg --config configs/siloiz_pairwise_xbm_resnet50_512d.yml --instance_type ml.g4dn.xlarge
```
In case something isn't working inside the SageMaker container, you can debug your model locally. Specify the local_gpu instance type when starting the job:
```bash
python run_sagemaker.py --config configs/cifar10.yml --instance_type local_gpu --volume_size 5 --input_path file://../data/cifar10
```
To run the unit tests inside the Docker container:
```bash
docker exec -it torchok bash
cd torchok
python -m unittest discover -s tests/ -p "test_*.py"
```
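To run only a subset of the tests, narrow the discovery pattern (the file name below is hypothetical):
```bash
# Run a single test file instead of the whole suite; replace with a real file from tests/
python -m unittest discover -s tests/ -p "test_example.py"
```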
For example, a data section that references an environment variable:
```yaml
data:
  dataset_name: ExampleDataset
  common_params:
    data_folder: "${SM_CHANNEL_TRAINING}"
```
is resolved at run time to the concrete path that the variable points to, e.g.:
```yaml
data:
  dataset_name: ExampleDataset
  common_params:
    data_folder: "/path/to/data"
```
The same applies to the logging directory: on SageMaker it typically points to the checkpoint directory provided by the platform, while locally any writable path works, e.g.:
```yaml
log_dir: '/opt/ml/checkpoints'
```
or
```yaml
log_dir: '/tmp/logs'
```
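If you are wondering how the ${...} placeholders can be handled, the sketch below shows one common approach: expanding the variables in the raw file before parsing it. The load_config helper and the use of os.path.expandvars are illustrative assumptions, not necessarily TorchOk's actual implementation:
```python
import os
import yaml

# Illustration only: expand ${VAR} placeholders from the environment before
# parsing the YAML. TorchOk's real mechanism may differ.
def load_config(path: str) -> dict:
    with open(path) as f:
        text = os.path.expandvars(f.read())  # "${SM_CHANNEL_TRAINING}" -> its value
    return yaml.safe_load(text)

cfg = load_config("configs/cifar10.yml")
print(cfg["data"])  # assuming the config has a top-level data section as above
```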
do_restore is a special flag that was designed for training on SageMaker spot instances. With this flag you can debug your model locally while leaving restore_path pointing to a common directory such as /opt/ml/checkpoints, where TorchOk will search for checkpoints.
```yaml
restore_path: '/opt/ml/checkpoints'
do_restore: '${SM_USER_ENTRY_POINT}'
```
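SM_USER_ENTRY_POINT is one of the environment variables that SageMaker sets inside its training containers, so in a local run do_restore presumably resolves to an empty value and the restore step is skipped, while on SageMaker (for instance after a spot interruption) the checkpoints under restore_path are picked up.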
To keep the logs convenient, it is recommended to name your experiment as project_name-developer_name, so that all experiments related to the project are grouped under one tag in MLflow:
```yaml
experiment_name: &experiment_name fips-roman
```
State the key model parameters in mlflow.runName in the logger params:
```yaml
logger:
  logger: mlflow
  experiment_name: *experiment_name
  tags:
    mlflow.runName: "siloiz_contrastive_xbm_resnet50_512d"
  save_dir: "s3://sagemaker-mlflow-main/mlruns"
  secrets_manager:
    region: "eu-west-1"
    mlflow_secret: "acme/mlflow"
```
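The secrets_manager block presumably tells TorchOk which AWS Secrets Manager secret to read the MLflow credentials from; replace region and mlflow_secret with the secret you actually use.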