Note: This is the latest branch. It contains the most recently updated README, so refer to this one. The other branches hold intermediate commits, so avoid them.
Fork the DVC template from https://github.com/realpython/data-version-control
Clone the forked repository to your computer with the `git clone` command:
`git clone git@github.com:YourUsername/data-version-control.git`
Make sure to replace `YourUsername` in the above command with your actual GitHub username.
- `cd` into the cloned DVC project folder and initialize DVC using `dvc init`.
- You can also create a new git branch for experimentation.
- DagsHub is used for remote storage, so create a repository on DagsHub and follow its DVC storage setup. Register the remote with `dvc remote add origin path/to_dagshub.dvc`, where `path/to_dagshub.dvc` stands for the DVC remote URL of your DagsHub repository and `origin` is the remote name used by the `dvc push` commands in this README.
- Add the train and val data to DVC using `dvc add data/raw/train` and `dvc add data/raw/val`. Two new files, `train.dvc` and `val.dvc`, will be created.
- The original data will be added to `.gitignore` so that it does not get pushed to GitHub; only the `.dvc` files are committed to GitHub.
- Push your files to GitHub and the original data files to the DagsHub DVC storage with the following commands:
  - `git add --all`
  - `git commit -m "First commit with setup and DVC files"`
  - `dvc push -r "origin"`
  - `git push --set-upstream origin`
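To see why only the small `.dvc` files need to go to GitHub, here is a minimal Python sketch of the idea behind such a pointer file: the data is identified by a content hash, and only that hash plus a path is recorded. This is a simplified illustration, not DVC's actual implementation. The function names `file_md5` and `make_pointer` and the exact fields of the record are assumptions made for the example.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_pointer(path):
    """Build a small record that identifies the data without containing it.

    The field layout is hypothetical and only mimics the spirit of a .dvc file.
    """
    return {
        "path": os.path.basename(path),
        "md5": file_md5(path),
        "size": os.path.getsize(path),
    }


if __name__ == "__main__":
    # Write a throwaway file so the example does not depend on real data.
    with tempfile.TemporaryDirectory() as tmp:
        data_file = os.path.join(tmp, "train.csv")
        with open(data_file, "wb") as fh:
            fh.write(b"filename,label\n")
        print(make_pointer(data_file))
```

The record stays a few bytes long no matter how large the data file is, which is why it is cheap to keep in git while the data itself lives in remote storage.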
- Create a script for preparing the dataset and run it with `python src/prepare.py`.
- Add the prepared files to DVC and commit the rest to GitHub using `dvc add data/prepared/train.csv data/prepared/test.csv`, `git add --all` and `git commit -m "Created train and test CSV files"`.
- Train the model with the training script: `python src/train.py`.
- Add the model to DVC using `dvc add model/model.joblib`.
- Add and commit to GitHub: `git add --all` and `git commit -m "Trained random forest classifier"`.
- Run the evaluation script using `python src/evaluate.py`. A new JSON file will be created under `metrics`. I got an accuracy of 98%.
- Add and commit the JSON file to GitHub: `git add --all` and `git commit -m "Evaluate the model accuracy"`.
- Push all the changes to GitHub and DVC using `git push` and `dvc push -r "origin"`.
- Tag your commit with `git tag -a model -m "RandomForest with accuracy 98%"`, then push your tags with `git push origin --tags`.
- You can create further branches for other experiments and then merge them into your final branch.
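The evaluation step above leaves a JSON file under `metrics`. As a rough sketch of how a script could record its result, assume the file is `metrics/accuracy.json` (the name used by the evaluate pipeline stage in this README) and that it holds a single `accuracy` key. The helper names and the toy labels are hypothetical and are not taken from the real `src/evaluate.py`.

```python
import json
import os
import tempfile
from pathlib import Path


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of empty inputs")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def write_metrics(value, path="metrics/accuracy.json"):
    """Write the accuracy to a JSON file, creating the folder if needed."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps({"accuracy": value}))
    return out


if __name__ == "__main__":
    # Toy labels for illustration only; a real script would use model predictions.
    score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
    with tempfile.TemporaryDirectory() as tmp:
        target = write_metrics(score, os.path.join(tmp, "metrics", "accuracy.json"))
        print(target.read_text())
```

Keeping the metric in a small JSON file is what lets it be committed to git and compared between commits.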
- Create a new branch and remove the `.dvc` files, as these will be created again by the pipeline:
`dvc remove data/prepared/train.csv.dvc data/prepared/test.csv.dvc model/model.joblib.dvc`
- To create a pipeline, the `dvc run` command has to be used once for each stage. A few arguments to look at before running it:
  - The `-n` switch gives the stage a name.
  - The `-d` switch passes the dependencies to the command.
  - The `-o` switch defines the outputs of the command.
  - The `-M` switch defines the metrics file of the command.
- Now run `prepare.py` with `dvc run`:
  - `dvc run -n prepare -d src/prepare.py -d data/raw -o data/prepared/train.csv -o data/prepared/test.csv python src/prepare.py`
- A new `dvc.yaml` file will be created describing the pipeline.
- Similarly, run the training and evaluation stages:
  - `dvc run -n train -d src/train.py -d data/prepared/train.csv -o model/model.joblib python src/train.py`
  - `dvc run -n evaluate -d src/evaluate.py -d model/model.joblib -M metrics/accuracy.json python src/evaluate.py`
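The three `dvc run` calls above all follow the same pattern of switches. The Python sketch below assembles such a command line from a stage name, dependencies, outputs and metrics, which may make the role of `-n`, `-d`, `-o` and `-M` easier to see. The helper `dvc_run_command` is hypothetical and only builds the argument list; it does not call DVC. Newer DVC releases favour `dvc stage add` over `dvc run`, but the sketch sticks to the command used in this README.

```python
import shlex


def dvc_run_command(name, deps, outs=(), metrics=(), command=""):
    """Assemble a `dvc run` invocation as a list of arguments."""
    args = ["dvc", "run", "-n", name]
    for dep in deps:
        args += ["-d", dep]
    for out in outs:
        args += ["-o", out]
    for metric in metrics:
        args += ["-M", metric]
    # The command to execute comes last, split into separate words.
    args += shlex.split(command)
    return args


if __name__ == "__main__":
    prepare = dvc_run_command(
        "prepare",
        deps=["src/prepare.py", "data/raw"],
        outs=["data/prepared/train.csv", "data/prepared/test.csv"],
        command="python src/prepare.py",
    )
    # shlex.join requires Python 3.8 or newer.
    print(shlex.join(prepare))
```

Returning a list rather than a string means the result could be handed to `subprocess.run` without shell quoting issues, if you ever wanted to script the stage creation.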
- Use `dvc metrics show` to see the metrics.
- Now add, commit and push to GitHub and DVC.
- Look at `dvc.yaml` to see the whole pipeline.
- If you want to run other experiments, you don't need to call `dvc run` every time. That is what a reproducible pipeline is all about.
- Create a new branch and train a new model, such as logistic regression.
- Change the model in the training script (`src/train.py`) and use `dvc status` to see which files of the pipeline have changed.
- To run the logistic regression version, use `dvc repro evaluate`. This re-runs the training and evaluation stages of the pipeline.
- Now look at the metrics again, this time adding the `-T` flag to show the metrics for every tagged commit: `dvc metrics show -T`.
So, to run multiple experiments, you can just make changes to the necessary files and use `dvc repro evaluate` to run the pipeline.
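To illustrate why changing only the training script re-runs the training and evaluation stages but not the preparation stage, here is a simplified Python model of the pipeline defined in this README. It is a sketch, not how DVC is implemented: it assumes the stages are listed in execution order and that a re-run stage always changes its outputs. The names `PIPELINE` and `stages_to_rerun` are hypothetical.

```python
# Each stage lists its dependencies and outputs, in execution order.
PIPELINE = {
    "prepare": {
        "deps": ["src/prepare.py", "data/raw"],
        "outs": ["data/prepared/train.csv", "data/prepared/test.csv"],
    },
    "train": {
        "deps": ["src/train.py", "data/prepared/train.csv"],
        "outs": ["model/model.joblib"],
    },
    "evaluate": {
        "deps": ["src/evaluate.py", "model/model.joblib"],
        "outs": ["metrics/accuracy.json"],
    },
}


def stages_to_rerun(pipeline, changed):
    """Return the stages, in order, that are affected by the changed paths."""
    dirty = set(changed)
    rerun = []
    for name, stage in pipeline.items():  # dicts keep insertion order
        if dirty.intersection(stage["deps"]):
            rerun.append(name)
            # Outputs of a re-run stage count as changed for later stages.
            dirty.update(stage["outs"])
    return rerun


if __name__ == "__main__":
    # Swapping the model only touches the training script.
    print(stages_to_rerun(PIPELINE, {"src/train.py"}))
```

With only `src/train.py` changed, the model file is regenerated, which in turn makes the evaluation stage stale, while the prepared CSV files are left alone.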