Data Version Control

Note: This is the latest branch and contains the most up-to-date README, so use this one. The other branches hold intermediate commits, so avoid them.

Fork the dvc template from https://github.com/realpython/data-version-control

Clone the forked repository to your computer with the git clone command:

git clone git@github.com:YourUsername/data-version-control.git

Make sure to replace YourUsername in the above command with your actual GitHub username.

Steps:

cd into the cloned repository folder and initialize DVC with dvc init.
You can also create a new git branch for experimentation.
DagsHub is used for remote storage, so create a repository on DagsHub and follow its DVC storage setup. Register the remote with dvc remote add origin path/to_dagshub.dvc, where origin is the remote name used by the dvc push commands in these steps.
Add the train and val data to DVC using dvc add data/raw/train and dvc add data/raw/val. Two new files, train.dvc and val.dvc, will be created.
The original data is added to .gitignore so it doesn't get pushed to GitHub; only the .dvc files are committed to git.
Push your files to GitHub and the original data files to the DagsHub DVC storage with the following commands.
- git add --all
- git commit -m "First commit with setup and DVC files"
- dvc push -r "origin"
- git push --set-upstream origin
Create a script for preparing the dataset and run it with python src/prepare.py.
Add the prepared files to DVC with dvc add data/prepared/train.csv data/prepared/test.csv, then commit the rest to git with git add --all and git commit -m "Created train and test CSV files".
Train the model with the training script: python src/train.py.
Add the model to DVC using dvc add model/model.joblib.
Add and commit to git with git add --all and git commit -m "Trained random forest classifier".
Run the evaluation script with python src/evaluate.py. A new JSON file will be created under metrics. I got an accuracy of 98%.
Add and commit the JSON file with git add --all and git commit -m "Evaluate the model accuracy".
Push all the changes to GitHub and the DVC remote using git push and dvc push -r "origin".
Tag your commit with git tag -a model -m "RandomForest with accuracy 98%", then push your tags with git push origin --tags.
You can create more branches for other experiments and later merge them into your final branch.
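
The evaluation step above writes a JSON file under metrics. As a minimal sketch of what that part of src/evaluate.py might do, the snippet below computes an accuracy and writes it to metrics/accuracy.json. The function names accuracy and write_metrics and the "accuracy" key are assumptions made for illustration; the actual script in the repository may differ.

```python
import json
import tempfile
from pathlib import Path


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def write_metrics(value, path="metrics/accuracy.json"):
    """Write the accuracy to a JSON file, creating the folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"accuracy": value}))
    return path


if __name__ == "__main__":
    # Toy labels for illustration only; a real script would use the
    # trained model's predictions on the test set.
    score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
    with tempfile.TemporaryDirectory() as tmp:
        out = write_metrics(score, Path(tmp) / "metrics" / "accuracy.json")
        print(out.read_text())
```

Keeping the metric in a small JSON file is what lets it be committed to git alongside the code that produced it.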

Creating reproducible pipelines

Create a new branch and remove the .dvc files, since the pipeline will create them again:

dvc remove data/prepared/train.csv.dvc data/prepared/test.csv.dvc model/model.joblib.dvc

To create a pipeline, the dvc run command only has to be used once per stage. A few arguments to look at before running it:
- The -n switch gives the stage a name.
- The -d switch passes the dependencies to the command.
- The -o switch defines the outputs of the command.
- The -M switch declares a metrics file that DVC does not cache, so it can be tracked by git.
Now run prepare.py as a stage with dvc run: dvc run -n prepare -d src/prepare.py -d data/raw -o data/prepared/train.csv -o data/prepared/test.csv python src/prepare.py
A new dvc.yaml file will be created describing the pipeline.
Similarly, run the training and evaluation stages.
- dvc run -n train -d src/train.py -d data/prepared/train.csv -o model/model.joblib python src/train.py
- dvc run -n evaluate -d src/evaluate.py -d model/model.joblib -M metrics/accuracy.json python src/evaluate.py
Use dvc metrics show to see the metrics.
Now add, commit and push to GitHub and DVC.
Look at dvc.yaml to see the whole pipeline.
To run other experiments you no longer need to call dvc run each time. That is the point of a reproducible pipeline.
Create a new branch and train a different model, such as logistic regression.
Change the model in src/train.py and use dvc status to see which files of the pipeline have changed.
To run the logistic regression experiment, use dvc repro evaluate. This re-runs the training and evaluation stages of the pipeline.
View the metrics again, this time adding the -T flag to show the metrics for all tags: dvc metrics show -T.
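
The three dvc run commands above share one shape: a name, some dependencies, some outputs or metrics, then the command itself. The sketch below builds those argument lists from plain data, which can help keep the stages consistent. The helper dvc_run_args and the STAGES list are hypothetical names introduced here, and the sketch assumes a DVC release that still provides dvc run (newer releases replace it with dvc stage add).

```python
import shlex


def dvc_run_args(name, command, deps=(), outs=(), metrics=()):
    """Build the argument list for one `dvc run` stage."""
    args = ["dvc", "run", "-n", name]
    for dep in deps:
        args += ["-d", dep]
    for out in outs:
        args += ["-o", out]
    for metric in metrics:
        args += ["-M", metric]
    # The stage's own command goes last, split into separate arguments.
    return args + shlex.split(command)


# Stage data copied from the commands in this README.
STAGES = [
    dict(name="prepare", command="python src/prepare.py",
         deps=["src/prepare.py", "data/raw"],
         outs=["data/prepared/train.csv", "data/prepared/test.csv"]),
    dict(name="train", command="python src/train.py",
         deps=["src/train.py", "data/prepared/train.csv"],
         outs=["model/model.joblib"]),
    dict(name="evaluate", command="python src/evaluate.py",
         deps=["src/evaluate.py", "model/model.joblib"],
         metrics=["metrics/accuracy.json"]),
]

if __name__ == "__main__":
    # Print the commands instead of executing them.
    for stage in STAGES:
        print(shlex.join(dvc_run_args(**stage)))
```

To actually create the stages, each list could be passed to subprocess.run(args, check=True) from inside the repository.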

Conclusion

To run multiple experiments, change the necessary files and use dvc repro evaluate to re-run the pipeline.
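
To illustrate why changing only the training script re-runs the training and evaluation stages but not the prepare stage, here is a toy model of the dependency check. It is not DVC's implementation (DVC compares recorded file hashes); it only walks the stages in order and marks a stage stale when one of its dependencies changed. The PIPELINE dictionary and the stages_to_rerun function are hypothetical, and the sketch simplifies by treating every output of a re-run stage as changed.

```python
# Stage definitions mirroring the three dvc run commands in this README.
PIPELINE = {
    "prepare": {
        "deps": {"src/prepare.py", "data/raw"},
        "outs": {"data/prepared/train.csv", "data/prepared/test.csv"},
    },
    "train": {
        "deps": {"src/train.py", "data/prepared/train.csv"},
        "outs": {"model/model.joblib"},
    },
    "evaluate": {
        "deps": {"src/evaluate.py", "model/model.joblib"},
        "outs": {"metrics/accuracy.json"},
    },
}


def stages_to_rerun(pipeline, changed):
    """Return the stage names, in pipeline order, that a change invalidates."""
    dirty = set(changed)
    stale = []
    # Relies on the dict listing stages in dependency order.
    for name, stage in pipeline.items():
        if stage["deps"] & dirty:
            stale.append(name)
            # Simplification: outputs of a re-run stage count as changed.
            dirty |= stage["outs"]
    return stale


if __name__ == "__main__":
    print(stages_to_rerun(PIPELINE, {"src/train.py"}))
```

Editing src/train.py marks train as stale, and because train produces model/model.joblib, evaluate becomes stale as well.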

chiragchauhan4579 / dvc-pipeline
