
dvc_example's Introduction

Using DVC and CML for sharing ML projects across teams

This project is an example of how to share datasets across teams in the AI function at Corsearch.

Requirements:

  • Versioned datasets stored in an accessible remote storage
  • Ability to identify the best-of-breed solution to a given problem

What is this project

This repository contains a sample project using CML with DVC to push/pull data from cloud storage and track model metrics. When a pull request is made in this repository, the following will occur:

- GitHub will deploy a runner machine with a specified CML Docker environment
- DVC will pull data from cloud storage
- The runner will execute a workflow to train an ML model (python train.py)
- A visual CML report about the model performance with DVC metrics will be returned as a comment in the pull request
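
As a rough illustration of the last two steps, the sketch below shows how a training script might save its metrics and render them as a Markdown report for the pull request comment. It is not the train.py from this repository: the function names, the metric names and the values are hypothetical placeholders.

```python
import json
from pathlib import Path


def write_metrics(metrics: dict, path: Path) -> None:
    """Save metrics as JSON so they can be tracked as a metrics file."""
    path.write_text(json.dumps(metrics, indent=2, sort_keys=True))


def build_report(metrics: dict, title: str = "Model performance") -> str:
    """Render numeric metrics as a small Markdown table."""
    lines = [f"## {title}", "", "| Metric | Value |", "| --- | --- |"]
    for name in sorted(metrics):
        lines.append(f"| {name} | {metrics[name]:.4f} |")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not results from this project.
    example = {"accuracy": 0.91, "f1": 0.88}
    print(build_report(example))
```

A real training script would call write_metrics with a path of its choosing (for example a hypothetical metrics.json) and write the report to a file that the workflow then posts as the pull request comment.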

Reproducibility

ML = Data + Code.

DVC works as Git for data, so we have a full lineage of which code and data produced which results. For separate teams to work on the same problems, we need to be able to assess and compare each team's output.

DVC saves the md5 hashes of the data files in Git. An example .dvc file is shown below:

outs:
- md5: 79b2176dd366f3be286780a501207603
  size: 990848
  path: X_train.npy

Because this .dvc file is checked into version control, we know which data and outputs relate to which version of the code.
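
To make the hashing idea concrete, here is a minimal sketch that computes the same three fields (md5, size, path) for a file using only the Python standard library. It assumes a plain md5 over the raw bytes of the file; DVC's own hashing may differ in detail (for example, some versions treat text files specially), so this is an illustration rather than a reimplementation. The helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex md5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_output(path: Path) -> dict:
    """Build a dict with the md5, size and path fields shown in the example above."""
    return {"md5": file_md5(path), "size": path.stat().st_size, "path": path.name}
```

Calling describe_output(Path("X_train.npy")) on a local copy of the data returns a dict that can be compared field by field with the entry stored in Git.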

DVC pipelines record md5 hashes of every part of an ML workflow, including data and code, so we have full reproducibility.
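
The following sketch shows, under simplifying assumptions, why recorded hashes give reproducibility: comparing the stored md5 of each dependency with its current md5 reveals whether any data or code has changed since the results were produced. This is not DVC's implementation; the recorded mapping and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the hex md5 digest of a file's raw bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def changed_dependencies(recorded: dict, root: Path) -> list:
    """Return the recorded paths that are missing or whose md5 no longer matches."""
    changed = []
    for relative_path, old_md5 in sorted(recorded.items()):
        current = root / relative_path
        if not current.is_file() or file_md5(current) != old_md5:
            changed.append(relative_path)
    return changed
```

An empty result means every tracked file still matches its recorded hash, so the stored results correspond to the code and data currently on disk.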

Cloning this project

Note that if you clone this project, you will have to configure your own DVC storage and credentials for the example. We suggest the following procedure:

  1. Fork the repository and clone to your local workstation.
  2. Run python src/prepare_data.py to generate your own copy of the dataset.
  3. Initialise DVC (dvc init) and set up the remote storage (dvc remote add storage <your bucket>, then dvc remote default storage). For GCS-backed storage you will also need to set the project name (dvc remote modify storage projectname incopro-ml-dev, substituting your own project name).
  4. Add your data to DVC and push it to the remote storage (dvc add data/raw, then dvc push).
  5. Run git add, git commit and git push to publish your DVC configuration to GitHub.
  6. Add your storage credentials as repository secrets.
  7. Copy the workflow file .github/workflows/cml.yml from this repository to your fork. By default, workflow files are not copied in forks. When you commit this file to your repository, the first workflow should be initiated.
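
Steps 2 to 4 above can be scripted. The sketch below assembles the commands from the procedure and prints them, only executing them when dry_run is set to False. The helper names, the example bucket URL and the dry-run behaviour are assumptions for illustration; the Git steps, credentials and workflow file are left as manual steps.

```python
import shlex
import subprocess


def setup_commands(bucket_url, gcs_project=None, remote="storage"):
    """Return the commands from steps 2 to 4 as argument lists, in order."""
    commands = [
        ["python", "src/prepare_data.py"],
        ["dvc", "init"],
        ["dvc", "remote", "add", remote, bucket_url],
        ["dvc", "remote", "default", remote],
    ]
    if gcs_project is not None:
        # Only needed for GCS-backed storage.
        commands.append(
            ["dvc", "remote", "modify", remote, "projectname", gcs_project]
        )
    commands += [
        ["dvc", "add", "data/raw"],
        ["dvc", "push"],
    ]
    return commands


def run_all(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for argv in commands:
        print("$ " + shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Placeholder bucket URL; replace with your own storage location.
    run_all(setup_commands("gs://example-bucket/dvc-store"))
```

Keeping the default dry_run=True lets you review the exact commands before anything touches your repository or bucket.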

Setting up CML as a runner for GitHub

dvc_example's People

Contributors

jwgwalton
