ML Project Template

This repository contains a template project that can be easily adapted for all kinds of Machine Learning tasks. Typically, solving such a task entails two main phases, research and production, with very different focuses. The template intends to facilitate work on ML projects by guiding practitioners to adopt some best practices.

research: exploratory data analyses, model prototyping and experiments are dumped here in a structured way

production: distilled utils lib, training job and inference service are implemented here

It is recommended to simply clone this repo and customize it to the specific use-case at hand.


Repository Structure

  • research: Scripts and Notebooks for experimentation.
    • develop (Python): Experimental code to try out new ideas and experiments. Use Jupyter notebooks wherever you can. Naming convention: YYYY-MM-DD_userid_short-description. If you cannot use a notebook and have multiple scripts/files for an experiment, create a folder with the same naming convention. Each file should be handled by one person only.
    • deliver (Python): Refactored notebooks that contain valuable insights or results (e.g. visualizations, training runs). Notebooks should be refactored, documented, contain outputs, and use the following naming schema: YYYY-MM-DD_short-description. Notebooks in deliver should not be changed or rerun. If you want to rerun a deliver Notebook, please duplicate it into the develop folder.
    • templates (Python): Refactored notebooks that are reusable for a specific task (e.g. model training, data exploration). Notebooks should be refactored, documented, not contain any output, and use the following naming schema: short-description. If you would like to use a template notebook, duplicate it into the develop folder.
  • production: The production-ready solution(s) composed of libraries, services, and jobs.
    • python-utils-lib (Python): Utility functions that are distilled from the research phase and used across multiple scripts. Should only contain refactored and tested Python scripts/modules. Installable via pip.
    • training-job (Python/Docker): Combines required data exports, preprocessing and training scripts into a Docker container. This makes results reproducible and the production model retrainable in any environment.
    • inference-service (Python/Docker): Docker container that provides the final model prediction capabilities via a REST API.
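
After customizing a clone, it can be handy to confirm that the folders listed above are still in place. The following is a minimal sketch of such a check, not part of the template itself: the function name `missing_components` and the `LAYOUT` mapping are hypothetical and only mirror the folder names from the list.

```python
import tempfile
from pathlib import Path

# Folder names copied from the repository structure above.
LAYOUT = {
    "research": ["develop", "deliver", "templates"],
    "production": ["python-utils-lib", "training-job", "inference-service"],
}


def missing_components(root):
    """Return the expected 'phase/component' folders that do not exist under root."""
    root = Path(root)
    return [
        f"{phase}/{component}"
        for phase, components in LAYOUT.items()
        for component in components
        if not (root / phase / component).is_dir()
    ]


if __name__ == "__main__":
    # Demo on a throwaway directory that contains only research/develop.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "research" / "develop").mkdir(parents=True)
        for entry in missing_components(tmp):
            print("missing:", entry)
```

Run against the root of a customized clone, an empty result means all six folders are present.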

Naming Conventions

Code Artifacts

  • develop notebooks/scripts: YYYY-MM-DD_userid_short-description
  • deliver notebooks/scripts: YYYY-MM-DD_short-description
  • template notebooks/scripts: short-description
  • services: -service suffix
  • jobs: -job suffix
  • libraries: -lib suffix
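
The notebook and script conventions above can be checked mechanically. The sketch below is one possible way to do so and makes assumptions the template does not state: userids and descriptions are taken to be lowercase alphanumerics (hyphens allowed inside the description), names are checked without their file extension, and the helper names `develop_name` and `is_valid` are hypothetical.

```python
import re
from datetime import date

# Assumption: lowercase alphanumeric words joined by single hyphens.
_SLUG = r"[a-z0-9]+(?:-[a-z0-9]+)*"
# Checks the YYYY-MM-DD shape only, not whether the date exists.
_DATE = r"\d{4}-\d{2}-\d{2}"

PATTERNS = {
    "develop": re.compile(rf"{_DATE}_[a-z0-9]+_{_SLUG}"),
    "deliver": re.compile(rf"{_DATE}_{_SLUG}"),
    "templates": re.compile(_SLUG),
}


def develop_name(userid, description, day=None):
    """Build a develop-folder name: YYYY-MM-DD_userid_short-description."""
    day = day or date.today()
    return f"{day.isoformat()}_{userid}_{description}"


def is_valid(folder, stem):
    """Check a file name (without extension) against its folder's convention."""
    return PATTERNS[folder].fullmatch(stem) is not None


if __name__ == "__main__":
    name = develop_name("jdoe", "tfidf-baseline", date(2024, 1, 15))
    print(name, is_valid("develop", name), is_valid("deliver", name))
```

Because underscores only appear as separators, a develop name never passes the deliver check, which makes it easy to spot a notebook that was moved without being renamed.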

Files

<dataset-desc>_<preprocessing-desc>_<training-desc>.<filetype>

Examples:

  • blogs-metadata.csv
  • blogs-metadata_cl-rs_ft-vec.vectors
  • categories2blogs_cl-rs-lm_tfidf-lsvm.model.zip
  • categories2blogs-questions_cl-rs-lm_tfidf-lsvm.model.zip
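
As an illustration, the examples above can be produced by joining identifiers with hyphens and segments with underscores. This is a minimal sketch with a hypothetical helper, `artifact_filename`. It assumes that dataset names never contain underscores and that a training segment only appears together with a preprocessing segment, neither of which the template states explicitly.

```python
def artifact_filename(dataset, preprocessing=(), training=(), filetype=""):
    """Build <dataset-desc>_<preprocessing-desc>_<training-desc>.<filetype>.

    preprocessing and training are sequences of identifiers; filetype
    includes its leading dot (e.g. ".model.zip").
    """
    if "_" in dataset:
        # Assumption: underscores are reserved as segment separators.
        raise ValueError("dataset description must not contain '_'")
    if training and not preprocessing:
        # Assumption: the middle segment cannot be skipped.
        raise ValueError("training identifiers require preprocessing identifiers")
    segments = [dataset]
    if preprocessing:
        segments.append("-".join(preprocessing))
    if training:
        segments.append("-".join(training))
    return "_".join(segments) + filetype


if __name__ == "__main__":
    print(artifact_filename("blogs-metadata", filetype=".csv"))
    print(artifact_filename("blogs-metadata", ["cl", "rs"], ["ft-vec"], ".vectors"))
    print(artifact_filename("categories2blogs", ["cl", "rs", "lm"],
                            ["tfidf", "lsvm"], ".model.zip"))
```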

Name Identifier Descriptions:

| Name | Description |
| --- | --- |
| Dataset Identifiers: | |
| categories2blogs | Dataset containing blogs with the text content, blogs item URI, and connected primary tags. |
| blogs-metadata | Dataset containing all blogs and related metadata (properties). |
| Preprocessing Identifiers: | |
| cl | Default text cleaning (lowercasing, regex cleaning). |
| rs | Remove stopwords. |
| lm | Text lemmatization. |
| Training Identifiers: | |
| ft-vec | Text vectorizer using Fasttext. |
| tfidf | Text vectorizer using TFIDF. |
| lsvm | Classifier using linear SVM. |
| Filetype Identifiers: | |
| .model | Model file. |
| .vectors | Binary vectors file. |
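
Going the other way, a file name can be split back into its identifiers. The sketch below is a hypothetical parser, not part of the template. It assumes the identifier sets listed above, treats everything from the first dot onward as the filetype, and matches the longest known identifier first, because an identifier such as ft-vec contains a hyphen itself.

```python
from dataclasses import dataclass

# Identifiers taken from the descriptions above; extend them per project.
PREPROCESSING_IDS = {"cl", "rs", "lm"}
TRAINING_IDS = {"ft-vec", "tfidf", "lsvm"}


@dataclass(frozen=True)
class ArtifactName:
    dataset: str
    preprocessing: tuple
    training: tuple
    filetype: str


def _split_ids(segment, known):
    """Split a hyphen-joined segment into known identifiers, longest match first."""
    tokens = segment.split("-")
    result = []
    i = 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):
            candidate = "-".join(tokens[i:j])
            if candidate in known:
                result.append(candidate)
                i = j
                break
        else:
            raise ValueError(f"unknown identifier in segment {segment!r}")
    return tuple(result)


def parse_artifact_name(filename):
    """Parse <dataset-desc>_<preprocessing-desc>_<training-desc>.<filetype>."""
    stem, dot, extension = filename.partition(".")
    parts = stem.split("_")
    if not parts[0] or len(parts) > 3:
        raise ValueError(f"unexpected file name layout: {filename!r}")
    preprocessing = _split_ids(parts[1], PREPROCESSING_IDS) if len(parts) > 1 else ()
    training = _split_ids(parts[2], TRAINING_IDS) if len(parts) > 2 else ()
    return ArtifactName(parts[0], preprocessing, training, dot + extension)


if __name__ == "__main__":
    print(parse_artifact_name("categories2blogs_cl-rs-lm_tfidf-lsvm.model.zip"))
```

Raising on unknown identifiers is a deliberate choice in this sketch: it surfaces misspelled or undocumented identifiers early instead of silently accepting them.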

Contributors

  • ben0it8
