Helsinki Machine Learning Project Template

NOTE: Once you begin your work, rewrite this notebook (index.ipynb) so that it describes your project, and regenerate README by calling `nbdev_build_docs`

About

This is a git repository template for Python-based open source ML and analytics projects.

The template assumes the concept of Notebook Development. This means that you do all the data science work inside notebooks. There is no copy-pasting! We use the nbdev tool to build Python modules and doc pages from the notebooks automatically. This way your code, results and documentation always stay together. Notebooks can be executed with the papermill tool for an automatic, well-documented model update workflow. Handy, isn't it?

The template assumes that you divide your machine learning project into 5 parts:

0. Data - loading & preprocessing
1. Model - Python class code & algorithm development
2. Loss - model training & evaluation
3. Workflow - automatic model update (reproduce steps 0.-2.)
4. API - an interface to interact with a trained model

Each part has its own notebook template that you can follow to plan and do your development.
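As a rough illustration of the Workflow part, the sketch below re-executes the data, model and loss notebooks in order with papermill and stores the executed copies under results/notebooks/. The helper names `output_path_for` and `run_workflow` are hypothetical and not part of the template; the notebook names and the results folder follow the repository layout described in this README. It assumes papermill is installed in your environment.

```python
from pathlib import Path

# Notebooks for steps 0.-2., in execution order (names from the template layout).
NOTEBOOKS = ["00_data.ipynb", "01_model.ipynb", "02_loss.ipynb"]


def output_path_for(notebook, results_dir="results/notebooks"):
    """Return the path where the executed copy of `notebook` is saved."""
    return Path(results_dir) / Path(notebook).name


def run_workflow(parameters=None, results_dir="results/notebooks"):
    """Execute each notebook in order and save the executed copies.

    Hypothetical helper: `parameters` is passed through to papermill unchanged.
    """
    import papermill as pm  # imported lazily; requires papermill to be installed

    Path(results_dir).mkdir(parents=True, exist_ok=True)
    for notebook in NOTEBOOKS:
        pm.execute_notebook(
            notebook,
            str(output_path_for(notebook, results_dir)),
            parameters=parameters or {},
        )
```

Calling `run_workflow()` from the repository root would then leave one executed notebook per step in results/notebooks/, which doubles as a record of the model update.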

In addition, the template comes with a working Dockerfile and .devcontainer for doing your development easily with any device. You can extend these for your needs and for building a runtime container for your machine learning app.

The template is completely open source and environment agnostic. Follow the installation instructions to create a new, independent repository with a clean commit history, but with a copy of all the files and folders presented. The authors of this template will not be contributors to your project, although we are more than happy to hear what you have achieved with it! Also, if you don't like something or know of an improvement, your contribution is very welcome!

Note that updates to the template cannot be automatically pulled into child projects.

The template is developed and maintained by the data and analytics team of the city of Helsinki. The template is published under the Apache-2.0 licence and open source utilization is encouraged!

The core structure of the repository is the following:

## EDITABLE:
data/               # Folder for storing data files. Ignored by git by default.
|- raw_data/        # To store raw data files
|- preprocessed_data/   # To store cleaned data
results/            # Save results here. Ignored by git by default.
|- notebooks/       # Save automatically executed notebooks here
00_data.ipynb       # Extract, transform, load data here & define related functions.
01_model.ipynb      # Create and code test your ML model
02_loss.ipynb       # Train and evaluate ML model, deploy or save for later use
03_workflow.ipynb   # Define ML workflow and parameterization
04_api.ipynb        # Define runtime API for using trained ML model
project-requirements.in    # Add here the Python packages you want to install
update_install_dev_reqs.sh  # run this script to install new python packages
settings.ini        # Project specific settings. Build instructions for lib and docs.
Dockerfile          # Define docker image build instructions
.devcontainer       # Codespaces / VSC dev environment instructions

## AUTOMATICALLY GENERATED: (Do not edit unless otherwise specified!)
docs/               # Project documentation (html)
[your_module]/      # Python module built from the notebooks (follow the installation instructions).
README.md           # The frontpage of your project, generated from index.ipynb
requirements.txt    # dev / default requirements. automatically generated by pip-tools
min-requirements.txt # lighter requirements without dev tools. automatically generated by pip-tools

## STATIC NON-EDITABLE: (Edit only if you know what you're doing!)
base-requirements.in    # core tools that every project built based on the template always requires
requirements.in    # development tools + project specific requirements
LICENSE                 # license information
MANIFEST.in             # metadata for building python distributable
setup.py                # settings for the python module of your project
CODE_OF_CONDUCT.md      # code of conduct. Please review before contributing.

How to install

NOTE: If you are doing a project on personal data for the City of Helsinki, contact the data and analytics team of the city before proceeding any further!

1. On your GitHub homepage

(Create a GitHub account if you do not have one already.)
Sign in to your GitHub homepage.
Go to github.com/City-of-Helsinki/ml_project_template and click the green button that says 'Use this template'.
Give your project a name. Do not use the dash symbol '-', but rather the underscore '_', because the name of the repo will become the name of your Python module.
If you are creating a project for your organization, change the owner of the repo. From the drop-down menu, select your organization's GitHub account (e.g. City-of-Helsinki). You need to be a member of the organization on GitHub.
Define your project visibility (you can change this later, but most likely you want to begin with a private repo).
Click 'Create repository from template'

This will create a new repository for you copying everything from this template, but with clean commit history.
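The naming advice in the steps above (underscore instead of dash) follows from the repository name becoming the name of your Python module, and module names must be valid Python identifiers. A minimal sketch for checking a candidate name before creating the repository; the function name `is_valid_module_name` is hypothetical and not part of the template:

```python
import keyword


def is_valid_module_name(name):
    """Return True if `name` can be used as a Python module name."""
    # A module name must be a valid identifier and must not be a reserved word.
    return name.isidentifier() and not keyword.iskeyword(name)
```

For example, `is_valid_module_name("my_ml_project")` is true, while `is_valid_module_name("my-ml-project")` is false because of the dashes.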

2. Setting up your development environment

a) Recommended: Codespaces

If your organization has Codespaces enabled (requires GitHub Enterprise & Azure subscription), you are now ready to begin development. Just launch the repository in a codespace, and a dev container is automatically set up!

b) Can't use Codespaces: Local installation with Docker

You can build a development environment locally with docker. The recommended way is to use VSC in container development mode (link to instructions).

c) Can't use Docker: Local manual installation

You can also do your development 'the good old way':

Create an SSH key and add it to your github profile (instructions)
Configure your git user name and email address if you haven't done it already: git config --global user.name "Firstname Lastname" && git config --global user.email "[your_email_address]"
Clone your new repository: git clone git@github.com:[repository_owner]/[your_repository]
Go inside the repository folder: cd [your_repository]
Create and activate a virtual environment of your choice. Remember to set the Python version to 3.8! (Instructions: conda, venv)
Install pip-tools: python -m pip install pip-tools
Install requirements: pip-sync requirements.txt
Create an ipython kernel for running the notebooks: python -m ipykernel install --user --name python38myenv
The default development environment contains basic Jupyter, and many IDEs have built-in support for notebooks. If you wish, you can install JupyterLab by uncommenting it in requirements.in and re-running pip-sync. To launch JupyterLab, run jupyter-lab --allow-root --config .devcontainer/jupyter-server-config.py

d) Can't connect to internet: Offline install with Docker

Sometimes you have to work in an environment that cannot be connected to the internet, for example for privacy or cybersecurity reasons. In this case, first install the template and all the packages you expect to need in an environment with internet access, and build the Docker image as in 2b). Then, save the Docker image and transfer it to your offline environment following these instructions.

3. Initializing your project

A few last tweaks before you are good to go:

Edit LICENSE, Makefile, settings.ini, docs/_config.yml and docs/_data/topnav.yml according to your project details. Don't worry - you can continue editing them in the future.
Remove the folder ml_project_template with the command git rm -r ml_project_template. A new folder with the name of your repository will be created automatically when calling nbdev_build_lib.
Recreate the python module: nbdev_build_lib. In the future, repeat this step every time you move between notebooks to ensure your python modules are up to date.
Recreate the html doc pages & README: nbdev_build_docs. In the future, repeat this step every time you push code to ensure your documentation is up to date.
Make initial commit: git add . && git commit -m "initialized repository from City-of-Helsinki/ml_project_template"
Push changes: git push -u origin master

You are now ready to begin your ML project development. Remember to track your changes with git!

How to use

Install this template as basis of your new project (see above).
If you are not working inside a container, remember to activate your virtual environment every time you begin work: conda activate [environment name] with anaconda or source [environment name]/bin/activate with virtualenv.
Develop your ML solution! (Follow the notebooks!)
Save your notebooks and call nbdev_build_lib to build Python modules from your notebooks. This is needed if you want to share code between notebooks or create modules. It exports all notebook cells with the # export tag to the corresponding .py files under the module (the folder inside your repository named after your repository). Do this every time you make changes to any exportable parts of the code.
Save your notebooks and call nbdev_build_docs to create doc pages based on your notebooks (see below). This will convert the notebooks into HTML files under docs/ and update README based on the index.ipynb. If you want to host your project pages on GitHub (like the doc pages of this template), you will have to make your project public and enable github pages in repo > Settings > Pages : set Source to docs/. Alternatively you can build the pages locally with jekyll.
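To make the # export tag mentioned above concrete, here is a sketch of what an exportable notebook cell could look like. The function `list_raw_files` is a hypothetical example and not part of the template; only the # export comment on the first line and the data/raw_data folder come from this README.

```python
# export
from pathlib import Path


def list_raw_files(data_dir="data/raw_data"):
    """Return the sorted paths of regular files directly under `data_dir`."""
    directory = Path(data_dir)
    if not directory.is_dir():
        # A missing folder is treated as "no data yet" rather than an error.
        return []
    return sorted(p for p in directory.iterdir() if p.is_file())
```

After saving the notebook and calling nbdev_build_lib, a cell like this should end up in a .py file under your module folder, so other notebooks can import the function instead of copying it.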

Installing & updating project libraries

Python has a rich and wide ecosystem of libraries to help with machine learning tasks, among other things: Pandas, Matplotlib, SciPy and PyTorch, to name a few. If the base libraries in this template aren't sufficient, you can add more with pip install library. However, the pip command only installs libraries into your local Python environment. To achieve consistent reproducibility, the requirements must be recorded in the project repository. Add new libraries to the project-requirements.in file. When you change this file, remember to run:

pip-compile --generate-hashes --allow-unsafe -o requirements.txt base-requirements.in requirements.in project-requirements.in
pip-compile --generate-hashes --allow-unsafe -o min-requirements.txt base-requirements.in project-requirements.in

These update full requirements for development environments and lighter, more focused requirements for server usage.

After requirements are updated you should run:

pip-sync requirements.txt

This way you and other users will have the same libraries in your Python environments.

NOTE: run `./update_install_dev_reqs.sh` for short - it contains the three above pip commands for updating and installing the requirements!

WARNING: if you don't update package names and versions, the code might not work the next time you or anybody else tries to use this project in another environment. Worse, it might *seem to* work, but do so incorrectly.
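One way to notice a drifted environment early is to compare the pinned versions in requirements.txt with what is actually installed. The sketch below is an illustration only, not part of the template: `parse_pins` and `find_mismatches` are hypothetical helpers, and the parser assumes the `name==version` lines with `--hash=...` continuation lines that pip-compile writes. pip-sync remains the tool that actually fixes the environment.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Extract {package_name: version} from `name==version` lines."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments and option lines such as `--hash=...`.
        if not line or line.startswith(("#", "-")):
            continue
        # Drop environment markers (after `;`) and trailing line continuations.
        tokens = line.split(";")[0].split()
        if not tokens or "==" not in tokens[0]:
            continue
        name, version = tokens[0].split("==", 1)
        # Drop extras such as `[standard]` and normalise case.
        pins[name.split("[")[0].lower()] = version
    return pins


def find_mismatches(pins):
    """Return {name: (pinned, installed_or_None)} for packages that differ."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

Reading requirements.txt, passing its text to `parse_pins` and the result to `find_mismatches` gives an empty dictionary when the environment matches the pins. Anything else is a hint to re-run pip-sync.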

Ethical aspects

Please include ethical considerations in the documentation of your ML application.

For example:

Can you recognize ethical issues with your ML project?
Is there a risk for bias, discrimination, violation of privacy or conflict with the local or global laws?
Could your results or algorithms be misused for malicious acts?
Can data or model updates introduce bias into your model?
How have you tackled these issues in your implementation?
You most certainly make ethical choices in your code. Do you document & highlight them?
If you build an actual application, how can users contribute if they notice an unresolved ethical issue?

How to cite (optional)

If you are doing a research project, you can add BibTeX and other citation templates here. You can also get a DOI for your code by adding it to a code archive, so your code can be cited directly! Most archives also provide repository badges.

To cite this work, use:

@misc{sten2022helsinki,
title = {Helsinki Machine Learning Project Template},
author = {Nuutti A Sten and Jussi Arpalahti},
year = {2022},
howpublished = {City of Helsinki. Available at: \url{https://github.com/City-of-Helsinki/ml_project_template}}
}

Contributing

See CONTRIBUTING.md on how to contribute to the development of this template.

Copyright

Copyright 2022 City-of-Helsinki. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this project's files except in compliance with the License. A copy of the License is provided in the LICENSE file in this repository.

The Helsinki logo is a registered trademark, and may only be used by the city of Helsinki.

NOTE: If you are using this template for other than city of Helsinki projects, remove the files `favicon.ico` and `company_logo.png` from `docs/assets/images/`.


# to remove the Helsinki logo and favicon, run:
git rm docs/assets/images/favicon.ico docs/assets/images/company_logo.png
git commit -m "removed Helsinki logo and favicon"

This template was built using nbdev on top of the fast.ai nbdev_template.
