
itwinai's Introduction

PoC for AI-centric digital twin workflows

GitHub Super-Linter | SQAaaS source code

See the latest version of our docs for a quick overview of this platform for advanced AI/ML workflows in digital twin applications.

If you want to integrate a new use case, you can follow this step-by-step guide.

Installation

Requirements:

  • Linux or macOS environment. Windows has never been tested.

Micromamba installation

To manage Conda environments we use micromamba, a lightweight version of Conda.

It is suggested to refer to the Manual installation guide.

Micromamba can consume a lot of disk space when building environments because downloaded packages are cached on the local filesystem. You can clear the cache with micromamba clean -a. By default, Micromamba data are kept under $HOME. However, on some systems $HOME has limited storage space, so it is better to install Micromamba in another location with more space by changing the $MAMBA_ROOT_PREFIX variable. See a complete installation example for Linux below, where the default $MAMBA_ROOT_PREFIX is overridden:

cd $HOME

# Download micromamba (This command is for Linux Intel (x86_64) systems. Find the right one for your system!)
curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj bin/micromamba

# Install micromamba in a custom directory
MAMBA_ROOT_PREFIX='my-mamba-root'
./bin/micromamba shell init -s bash -p "$MAMBA_ROOT_PREFIX"

# To invoke micromamba from the Makefile, you need to add it explicitly to $PATH
echo 'export PATH="$(dirname $MAMBA_EXE):$PATH"' >> ~/.bashrc

Reference: Micromamba installation guide.

Workflow orchestrator

Install the (custom) orchestrator virtual environment.

source ~/.bashrc
# Create local env
make

# Activate env
micromamba activate ./.venv

To run tests on workflows use:

# Activate env
micromamba activate ./.venv

pytest tests/

Documentation folder

Documentation for this repository is maintained under the ./docs folder. If you are using code from a previous release, you can build the docs webpage locally using these instructions.

Development env setup

Requirements:

  • Linux or macOS environment. Windows has never been tested.
  • Micromamba: see the installation instructions above.
  • VS Code, for development.

Installation:

make dev-env

# Activate env
micromamba activate ./.venv-dev

To run tests on the itwinai package:

# Activate env
micromamba activate ./.venv-dev

pytest tests/ai/

AI environment setup

Requirements:

  • Linux or macOS environment. Windows has never been tested.
  • Micromamba: see the installation instructions above.
  • VS Code, for development.

NOTE: this environment gets automatically set up when a workflow is executed!

However, you can also set it up explicitly with:

make ai-env

# Activate env
micromamba activate ./ai/.venv-pytorch

Updating the environment files

The files under ai/env-files/ fall into two categories:

  • Simple environment definitions, such as pytorch-env.yml and pytorch-env-gpu.yml
  • Lockfiles, such as pytorch-lock.yml and pytorch-gpu-lock.yml, generated by conda-lock.

When you install the ai environment, install it from the lock file!

When the "simple" environment file (e.g., pytorch-env.yml) changes, lock it with conda-lock:

micromamba activate ./.venv

make lock-ai

itwinai's People

Contributors

matbun, dependabot[bot], mrgweep, andrea-manzi, orviz, r-sarma

Stargazers

Iacopo, Estíbaliz Parcero, Eric Wulff, Brian Pondi

Watchers

Sebastian Luna-Valero

itwinai's Issues

Create Horovod Testbed

Horovod is a framework for distributed machine learning that supports both TensorFlow and PyTorch, providing a single interface for both. It is therefore of interest to T6.5 and will be investigated. Specific work to be done:

  1. Implement a minimal working codebase doing distributed learning with Horovod (see the sketch below)
  2. Conduct performance studies comparing Horovod to native TF and PyTorch implementations
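As a starting point, a hedged sketch of what the Horovod testbed could look like with PyTorch is shown below; the model, data, and hyperparameters are placeholders, not part of the itwinai codebase.

# train.py: hedged sketch of distributed training with Horovod + PyTorch.
# Launch with, e.g.: horovodrun -np 4 python train.py
import torch
import horovod.torch as hvd

hvd.init()                                    # initialize Horovod
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())   # pin each process to one GPU

model = torch.nn.Linear(10, 1)                # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Make sure all workers start from the same state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)   # dummy batch
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()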

Integrate CMCC use case

Integrate CMCC use case into prototype.
Code repository can be found here.
Data is accessible via Google Drive here.

There are multiple tasks to be done:

  • check that the code works as provided in the repo
  • split the use case into preprocessing, training, and validation/post-processing parts
  • integrate it by porting to PyTorch or providing TF support

Advanced YAML configuration files as an input for workflow steps

Define an in-project standard for compiling configuration files, used to instantiate each step in a digital twin workflow.

An example for the AI module is attached to the issue as a screenshot.

Tasks:

  • #31 as explained here
  • Include YAML schema validation when loading a YAML conf file from a use case (see the sketch below)
  • Include YAML schema validation in GH actions when pushing YAML conf files to the repo
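For the schema validation task, a hedged sketch of validating a step's YAML configuration on load is given below; it assumes pyyaml and jsonschema are available, and the schema and the training-config.yml file name are purely illustrative, not the actual itwinai schema.

# Hedged sketch: validate a workflow step's YAML config on load (illustrative schema).
import yaml
from jsonschema import validate, ValidationError

STEP_SCHEMA = {
    "type": "object",
    "properties": {
        "step": {"type": "string"},
        "epochs": {"type": "integer", "minimum": 1},
        "batch_size": {"type": "integer", "minimum": 1},
    },
    "required": ["step"],
}

def load_step_config(path: str) -> dict:
    """Load a step configuration and fail early if it violates the schema."""
    with open(path) as f:
        config = yaml.safe_load(f)
    try:
        validate(instance=config, schema=STEP_SCHEMA)
    except ValidationError as err:
        raise SystemExit(f"Invalid configuration '{path}': {err.message}")
    return config

if __name__ == "__main__":
    print(load_step_config("training-config.yml"))    # hypothetical file name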

Integrate Apache Airflow with CWL

Prove that Apache Airflow can run workflows written according to the CWL definition.

Goal: execute our CWL workflows on a cluster with Apache Airflow. Apache Airflow should replace the run-workflow.py script in this repository.

To begin with, define a simple CWL workflow of, say, 3 steps and execute it with Apache Airflow (with 3 steps in Airflow as well); see the DAG sketch after the flowchart.

flowchart LR
  a(print 'a')
  b(print 'b')
  c(print 'c')
  a --> b --> c
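
On the Airflow side, a minimal DAG mirroring the three print steps might look like the sketch below (assuming the Airflow 2.x Python API; the DAG id and commands are illustrative):

# Hedged sketch: three-step Airflow DAG mirroring the CWL toy workflow above.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="cwl_toy_workflow",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,      # only triggered manually
    catchup=False,
) as dag:
    a = BashOperator(task_id="a", bash_command="echo 'a'")
    b = BashOperator(task_id="b", bash_command="echo 'b'")
    c = BashOperator(task_id="c", bash_command="echo 'c'")
    a >> b >> c                  # same linear order as the flowchart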

It is of high priority to integrate with FZJ and WP5.

Integrate CERN use case

Tasks:

  • Adapt CERN use case to PyTorch Lightning starting from this tutorial
  • Integrate the use case into itwinai following this guide
  • Write tests under tests/use-cases/ for the newly integrated use case

Add test with pytest

Tests are needed for the ai component, namely the ./ai subfolder, where most of the development is going to occur. The other code is less critical and its tests can be skipped.
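
As an illustration of the kind of test that could live under tests/ai/, here is a hedged sketch; the normalize function is a placeholder, not an actual itwinai module.

# Hedged sketch of a pytest test for the ai component; 'normalize' is a placeholder.
def normalize(pixels):
    """Toy stand-in for a preprocessing utility in the ai component."""
    return [p / 255.0 for p in pixels]

def test_normalize_range():
    result = normalize([0, 128, 255])
    assert min(result) >= 0.0
    assert max(result) <= 1.0

def test_normalize_empty():
    assert normalize([]) == []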

Pipeline representation for ML workflows (MNIST)

It allows modularity and code reuse, providing interTwin use cases with off-the-shelf operations that they can reuse from other use cases, or extend if needed. It is the result of the latest developments in our code base over the last months.
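
As a purely illustrative sketch (not the actual itwinai API), the kind of composable pipeline abstraction meant here could look like this:

# Illustrative only: NOT the itwinai API, just the composable-pipeline idea.
class Step:
    def run(self, data):
        raise NotImplementedError

class Preprocess(Step):
    def run(self, data):
        return [x / 255.0 for x in data]      # e.g., normalize MNIST pixel values

class Train(Step):
    def run(self, data):
        print(f"training on {len(data)} samples")
        return data

class Pipeline:
    def __init__(self, steps):
        self.steps = steps

    def run(self, data):
        for step in self.steps:               # each step's output feeds the next
            data = step.run(data)
        return data

Pipeline([Preprocess(), Train()]).run([0, 128, 255])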

Define interaction with workflow composition (T6.1)

Goal

Define how to handle "triggers" received by the workflow composition tool, namely the orchestrator in the DTE core. This is infrastructure-dependent.

Alternatives

Background

Containers

After analysing requirements from the use cases (WP4), the DTE core (WP6), and the infrastructure (WP5), it is likely that the "common language" is containers (e.g., Docker, Singularity).
The infrastructure may not work at the workflow/"pipeline" level, so this abstraction would be implemented by T6.1. Namely, T6.1 is going to:
- Break workflows into steps, and deploy the steps on the infrastructure as containers
- Listen for events (e.g., new data is available: dCache + NiFi + OSCAR), and trigger the corresponding workflow (e.g., new training data is available, then trigger ML training)
- Run a workflow step-by-step: once a step (container) has completed, trigger the next one using Kubernetes-like APIs (see the sketch below)
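
A hedged sketch of the step-by-step idea follows: each step is a container image, launched only after the previous one has finished. The image names are placeholders, and the Docker CLI stands in for the Kubernetes-like APIs mentioned above.

# Hedged sketch: run DT workflow steps as containers, one after the other.
import subprocess

steps = ["dt/preprocessing:latest", "dt/training:latest", "dt/inference:latest"]  # placeholder images

for image in steps:
    print(f"Running step {image}")
    # In production this would go through Kubernetes-like APIs instead of the Docker CLI.
    subprocess.run(["docker", "run", "--rm", image], check=True)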


Sub-workflows

Different workflow steps may be run using different (sub-)workflow engines. For instance:

  • Big data pre-processing of satellite images shall be carried out using openEO workflows (e.g., using Spark would be sub-optimal).
  • AI/ML could be carried out by using Kubeflow or Elyra workflows, which are AI-centric.
  • And so on...

As a result, in the same DT workflow we may use multiple workflow engines, each tailored/optimized for the needs of its workflow step. The goal is to give individual tasks the freedom to use their preferred workflow engine, which is often the one most optimized for that task.

Conceptually, T6.1 is developing a workflow manager that works super partes, thus implementing a "super orchestrator/workflow manager". This high-level orchestrator is agnostic to the operation executed in each node. It may work with the Common Workflow Language (CWL).

Below is an example of a toy workflow for predicting fire risk maps from satellite images, consisting of the following steps:

  • Big data pre-processing: transform training and inference (i.e., unseen) satellite images, preparing them for ML workflows. In principle, both training and inference images may be pre-processed in the same way.
  • GAN neural network training: train a ML model to predict fire risk maps, and save it to the models registry.
  • Data fusion and visualization: apply the trained ML model on unseen satellite images and show fire risk predictions to the user (visualization component).

(workflow diagram attached to the issue)

In this case:

  • the "super orchestrator" may be developed by T6.1
  • "Big data pre-processing" may be developed by T6.4
  • "GAN neural net training" may be developed by T6.5
  • "Data fusion and visualization" may be developed by T6.3

NOTE: In practice, the "super orchestrator" could be implemented by re-using one of the engines required by some task, with reduced maintenance cost. However, it has to support general-purpose workflows whose steps are deployed as containers.

Define interaction with "Quality and uncertainty tracing" module (T6.2)

"Quality and uncertainty tracing" module is going to provide the end user with tools to easily evaluate digital twin (DT) models and workflows. Generally speaking, ML validation is carried out by T6.5 (e.g., training v. validation v. test metrics comparison), whereas the "Quality and uncertainty tracing" module provides:

  • more advanced validations of the trained model (e.g., physics-based validations)
  • easy-to-use CI/CD pipelines through the SQAaaS module, which allow validating
    • FAIR data principles
    • source code
    • (micro)services, in a black-box manner

Goal

Define the interface between AI/ML workflows (T6.5) and the "Quality and uncertainty tracing" module. In the case of ML models, the validation may consist of:

  • re-loading a pre-trained model from the models registry and performing advanced validations required by the use case (e.g., physics-based validations), or alternatively, performing black-box tests of an ML deployment (e.g., a NN in a container) using test cases provided by T6.5, without directly interacting with the models registry
  • providing feedback on FAIR-principles compliance of the models registry
  • CI/CD black-box testing of AI/ML workflows, treated as microservices (including ML deployment)
  • CI/CD source code validation

Background

A main challenge is the validation of models trained in an online fashion.

Analyze openEO suitability for high energy physics

This is a bit off-topic for T6.5 proper. However, we are interested in analyzing how openEO could be extended to high energy physics (HEP) domains, like the detector simulation use case (T4.2).

For instance, openEO could be used in the pre-processing phase, to convert the ROOT file format into HDF5.

This is important because it would allow T4.2 to scale to big data.
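
A hedged illustration of the ROOT-to-HDF5 conversion is sketched below using uproot and h5py; the file, tree, and branch names are placeholders, not the actual T4.2 data.

# Hedged sketch: convert a ROOT tree into an HDF5 file (placeholder names).
import uproot
import h5py

with uproot.open("events.root") as f:
    arrays = f["Events"].arrays(library="np")    # read all branches as NumPy arrays

with h5py.File("events.h5", "w") as out:
    for branch, values in arrays.items():
        out.create_dataset(branch, data=values)  # one HDF5 dataset per ROOT branch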

Integrate workflow v0 on HPC infrastructure

Validate Workflow v0 on HPC infrastructure.

At the moment workflow steps are executed inside Conda environments (created using mamba), but HPC systems may not support Conda. The advantage of using Conda is that it makes it easy to build any Python version, but this solution has to be validated against HPC policies and best practices.

Actions

  • Run on HPC system
  • Optimize according to HPC best practices

See also:

High-level documentation

Let's produce some high-level quick-to-maintain docs to help us explain what this MLOps platform does.

  • General architecture / concept: see the wiki
  • UI for use case
  • How to use (e.g., workflow definition)

Let's use the wiki for now

Integrate toy use case

Validate Workflow v0 with a toy use case, like MNIST image classification, or image generation with a GAN model.

  • Write MNIST doc here
  • Split dataset
  • Write inference workflow: see here
  • Add input dataset support to inference
  • Improve LitMNIST
  • Write tests: #18

Develop PoC for Kubeflow ML workflows

Kubeflow is an interesting ML workflow tool which (apparently) provides, in one place, all the functionality required by advanced MLOps workflows.

It is important to explore Kubeflow, and develop a simple PoC to better evaluate it.
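
As a starting point, a hedged PoC sketch using the Kubeflow Pipelines SDK (kfp v2 assumed) is shown below; the component and pipeline names are illustrative.

# Hedged sketch of a minimal Kubeflow Pipelines PoC (kfp v2 SDK assumed).
from kfp import dsl, compiler

@dsl.component
def preprocess() -> str:
    return "preprocessed-data"

@dsl.component
def train(dataset: str):
    print(f"training on {dataset}")

@dsl.pipeline(name="toy-mlops-pipeline")
def toy_pipeline():
    data = preprocess()
    train(dataset=data.output)

if __name__ == "__main__":
    # Compile to a YAML spec that can be uploaded to a Kubeflow instance
    compiler.Compiler().compile(toy_pipeline, "toy_pipeline.yaml")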

Requirements analysis for D6.1

Requirements analysis involves roughly the following steps:

  1. Collect and understand AI/ML requirements from use cases and the EC (project proposal), organising them in a structured way (e.g., a table)
  2. Elaborate high-level requirements, producing new requirements for lower-level infrastructure/software systems

CWL support for logging and storing: mounting working dir in tmp dir

The current CWL support is limited to running workflow v0.0 on the MNIST dataset without logging or model storage. The cwltool is invoked in the /tmp directory, where execution happens. If output files or directories are to be stored, they need to be specified in the .cwl file so that cwltool links them to /tmp; i.e., creating directories on the fly with, e.g., os.mkdir() does not work.

Things to look into:

  • find a way to mount the working dir in the /tmp dir where execution happens
  • otherwise, adjust the workflow and predefine files and directories (suboptimal solution)

Migrate to containerized workflows for MNIST (torch)

Workflow steps currently run inside Python environments. To integrate with the infrastructure, they must be converted into containers, orchestrated by, e.g., Apache Airflow.

Goal: execute DT workflows on a Kubernetes cluster where each step is deployed as an independent container. Orchestration can be achieved by means of some "advanced" orchestrator (e.g., Airflow), or by executing a DT workflow step-by-step in much the same way as run-workflow.py does now. The only difference is that each command is executed in a container rather than in a Python virtual environment.

Related issue: #52
