intertwin-eu / itwinai
Advanced AI workflows for digital twins applications in science.
Home Page: https://itwinai.readthedocs.io
License: MIT License
Tests are needed on the AI component, namely in the ./ai subfolder, where most of the development is going to occur. The other code is less critical and tests can be skipped.
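A minimal sketch of what such a test could look like with pytest; the module and function (`normalize`) are hypothetical placeholders, not actual itwinai APIs:

```python
# Hypothetical pytest-style tests for the ./ai subfolder; the function
# under test (normalize) is a placeholder, not an actual itwinai API.

def normalize(values):
    """Scale a list of numbers into the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def test_normalize_bounds():
    result = normalize([2.0, 4.0, 6.0])
    assert min(result) == 0.0 and max(result) == 1.0

def test_normalize_order_preserved():
    # Relative ordering of the inputs must survive normalization.
    assert normalize([3.0, 1.0, 2.0]) == [1.0, 0.0, 0.5]
```

Running `pytest` in the repository root (or in a GH Actions job) would collect and execute these automatically.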
Agree on the format of docstrings used in Python modules.
Suggestions: https://marketplace.visualstudio.com/items?itemName=njpwerner.autodocstring
Integrate generic TF logging for CMCC use case
use MirroredStrategy
Alternative to MLFlow's runID
Validate Workflow v0 with a toy use case, like MNIST images classification or generation with a GAN model.
It allows modularity and code reuse, providing interTwin use cases with off-the-shelf operations which they can reuse from other use cases, or extend if needed. It is the result of the latest developments in our code base over the last months.
"Quality and uncertainty tracing" module is going to provide the end user with tools to easily evaluate digital twin (DT) models and workflows. Generally speaking, ML validation is carried out by T6.5 (e.g., training v. validation v. test metrics comparison), whereas the "Quality and uncertainty tracing" module provides:
Define interface between AI/ML workflows (T6.5) and "Quality and uncertainty tracing" module. In the case of ML models, the validation may consist of:
A main challenge consists in the validation of a model trained in an online fashion.
Kubeflow is an interesting ML workflow tool which (apparently) provides in one place all the functionalities required by advanced MLOps workflows.
It is important to explore Kubeflow, and develop a simple PoC to better evaluate it.
Let's produce some high-level quick-to-maintain docs to help us explain what this MLOps platform does.
Let's use the wiki for now
This is needed to have a self contained release, in which docs are consistent with code.
Also, links to other ML files should be relative. See here
Tasks:
tests/use-cases/
for the newly integrated use case.

Requirements analysis involves the following steps, roughly:
Write tests with pytest for AI module, testing also GH actions for automatic testing: see these docs
Analyze the viability of openEO to
The current CWL support is limited to running the v0.0 workflow based on the MNIST dataset, without logging or model storage. The cwltool is invoked in the /tmp directory, where it gets executed. If output files or directories are to be stored, they need to be specified in the .cwl file in order for the cwltool to link them to /tmp; i.e., creating directories on-the-fly with e.g. os.mkdir() does not work.
Things to look into:
- /tmp dir where execution happens

Adapt to new pipeline representation for ML workflows
Multi-GPU, data parallel
Integrate generic TF logging for Virgo use case
Integrate generic TF logging for CERN use case
Define how to handle "triggers" received by the workflow composition tool, namely the orchestrator in the DTE core. This is infrastructure dependent.
After analysing requirements from use cases (WP4), DTE core (WP6), and infrastructure (WP5), it is likely that the "common language" is containers (e.g., Docker, Singularity).
Infrastructure may not work at workflow/"pipeline" level, thus this abstraction would be implemented by T6.1. Namely, T6.1 is going to:
- Break workflows into steps, and deploy the steps on the architecture as containers
- Listen for events (e.g., new data is available: dCache + Nifi + OSCAR), and trigger the corresponding workflow (e.g., new training data is available, then trigger ML training).
- Run a workflow step-by-step: once a step (container) has completed, trigger the next one using Kubernetes-like APIs.
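The step-by-step mode above can be sketched with plain subprocesses. This is a minimal illustration under the assumption that each step is a shell command; in production, each command would be a container invocation driven through Kubernetes-like APIs:

```python
import subprocess
import sys

def run_workflow(steps):
    """Execute workflow steps sequentially: a step is triggered only
    after the previous one (its container, in production) exits
    successfully. Here plain commands stand in for container runs."""
    completed = []
    for name, cmd in steps:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
        completed.append(name)
    return completed

# Toy two-step workflow; each "step" is just a Python one-liner.
steps = [
    ("preprocess", [sys.executable, "-c", "print('preprocessing done')"]),
    ("train", [sys.executable, "-c", "print('training done')"]),
]
done = run_workflow(steps)
```

Because `check=True` aborts on a non-zero exit code, a failed step stops the workflow before the next one is triggered.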
Different workflow steps may be run using different (sub-)workflow engines. For instance:
As a result, in the same DT workflow we may use multiple workflow engines, tailored/optimized to the needs of each workflow step. The goal is to give individual tasks the freedom to use their preferred workflow engine, which is often the one most optimized for that task.
Conceptually, T6.1 is developing a workflow manager which operates super partes (i.e., impartially above all engines), thus implementing a "super orchestrator/workflow manager". This high-level orchestrator is agnostic to the operation executed in each node. It may work with the Common Workflow Language.
Below is an example of a toy workflow for predicting fire risk maps from satellite images, involving the following steps:
In this case:
NOTE: In practice, the "super orchestrator" could be implemented by re-using one of the engines required by some task, with reduced maintenance cost. However, it has to support general-purpose workflows, whose steps are deployed as containers.
Design T6.5 arch conceptual diagrams including input from requirements analysis and interfaces with workflow composition and "quality and validation" modules.
Workflow steps are currently Python environments. To integrate with the infrastructure, they must be converted into containers, orchestrated by, e.g., Apache Airflow.
Goal: execute DT workflows on a Kubernetes cluster where each step is deployed as an independent container. Orchestration can be achieved by means of some "advanced" orchestrator (e.g., Airflow), or by executing a DT workflow step-by-step, in a very similar way as is currently done by run-workflow.py. The only difference is that the command is executed in a container, rather than in a Python virtual environment.
For lightning, take into account @..._REGISTRY
decorators:
```python
from pytorch_lightning.utilities.cli import CALLBACK_REGISTRY
from pytorch_lightning.utilities.cli import MODEL_REGISTRY, DATAMODULE_REGISTRY
```
Consider this tool: https://github.com/openstack/stevedore
https://chinghwayu.com/2021/11/how-to-create-a-python-plugin-system-with-stevedore/
Simpler: https://stackoverflow.com/questions/67631/how-can-i-import-a-module-dynamically-given-the-full-path
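The simpler approach from the linked Stack Overflow thread (importing a module dynamically given its full path) relies only on the standard library; a minimal sketch:

```python
import importlib.util
import pathlib
import tempfile

def load_module_from_path(path):
    """Dynamically import a Python module given its full file path,
    using the stdlib importlib machinery (no plugin framework needed)."""
    spec = importlib.util.spec_from_file_location(pathlib.Path(path).stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

# Demo: write a tiny plugin file on disk and load it at runtime.
with tempfile.TemporaryDirectory() as tmp:
    plugin_path = pathlib.Path(tmp) / "my_plugin.py"
    plugin_path.write_text("def greet():\n    return 'hello from plugin'\n")
    plugin = load_module_from_path(str(plugin_path))
    message = plugin.greet()
```

Compared to stevedore, this avoids entry-point registration, at the cost of having to know the plugin file's path up front.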
Implement a model registry using object storage, to unlock the full potential of MLflow using databases.
Define an in-project standard to compile configuration files, to instantiate each step in a digital twin workflow.
An example for AI module:
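One possible shape for such a standard, sketched with a hypothetical `class_path`/`init_args` convention (the schema is an illustrative assumption, not an actual itwinai format; `fractions.Fraction` stands in for a real step class):

```python
import importlib

# Hypothetical configuration for one workflow step; the keys below
# (class_path, init_args) are an illustrative convention only.
step_config = {
    "class_path": "fractions.Fraction",  # stands in for e.g. a trainer class
    "init_args": {"numerator": 3, "denominator": 4},
}

def instantiate_step(config):
    """Instantiate a workflow step from a declarative config: resolve
    the dotted class path, then call the class with the given kwargs."""
    module_name, _, class_name = config["class_path"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**config["init_args"])

step = instantiate_step(step_config)
```

The same resolver would work for any step class reachable on the Python path, so a whole workflow could be declared as a list of such step configs.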
Tasks:
Update the docs under ./docs
folder to reflect the changes in the code base. Add tutorials.
Currently a digital twin workflow is defined using YAML files with a custom format. Migrate to the standard CWL workflow composition and engine.
Prove that Apache Airflow can run workflows written according to the CWL definition.
Goal: execute our CWL workflows on a cluster with Apache Airflow. Apache Airflow should replace the run-workflow.py script in this repository.
To begin with, define a simple CWL workflow of, say, 3 steps and execute it with Apache Airflow (as 3 steps in Airflow as well):
```mermaid
flowchart LR
    a(print 'a')
    b(print 'b')
    c(print 'c')
    a --> b --> c
```
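The three-step flow above could be sketched in CWL roughly as follows. This is a hedged, minimal illustration: the `echo.cwl` tool file and the default-message inputs are assumptions, and note that CWL orders steps by data dependencies, so real a→b→c sequencing would require wiring one step's output into the next step's input.

```yaml
# Minimal sketch of a 3-step CWL workflow matching the flowchart above.
# echo.cwl is an assumed CommandLineTool wrapping 'echo'. Without data
# links between steps, CWL imposes no execution order.
cwlVersion: v1.2
class: Workflow
inputs: []
outputs: []
steps:
  a:
    run: echo.cwl
    in: {message: {default: "a"}}
    out: []
  b:
    run: echo.cwl
    in: {message: {default: "b"}}
    out: []
  c:
    run: echo.cwl
    in: {message: {default: "c"}}
    out: []
```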
It is of high priority to integrate with FZJ and WP5.
Generic logger interface for logging in a TF training loop. Support MLFlow, WandB and Tensorboard.
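One way such a generic interface could look, as a stdlib-only sketch: the method names and the `InMemoryLogger` backend are assumptions, and real MLflow/WandB/TensorBoard backends would implement the same abstract methods by delegating to their respective APIs.

```python
from abc import ABC, abstractmethod

class Logger(ABC):
    """Generic logger interface for a training loop. Concrete backends
    (MLflow, WandB, TensorBoard) would implement these same methods,
    so the training code never depends on a specific tracking tool."""

    @abstractmethod
    def log_metric(self, name: str, value: float, step: int) -> None: ...

    @abstractmethod
    def log_params(self, params: dict) -> None: ...

class InMemoryLogger(Logger):
    """Trivial backend that just records calls; useful for tests."""
    def __init__(self):
        self.metrics, self.params = [], {}
    def log_metric(self, name, value, step):
        self.metrics.append((name, value, step))
    def log_params(self, params):
        self.params.update(params)

# The training loop depends only on the interface, not on a backend:
def train(logger: Logger, epochs: int = 2):
    logger.log_params({"epochs": epochs})
    for epoch in range(epochs):
        loss = 1.0 / (epoch + 1)  # placeholder for a real training step
        logger.log_metric("loss", loss, step=epoch)

logger = InMemoryLogger()
train(logger)
```

Swapping backends then means changing one constructor call, not the training loop; the same pattern applies to the Torch variant of this task.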
Implement TF distributed trainer for CMCC use case:
Implement a minimal working baseline for the AI module, following the requirements defined for itwinai v0.0.
For instance with Jekyll
Integrate CMCC use case into prototype.
Code repository can be found here.
Data is accessible via Google Drive here.
There are multiple tasks to be done:
This is a bit off-topic for T6.5 proper. However, we are interested in analyzing how openEO could be extended to high energy physics (HEP) domains, like the detector simulation use case (T4.2).
For instance, openEO could be used in the pre-processing phase, to convert ROOT file format into HDF5.
This is important because it would allow scaling T4.2 to big data.
As explained here
Write tests with pytest for AI module, testing also GH actions for automatic testing: see these docs
Extend the support to other OSs for quick prototyping
Generic logger interface for logging in a Torch training loop. Support MLFlow, WandB and Tensorboard.
Horovod is a framework for distributed machine learning that supports both TensorFlow and PyTorch, providing a single interface for both. It is therefore of interest to T6.5 and will be investigated. Specific work to be done is:
Write tests with pytest for AI module, testing also GH actions for automatic testing: see these docs
Validate Workflow v0 on HPC infrastructure.
At the moment workflow steps are executed inside conda environments (created using mamba), but HPC systems may not support conda. The advantage of using conda is that it makes it easy to build any Python version, but this solution has to be validated against HPC policies and best practices.
Actions
See also:
Update the core API, if needed, to support distributed training for CERN use case (pytorch lightning and tensorflow)
Adapt to new pipeline representation for ML workflows
Improve over the minimal working baseline for the AI module, following the requirements defined for itwinai v0.1.
Write tests with pytest for AI module, testing also GH actions for automatic testing: see these docs