
itwinai's Introduction

PoC for AI-centric digital twin workflows

GitHub Super-Linter | SQAaaS source code

See the latest version of our docs for a quick overview of this platform for advanced AI/ML workflows in digital twin applications.

If you want to integrate a new use case, you can follow this step-by-step guide.

Installation

Requirements:

  • Linux or macOS environment. Windows has never been tested.

Micromamba installation

To manage Conda environments we use micromamba, a lightweight version of Conda.

It is suggested to refer to the Manual installation guide.

Micromamba can consume a lot of disk space when building environments because downloaded packages are cached on the local filesystem. You can clear the cache with micromamba clean -a. By default, Micromamba data are kept under $HOME. However, on some systems $HOME has limited storage space, so it is better to install Micromamba in another location with more space by changing the $MAMBA_ROOT_PREFIX variable. See a complete installation example for Linux below, where the default $MAMBA_ROOT_PREFIX is overridden:

cd $HOME

# Download micromamba (This command is for Linux Intel (x86_64) systems. Find the right one for your system!)
curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj bin/micromamba

# Install micromamba in a custom directory
MAMBA_ROOT_PREFIX='my-mamba-root'
./bin/micromamba shell init -s bash -p "$MAMBA_ROOT_PREFIX"

# To invoke micromamba from the Makefile, you need to add it explicitly to $PATH
echo 'export PATH="$(dirname $MAMBA_EXE):$PATH"' >> ~/.bashrc

Reference: Micromamba installation guide.

Workflow orchestrator

Install the (custom) orchestrator virtual environment.

source ~/.bashrc
# Create local env
make

# Activate env
micromamba activate ./.venv

To run tests on workflows use:

# Activate env
micromamba activate ./.venv

pytest tests/

Documentation folder

Documentation for this repository is maintained under the ./docs folder. If you are using code from a previous release, you can build the docs webpage locally using these instructions.

Development env setup

Requirements:

  • Linux or macOS environment. Windows has never been tested.
  • Micromamba: see the installation instructions above.
  • VS Code, for development.

Installation:

make dev-env

# Activate env
micromamba activate ./.venv-dev

To run tests on the itwinai package:

# Activate env
micromamba activate ./.venv-dev

pytest tests/ai/

AI environment setup

Requirements:

  • Linux or macOS environment. Windows has never been tested.
  • Micromamba: see the installation instructions above.
  • VS Code, for development.

NOTE: this environment gets automatically set up when a workflow is executed!

However, you can also set it up explicitly with:

make ai-env

# Activate env
micromamba activate ./ai/.venv-pytorch

Updating the environment files

The files under ai/env-files/ fall into two categories:

  • Simple environment definitions, such as pytorch-env.yml and pytorch-env-gpu.yml
  • Lockfiles, such as pytorch-lock.yml and pytorch-gpu-lock.yml, generated by conda-lock.

When you install the ai environment, install it from the lock file!

When the "simple" environment file (e.g., pytorch-env.yml) changes, lock it with conda-lock:

micromamba activate ./.venv

make lock-ai

itwinai's People

Contributors

matbun, dependabot[bot], mrgweep, andrea-manzi, orviz, r-sarma

Stargazers

Iacopo, Estíbaliz Parcero, Eric Wulff, Brian Pondi

Watchers

Sebastian Luna-Valero

itwinai's Issues

Create Horovod Testbed

Horovod is a framework for distributed machine learning that supports both TensorFlow and PyTorch, providing a single interface for both. It is therefore of interest to T6.5 and will be investigated. Specific work to be done:

  1. Implement a minimal working codebase doing distributed learning with Horovod (see the sketch below)
  2. Conduct performance studies comparing Horovod to native TF and PyTorch implementations
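As a starting point, a hedged sketch of what the Horovod testbed could look like with PyTorch is shown below; the model, data, and hyperparameters are placeholders, not part of the itwinai codebase.

# train.py: hedged sketch of distributed training with Horovod + PyTorch.
# Launch with, e.g.: horovodrun -np 4 python train.py
import torch
import horovod.torch as hvd

hvd.init()                                    # initialize Horovod
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())   # pin each process to one GPU

model = torch.nn.Linear(10, 1)                # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Make sure all workers start from the same state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)   # dummy batch
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()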

Integrate CMCC use case

Integrate CMCC use case into prototype.
Code repository can be found here.
Data is accessible via Google Drive here.

There are multiple tasks to be done:

  • check that the code works as provided in the repo
  • split the use case into preprocessing, training, and validation/post-processing parts
  • integrate it by porting to PyTorch or providing TF support

Advanced YAML configuration files as an input for workflow steps

Define an in-project standard for compiling configuration files, used to instantiate each step in a digital twin workflow.

An example for the AI module is attached to the issue as a screenshot.

Tasks:

  • #31 as explained here
  • Include YAML schema validation when loading a YAML conf file from a use case (see the sketch below)
  • Include YAML schema validation in GH actions when pushing YAML conf files to the repo
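For the schema validation task, a hedged sketch of validating a step's YAML configuration on load is given below; it assumes pyyaml and jsonschema are available, and the schema and the training-config.yml file name are purely illustrative, not the actual itwinai schema.

# Hedged sketch: validate a workflow step's YAML config on load (illustrative schema).
import yaml
from jsonschema import validate, ValidationError

STEP_SCHEMA = {
    "type": "object",
    "properties": {
        "step": {"type": "string"},
        "epochs": {"type": "integer", "minimum": 1},
        "batch_size": {"type": "integer", "minimum": 1},
    },
    "required": ["step"],
}

def load_step_config(path: str) -> dict:
    """Load a step configuration and fail early if it violates the schema."""
    with open(path) as f:
        config = yaml.safe_load(f)
    try:
        validate(instance=config, schema=STEP_SCHEMA)
    except ValidationError as err:
        raise SystemExit(f"Invalid configuration '{path}': {err.message}")
    return config

if __name__ == "__main__":
    print(load_step_config("training-config.yml"))    # hypothetical file name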

Integrate Apache Airflow with CWL

Prove that Apache Airflow can run workflows written according to the CWL definition.

Goal: execute our CWL workflows on a cluster with Apache Airflow. Apache Airflow should replace the run-workflow.py script in this repository.

To begin with, define a simple CWL workflow of, say, 3 steps and execute it with Apache Airflow (with 3 steps in Airflow as well); see the DAG sketch after the flowchart.

flowchart LR
  a(print 'a')
  b(print 'b')
  c(print 'c')
  a --> b --> c
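
On the Airflow side, a minimal DAG mirroring the three print steps might look like the sketch below (assuming the Airflow 2.x Python API; the DAG id and commands are illustrative):

# Hedged sketch: three-step Airflow DAG mirroring the CWL toy workflow above.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="cwl_toy_workflow",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,      # only triggered manually
    catchup=False,
) as dag:
    a = BashOperator(task_id="a", bash_command="echo 'a'")
    b = BashOperator(task_id="b", bash_command="echo 'b'")
    c = BashOperator(task_id="c", bash_command="echo 'c'")
    a >> b >> c                  # same linear order as the flowchart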

It is of high priority to integrate with FZJ and WP5.

Integrate CERN use case

Tasks:

  • Adapt CERN use case to PyTorch Lightning starting from this tutorial
  • Integrate the use case into itwinai following this guide
  • Write tests under tests/use-cases/ for the newly integrated use case

Add test with pytest

Tests are needed for the ai component, namely the ./ai subfolder, where most of the development is going to occur. The other code is less critical and its tests can be skipped.
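
As an illustration of the kind of test that could live under tests/ai/, here is a hedged sketch; the normalize function is a placeholder, not an actual itwinai module.

# Hedged sketch of a pytest test for the ai component; 'normalize' is a placeholder.
def normalize(pixels):
    """Toy stand-in for a preprocessing utility in the ai component."""
    return [p / 255.0 for p in pixels]

def test_normalize_range():
    result = normalize([0, 128, 255])
    assert min(result) >= 0.0
    assert max(result) <= 1.0

def test_normalize_empty():
    assert normalize([]) == []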

Pipeline representation for ML workflows (MNIST)

It allows modularity and code reuse, providing interTwin use cases with off-the-shelf operations that they can reuse from other use cases, or extend if needed. It is the result of the latest developments in our code base over the last months.
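
As a purely illustrative sketch (not the actual itwinai API), the kind of composable pipeline abstraction meant here could look like this:

# Illustrative only: NOT the itwinai API, just the composable-pipeline idea.
class Step:
    def run(self, data):
        raise NotImplementedError

class Preprocess(Step):
    def run(self, data):
        return [x / 255.0 for x in data]      # e.g., normalize MNIST pixel values

class Train(Step):
    def run(self, data):
        print(f"training on {len(data)} samples")
        return data

class Pipeline:
    def __init__(self, steps):
        self.steps = steps

    def run(self, data):
        for step in self.steps:               # each step's output feeds the next
            data = step.run(data)
        return data

Pipeline([Preprocess(), Train()]).run([0, 128, 255])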

Define interaction with workflow composition (T6.1)

Goal

Define how to handle "triggers" received by the workflow composition tool, namely the orchestrator in the DTE core. This is infrastructure-dependent.

Alternatives

Background

Containers

After analysing requirements from the use cases (WP4), the DTE core (WP6), and the infrastructure (WP5), it is likely that the "common language" is containers (e.g., Docker, Singularity).
The infrastructure may not work at the workflow/"pipeline" level, so this abstraction would be implemented by T6.1. Namely, T6.1 is going to:
- Break workflows into steps, and deploy the steps on the infrastructure as containers
- Listen for events (e.g., new data is available: dCache + NiFi + OSCAR), and trigger the corresponding workflow (e.g., new training data is available, then trigger ML training)
- Run a workflow step-by-step: once a step (container) has completed, trigger the next one using Kubernetes-like APIs (see the sketch below)
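
A hedged sketch of the step-by-step idea follows: each step is a container image, launched only after the previous one has finished. The image names are placeholders, and the Docker CLI stands in for the Kubernetes-like APIs mentioned above.

# Hedged sketch: run DT workflow steps as containers, one after the other.
import subprocess

steps = ["dt/preprocessing:latest", "dt/training:latest", "dt/inference:latest"]  # placeholder images

for image in steps:
    print(f"Running step {image}")
    # In production this would go through Kubernetes-like APIs instead of the Docker CLI.
    subprocess.run(["docker", "run", "--rm", image], check=True)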


Sub-workflows

Different workflow steps may be run using different (sub-)workflow engines. For instance:

  • Big data pre-processing of satellite images shall be carried out using openEO workflows (e.g., using Spark would be sub-optimal).
  • AI/ML could be carried out by using Kubeflow or Elyra workflows, which are AI-centric.
  • And so on...

As a result, in the same DT workflow we may use multiple workflow engines, each tailored/optimized for the needs of its workflow step. The goal is to give individual tasks the freedom to use their preferred workflow engine, which is often the one most optimized for that task.

Conceptually, T6.1 is developing a workflow manager that works super partes, thus implementing a "super orchestrator/workflow manager". This high-level orchestrator is agnostic to the operation executed in each node. It may work with the Common Workflow Language (CWL).

Below is an example of a toy workflow for predicting fire risk maps from satellite images, consisting of the following steps:

  • Big data pre-processing: transform training and inference (i.e., unseen) satellite images, preparing them for ML workflows. In principle, both training and inference images may be pre-processed in the same way.
  • GAN neural network training: train a ML model to predict fire risk maps, and save it to the models registry.
  • Data fusion and visualization: apply the trained ML model on unseen satellite images and show fire risk predictions to the user (visualization component).

(workflow diagram attached to the issue)

In this case:

  • the "super orchestrator" may be developed by T6.1
  • "Big data pre-processing" may be developed by T6.4
  • "GAN neural net training" may be developed by T6.5
  • "Data fusion and visualization" may be developed by T6.3

NOTE: In practice, the "super orchestrator" could be implemented by re-using one of the engines required by some task, with reduced maintenance cost. However, it has to support general-purpose workflows whose steps are deployed as containers.

Define interaction with "Quality and uncertainty tracing" module (T6.2)

"Quality and uncertainty tracing" module is going to provide the end user with tools to easily evaluate digital twin (DT) models and workflows. Generally speaking, ML validation is carried out by T6.5 (e.g., training v. validation v. test metrics comparison), whereas the "Quality and uncertainty tracing" module provides:

  • more advanced validations of the trained model (e.g., physics-based validations)
  • easy-to-use CI/CD pipelines through the SQAaaS module, which allow validating
    • FAIR data principles
    • source code
    • (micro)services, in a black-box manner

Goal

Define the interface between AI/ML workflows (T6.5) and the "Quality and uncertainty tracing" module. In the case of ML models, the validation may consist of:

  • re-loading a pre-trained model from the models registry and performing advanced validations required by the use case (e.g., physics-based validations), or alternatively, performing black-box tests of an ML deployment (e.g., a NN in a container) using test cases provided by T6.5, without directly interacting with the models registry
  • providing feedback on FAIR-principles compliance of the models registry
  • CI/CD black-box testing of AI/ML workflows, treated as microservices (including ML deployment)
  • CI/CD source code validation

Background

A main challenge is the validation of models trained in an online fashion.

Analyze openEO suitability for high energy physics

This is a bit off-topic for T6.5 proper. However, we are interested in analyzing how openEO could be extended to high energy physics (HEP) domains, like the detector simulation use case (T4.2).

For instance, openEO could be used in the pre-processing phase, to convert the ROOT file format into HDF5.

This is important because it would allow T4.2 to scale to big data.
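
A hedged illustration of the ROOT-to-HDF5 conversion is sketched below using uproot and h5py; the file, tree, and branch names are placeholders, not the actual T4.2 data.

# Hedged sketch: convert a ROOT tree into an HDF5 file (placeholder names).
import uproot
import h5py

with uproot.open("events.root") as f:
    arrays = f["Events"].arrays(library="np")    # read all branches as NumPy arrays

with h5py.File("events.h5", "w") as out:
    for branch, values in arrays.items():
        out.create_dataset(branch, data=values)  # one HDF5 dataset per ROOT branch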

Integrate workflow v0 on HPC infrastructure

Validate Workflow v0 on HPC infrastructure.

At the moment workflow steps are executed inside Conda environments (created using mamba), but HPC systems may not support Conda. The advantage of using Conda is that it makes it easy to build any Python version, but this solution has to be validated against HPC policies and best practices.

Actions

  • Run on HPC system
  • Optimize according to HPC best practices

See also:

High-level documentation

Let's produce some high-level quick-to-maintain docs to help us explain what this MLOps platform does.

  • General architecture / concept: see the wiki
  • UI for use case
  • How to use (e.g., workflow definition)

Let's use the wiki for now

Integrate toy use case

Validate Workflow v0 with a toy use case, like MNIST image classification, or image generation with a GAN model.

  • Write MNIST doc here
  • Split dataset
  • Write inference workflow: see here
  • Add input dataset support to inference
  • Improve LitMNIST
  • Write tests: #18

Develop PoC for Kubeflow ML workflows

Kubeflow is an interesting ML workflow tool which (apparently) provides, in one place, all the functionality required by advanced MLOps workflows.

It is important to explore Kubeflow, and develop a simple PoC to better evaluate it.
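
As a starting point, a hedged PoC sketch using the Kubeflow Pipelines SDK (kfp v2 assumed) is shown below; the component and pipeline names are illustrative.

# Hedged sketch of a minimal Kubeflow Pipelines PoC (kfp v2 SDK assumed).
from kfp import dsl, compiler

@dsl.component
def preprocess() -> str:
    return "preprocessed-data"

@dsl.component
def train(dataset: str):
    print(f"training on {dataset}")

@dsl.pipeline(name="toy-mlops-pipeline")
def toy_pipeline():
    data = preprocess()
    train(dataset=data.output)

if __name__ == "__main__":
    # Compile to a YAML spec that can be uploaded to a Kubeflow instance
    compiler.Compiler().compile(toy_pipeline, "toy_pipeline.yaml")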

Requirements analysis for D6.1

Requirements analysis involves roughly the following steps:

  1. Collect and understand AI/ML requirements from use cases and the EC (project proposal), organising them in a structured way (e.g., a table)
  2. Elaborate high-level requirements, producing new requirements for lower-level infrastructure/software systems

CWL support for logging and storing: mounting working dir in tmp dir

The current CWL support is limited to running workflow v0.0 on the MNIST dataset without logging or model storage. The cwltool is invoked in the /tmp directory, where execution happens. If output files or directories are to be stored, they need to be specified in the .cwl file so that cwltool links them to /tmp; i.e., creating directories on the fly with, e.g., os.mkdir() does not work.

Things to look into:

  • find a way to mount the working dir in the /tmp dir where execution happens
  • otherwise, adjust the workflow and predefine files and directories (suboptimal solution)

Migrate to containerized workflows for MNIST (torch)

Workflow steps currently run inside Python environments. To integrate with the infrastructure, they must be converted into containers, orchestrated by, e.g., Apache Airflow.

Goal: execute DT workflows on a Kubernetes cluster where each step is deployed as an independent container. Orchestration can be achieved by means of some "advanced" orchestrator (e.g., Airflow), or by executing a DT workflow step-by-step in much the same way as run-workflow.py does now. The only difference is that each command is executed in a container rather than in a Python virtual environment.

Related issue: #52
