intertwin-eu / itwinai
Advanced AI workflows for digital twins applications in science.
Home Page: https://itwinai.readthedocs.io
License: MIT License
Tests are needed on the AI component, namely in the ./ai subfolder, where most of the development is going to occur. The other code is less critical and tests can be skipped.
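A minimal sketch of what such a test could look like with pytest; the module and function (`normalize`) are hypothetical placeholders, not actual itwinai APIs:

```python
# Hypothetical pytest-style tests for the ./ai subfolder; the function
# under test (normalize) is a placeholder, not an actual itwinai API.

def normalize(values):
    """Scale a list of numbers into the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def test_normalize_bounds():
    result = normalize([2.0, 4.0, 6.0])
    assert min(result) == 0.0 and max(result) == 1.0

def test_normalize_order_preserved():
    # Relative ordering of the inputs must survive normalization.
    assert normalize([3.0, 1.0, 2.0]) == [1.0, 0.0, 0.5]
```

Running `pytest` in the repository root (or in a GH Actions job) would collect and execute these automatically.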
Agree on the format of docstrings used in Python modules.
Suggestions: https://marketplace.visualstudio.com/items?itemName=njpwerner.autodocstring
Integrate generic TF logging for CMCC use case
use MirroredStrategy
Alternative to MLFlow's runID
Validate Workflow v0 with a toy use case, like MNIST images classification or generation with a GAN model.
It allows modularity and code reuse, providing interTwin use cases with off-the-shelf operations which they can reuse from other use cases, or extend if needed. It is the result of the latest developments in our code base over the last months.
"Quality and uncertainty tracing" module is going to provide the end user with tools to easily evaluate digital twin (DT) models and workflows. Generally speaking, ML validation is carried out by T6.5 (e.g., training v. validation v. test metrics comparison), whereas the "Quality and uncertainty tracing" module provides:
Define interface between AI/ML workflows (T6.5) and "Quality and uncertainty tracing" module. In the case of ML models, the validation may consist of:
A main challenge consists in the validation of a model trained in an online fashion.
Kubeflow is an interesting ML workflow tool which (apparently) provides in one place all the functionalities required by advanced MLOps workflows.
It is important to explore Kubeflow, and develop a simple PoC to better evaluate it.
Let's produce some high-level quick-to-maintain docs to help us explain what this MLOps platform does.
Let's use the wiki for now
This is needed to have a self contained release, in which docs are consistent with code.
Also, links to other ML files should be relative. See here
Tasks:
tests/use-cases/
for the newly integrated use case.

Requirements analysis involves the following steps, roughly:
Write tests with pytest for AI module, testing also GH actions for automatic testing: see these docs
Analyze the viability of openEO to
The current CWL support is limited to running the v0.0 workflow based on the MNIST dataset, without logging or model storage. The cwltool is invoked in the /tmp directory, where it gets executed. If output files or directories are to be stored, they need to be specified in the .cwl file in order for the cwltool to link them to /tmp; i.e., creating directories on-the-fly with e.g. os.mkdir() does not work.
Things to look into:
- /tmp dir where execution happens

Adapt to new pipeline representation for ML workflows
Multi-GPU, data parallel
Integrate generic TF logging for Virgo use case
Integrate generic TF logging for CERN use case
Define how to handle "triggers" received by the workflow composition tool, namely the orchestrator in the DTE core. This is infrastructure dependent.
After analysing requirements from use cases (WP4), DTE core (WP6), and infrastructure (WP5), it is likely that the "common language" is containers (e.g., Docker, Singularity).
Infrastructure may not work at workflow/"pipeline" level, thus this abstraction would be implemented by T6.1. Namely, T6.1 is going to:
- Break workflows into steps, and deploy the steps on the architecture as containers
- Listen for events (e.g., new data is available: dCache + Nifi + OSCAR), and trigger the corresponding workflow (e.g., new training data is available, then trigger ML training).
- Run a workflow step-by-step: once a step (container) has completed, trigger the next one using Kubernetes-like APIs.
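The step-by-step mode above can be sketched with plain subprocesses. This is a minimal illustration under the assumption that each step is a shell command; in production, each command would be a container invocation driven through Kubernetes-like APIs:

```python
import subprocess
import sys

def run_workflow(steps):
    """Execute workflow steps sequentially: a step is triggered only
    after the previous one (its container, in production) exits
    successfully. Here plain commands stand in for container runs."""
    completed = []
    for name, cmd in steps:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
        completed.append(name)
    return completed

# Toy two-step workflow; each "step" is just a Python one-liner.
steps = [
    ("preprocess", [sys.executable, "-c", "print('preprocessing done')"]),
    ("train", [sys.executable, "-c", "print('training done')"]),
]
done = run_workflow(steps)
```

Because `check=True` aborts on a non-zero exit code, a failed step stops the workflow before the next one is triggered.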
Different workflow steps may be run using different (sub-)workflow engines. For instance:
As a result, in the same DT workflow we may use multiple workflow engines, tailored/optimized to the needs of each workflow step. The goal is to give individual tasks the freedom to use their preferred workflow engine, which is often the one most optimized for that task.
Conceptually, T6.1 is developing a workflow manager which operates super partes (i.e., impartially above all engines), thus implementing a "super orchestrator/workflow manager". This high-level orchestrator is agnostic to the operation executed in each node. It may work with the Common Workflow Language.
Below is an example of a toy workflow for predicting fire risk maps from satellite images, involving the following steps:
In this case:
NOTE: In practice, the "super orchestrator" could be implemented by re-using one of the engines required by some task, with reduced maintenance cost. However, it has to support general-purpose workflows, whose steps are deployed as containers.
Design T6.5 arch conceptual diagrams including input from requirements analysis and interfaces with workflow composition and "quality and validation" modules.
Workflow steps are currently Python environments. To integrate with the infrastructure, they must be converted into containers, orchestrated by, e.g., Apache Airflow.
Goal: execute DT workflows on a Kubernetes cluster where each step is deployed as an independent container. Orchestration can be achieved by means of some "advanced" orchestrator (e.g., Airflow), or by executing a DT workflow step-by-step, in a very similar way as is currently done by run-workflow.py. The only difference is that the command is executed in a container, rather than in a Python virtual environment.
For lightning, take into account @..._REGISTRY
decorators:
```python
from pytorch_lightning.utilities.cli import CALLBACK_REGISTRY
from pytorch_lightning.utilities.cli import MODEL_REGISTRY, DATAMODULE_REGISTRY
```
Consider this tool: https://github.com/openstack/stevedore
https://chinghwayu.com/2021/11/how-to-create-a-python-plugin-system-with-stevedore/
Simpler: https://stackoverflow.com/questions/67631/how-can-i-import-a-module-dynamically-given-the-full-path
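The simpler approach from the linked Stack Overflow thread (importing a module dynamically given its full path) relies only on the standard library; a minimal sketch:

```python
import importlib.util
import pathlib
import tempfile

def load_module_from_path(path):
    """Dynamically import a Python module given its full file path,
    using the stdlib importlib machinery (no plugin framework needed)."""
    spec = importlib.util.spec_from_file_location(pathlib.Path(path).stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

# Demo: write a tiny plugin file on disk and load it at runtime.
with tempfile.TemporaryDirectory() as tmp:
    plugin_path = pathlib.Path(tmp) / "my_plugin.py"
    plugin_path.write_text("def greet():\n    return 'hello from plugin'\n")
    plugin = load_module_from_path(str(plugin_path))
    message = plugin.greet()
```

Compared to stevedore, this avoids entry-point registration, at the cost of having to know the plugin file's path up front.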
Implement a model registry using object storage, to unlock the full potential of MLflow using databases.
Define an in-project standard to compile configuration files, to instantiate each step in a digital twin workflow.
An example for AI module:
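One possible shape for such a standard, sketched with a hypothetical `class_path`/`init_args` convention (the schema is an illustrative assumption, not an actual itwinai format; `fractions.Fraction` stands in for a real step class):

```python
import importlib

# Hypothetical configuration for one workflow step; the keys below
# (class_path, init_args) are an illustrative convention only.
step_config = {
    "class_path": "fractions.Fraction",  # stands in for e.g. a trainer class
    "init_args": {"numerator": 3, "denominator": 4},
}

def instantiate_step(config):
    """Instantiate a workflow step from a declarative config: resolve
    the dotted class path, then call the class with the given kwargs."""
    module_name, _, class_name = config["class_path"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**config["init_args"])

step = instantiate_step(step_config)
```

The same resolver would work for any step class reachable on the Python path, so a whole workflow could be declared as a list of such step configs.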
Tasks:
Update the docs under ./docs
folder to reflect the changes in the code base. Add tutorials.
Currently a digital twin workflow is defined using YAML files with a custom format. Migrate to the standard CWL workflow composition and engine.
Prove that Apache Airflow can run workflows written according to the CWL definition.
Goal: execute our CWL workflows on a cluster with Apache Airflow. Apache Airflow should replace the run-workflow.py script in this repository.
To begin with, define a simple CWL workflow of, say, 3 steps and execute it with Apache Airflow (as 3 steps in Airflow as well):
```mermaid
flowchart LR
    a(print 'a')
    b(print 'b')
    c(print 'c')
    a --> b --> c
```
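The three-step flow above could be sketched in CWL roughly as follows. This is a hedged, minimal illustration: the `echo.cwl` tool file and the default-message inputs are assumptions, and note that CWL orders steps by data dependencies, so real a→b→c sequencing would require wiring one step's output into the next step's input.

```yaml
# Minimal sketch of a 3-step CWL workflow matching the flowchart above.
# echo.cwl is an assumed CommandLineTool wrapping 'echo'. Without data
# links between steps, CWL imposes no execution order.
cwlVersion: v1.2
class: Workflow
inputs: []
outputs: []
steps:
  a:
    run: echo.cwl
    in: {message: {default: "a"}}
    out: []
  b:
    run: echo.cwl
    in: {message: {default: "b"}}
    out: []
  c:
    run: echo.cwl
    in: {message: {default: "c"}}
    out: []
```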
It is of high priority to integrate with FZJ and WP5.
Generic logger interface for logging in a TF training loop. Support MLFlow, WandB and Tensorboard.
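One way such a generic interface could look, as a stdlib-only sketch: the method names and the `InMemoryLogger` backend are assumptions, and real MLflow/WandB/TensorBoard backends would implement the same abstract methods by delegating to their respective APIs.

```python
from abc import ABC, abstractmethod

class Logger(ABC):
    """Generic logger interface for a training loop. Concrete backends
    (MLflow, WandB, TensorBoard) would implement these same methods,
    so the training code never depends on a specific tracking tool."""

    @abstractmethod
    def log_metric(self, name: str, value: float, step: int) -> None: ...

    @abstractmethod
    def log_params(self, params: dict) -> None: ...

class InMemoryLogger(Logger):
    """Trivial backend that just records calls; useful for tests."""
    def __init__(self):
        self.metrics, self.params = [], {}
    def log_metric(self, name, value, step):
        self.metrics.append((name, value, step))
    def log_params(self, params):
        self.params.update(params)

# The training loop depends only on the interface, not on a backend:
def train(logger: Logger, epochs: int = 2):
    logger.log_params({"epochs": epochs})
    for epoch in range(epochs):
        loss = 1.0 / (epoch + 1)  # placeholder for a real training step
        logger.log_metric("loss", loss, step=epoch)

logger = InMemoryLogger()
train(logger)
```

Swapping backends then means changing one constructor call, not the training loop; the same pattern applies to the Torch variant of this task.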
Implement TF distributed trainer for CMCC use case:
Implement a minimal working baseline for the AI module, following the requirements defined for itwinai v0.0.
For instance with Jekyll
Integrate CMCC use case into prototype.
Code repository can be found here.
Data is accessible via Google Drive here.
There are multiple tasks to be done:
This is a bit off-topic for T6.5 proper. However, we are interested in analyzing how openEO could be extended to high energy physics (HEP) domains, like the detector simulation use case (T4.2).
For instance, openEO could be used in the pre-processing phase, to convert ROOT file format into HDF5.
This is important because it would allow scaling T4.2 to big data.
As explained here
Write tests with pytest for AI module, testing also GH actions for automatic testing: see these docs
Extend the support to other OSs for quick prototyping
Generic logger interface for logging in a Torch training loop. Support MLFlow, WandB and Tensorboard.
Horovod is a framework for distributed machine learning that supports both TensorFlow and PyTorch, providing a single interface for both. It is therefore of interest to T6.5 and will be investigated. Specific work to be done is:
Write tests with pytest for AI module, testing also GH actions for automatic testing: see these docs
Validate Workflow v0 on HPC infrastructure.
At the moment workflow steps are executed inside conda environments (created using mamba), but HPC systems may not support conda. The advantage of using conda is that it makes it easy to build any Python version, but this solution has to be validated against HPC policies and best practices.
Actions
See also:
Update the core API, if needed, to support distributed training for CERN use case (pytorch lightning and tensorflow)
Adapt to new pipeline representation for ML workflows
Improve over the minimal working baseline for the AI module, following the requirements defined for itwinai v0.1.
Write tests with pytest for AI module, testing also GH actions for automatic testing: see these docs