
Maestro Workflow Conductor (maestrowf)


Maestro can be installed via pip:

pip install maestrowf


Getting Started is Quick and Easy

Create a YAML file named study.yaml and paste the following content into the file:

description:
    name: hello_world
    description: A simple 'Hello World' study.

study:
    - name: say-hello
      description: Say hello to the world!
      run:
          cmd: |
            echo "Hello, World!" > hello_world.txt

PHILOSOPHY: Maestro believes in the principle of a clearly defined process, specified as a list of tasks that are self-documenting and clear in their intent.

Running the hello_world study is as simple as...

maestro run study.yaml

Creating a Parameter Study is just as Easy

With the addition of the global.parameters block, and a few simple tweaks to your study block, the complete specification should look like this:

description:
    name: hello_planet
    description: A simple study to say hello to planets (and Pluto)

study:
    - name: say-hello
      description: Say hello to a planet!
      run:
          cmd: |
            echo "Hello, $(PLANET)!" > hello_$(PLANET).txt

global.parameters:
    PLANET:
        values: [Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune, Pluto]
        label: PLANET.%%

PHILOSOPHY: Maestro believes that a workflow should be easily parameterized with minimal modifications to the core process.

Maestro will automatically expand each parameter set into its own isolated workspace, generate a script for each, and monitor the execution of each task.
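
As a rough illustration (a Python sketch of the substitution, not Maestro's actual internals), each value replaces %% in the label to name the workspace, and $(PLANET) in the command:

    values = ["Mercury", "Venus", "Earth"]
    label = "PLANET.%%"
    cmd = 'echo "Hello, $(PLANET)!" > hello_$(PLANET).txt'
    for value in values:
        workspace = label.replace("%%", value)      # e.g. PLANET.Mercury
        expanded = cmd.replace("$(PLANET)", value)  # command run in that workspace
        print(workspace, "->", expanded)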

And, running the study is still as simple as:

maestro run study.yaml

Scheduling Made Simple

But wait, there's more! If you want to schedule a study, it's just as simple. With some minor modifications, you can run on an HPC system.

description:
    name: hello_planet
    description: A simple study to say hello to planets (and Pluto)

batch:
    type:  slurm
    queue: pbatch
    host:  quartz
    bank:  science

study:
    - name: say-hello
      description: Say hello to a planet!
      run:
          cmd: |
            echo "Hello, $(PLANET)!" > hello_$(PLANET).txt
          nodes: 1
          procs: 1
          walltime: "00:02:00"

global.parameters:
    PLANET:
        values: [Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune, Pluto]
        label: PLANET.%%

NOTE: This specification is configured to run on LLNL's quartz cluster. Under the batch header, you will need to make the necessary changes to schedule onto other HPC resources.

PHILOSOPHY: Maestro believes that how a workflow is defined should be decoupled from how it's run. We achieve this capability by providing a seamless interface to multiple schedulers that allows Maestro to readily port workflows to multiple platforms.

For other samples, see the samples subfolder. To continue with our Hello World example, see the Basics of Study Construction in our documentation.

An Example Study using LULESH

Maestro comes packaged with a basic example using LULESH, a proxy application provided by LLNL. You can find the example here.

What is Maestro?

Maestro is an open-source HPC software tool that defines a YAML-based study specification for multistep workflows and automates their execution on HPC resources. Maestro's core design tenets focus on encouraging clear workflow communication and documentation while making consistent execution easier, allowing users to focus on science. Maestro's study specification helps users think about complex workflows in a step-wise, intent-oriented manner that encourages modularity and tool reuse. These principles are becoming increasingly important as computational science becomes ever more present across scientific fields and starts to require rigor similar to that of physical experiments. Maestro is currently in use for multiple projects at Lawrence Livermore National Laboratory and has been used to run existing simulation codes, including MFEM. It has also been used in other areas, including the training of machine-learned models.

Maestro's Foundation and Core Concepts

There are many definitions of workflow, so we try to keep it simple and define the term as follows:

A set of high level tasks to be executed in some order, with or without dependencies on each other.

We have designed Maestro around the core concept of what we call a "study". A study is defined as a set of steps that are executed (a workflow) over a set of parameters. A study in Maestro's context is analogous to a tangible scientific experiment, which has a set of clearly defined and repeatable steps that are repeated over multiple specimens.

Maestro's core tenets are defined as follows:

Repeatability

A study should be easily repeatable. Like any well-planned and implemented science experiment, the steps themselves should be executed the exact same way each time a study is run over each set of parameters or over different runs of the study itself.

Consistent

Studies should be consistently documented and able to be run in a consistent fashion. Removing variation from the process means fewer mistakes when executing studies, easier pickup of studies created by others, and uniformity in defining new studies.

Self-documenting

Documentation is as important in computational studies as it is in physical science. The YAML specification defined by Maestro provides a few required keys that encourage human-readable documentation. Even further, the specification itself documents a complete workflow.


Setting up your Python Environment

To get started, we recommend using virtual environments. If you do not have the Python virtualenv package installed, take a look at their official documentation to get started.

To create a new virtual environment:

python -m virtualenv maestro_venv
source maestro_venv/bin/activate

Getting Started for Contributors

If you plan to develop on Maestro, install the repository directly using:

pip install poetry
poetry install

Once set up, verify the environment; the following paths should point into the virtual environment folder.

which python
which pip

Using Maestro Dockerfiles

Maestro comes packaged with a set of Dockerfiles for testing things out. The two primary files are:

  • A standard Dockerfile in the root of the Maestro repository. This file builds a standard install of Maestro, meant for trying out the demo samples provided with this repository. To try Maestro locally, with Docker installed, run:

    docker build -t maestrowf .
    docker run -ti maestrowf
    

    From within the container run the following:

    maestro run ./maestrowf/samples/lulesh/lulesh_sample1_unix.yaml
    
  • In order to try out Flux 0.19.0 integration, run the following from the root of the Maestro repository:

    docker build -t flux_0190 -f ./docker/flux/0.19.0/Dockerfile .
    docker run -ti flux_0190
    

    From within the container run the following:

    maestro run ./maestrowf/samples/lulesh/lulesh_sample1_unix_flux.yaml
    

Contributors

Many thanks go to MaestroWF's contributors.

If you have any questions or want to submit a feature request, please open a ticket.


Release

MaestroWF is released under an MIT license. For more details see the NOTICE and LICENSE files.

LLNL-CODE-734340


maestrowf's Issues

Introduce Contributing Guidelines

I would recommend adding a basic Contributing Guidelines document to your project. You can place the CONTRIBUTING.md file under the .github directory in your repository.

In this document you would list various requirements before a PR is accepted such as:

  • Commit message formats/requirements (you also need to explore the need to have a Signed-off-by)
  • Coding styles you are enforcing
  • What steps a PR takes to get accepted (ex. must be code reviewed, must pass CI testing)
  • Other resources (ex. developer mailing lists)

A couple of examples from mpiFileUtils and ZFS on Linux.

Additional CLI hooks to validate and restart specifications.

Suggested by @jsemler.

I like the proposed suggestions.

A few additional suggestions for discussion:

maestro validate <study spec>

The ability to easily validate a study spec without needing to set up the study would be helpful.

Regarding the restart functionality mentioned above, I think something like the following could be useful:

maestro restart <optional step>

I'm initially thinking two major use cases would be to have MaestroWF figure out where to restart a study or let a user specify a given step to restart from.

Release of 1.1.0dev and Prioritization of next design phase

I wanted to spin up a discussion on the next phases of design for Maestro. This issue is going to be a place to note down thoughts and discuss the next steps and prioritization of upcoming features. I'm planning to mark the completion of the 1.1.0dev release with the merge of the bugfix/dependency_ordering branch, because it fixes a long-standing issue of topologically sorting the nodes before addition, along with a host of features and bugfixes that have been introduced in the meantime.


So let's start by summarizing the discussions and outstanding feature requests for major additions:

  • The introduction of services to the specification and ExecutionGraph (#77).
    • This feature opens up the ability to boot up programs and tools that we consider "services", executing them either locally or remotely. These "services" can include (but are not limited to) databases, daemons, APIs to external tools and services, and the like.
    • The more complicated workflows become, the more likely services are going to be required. Maestro should be able to spin up these services before a study begins (and, by implication, forgo starting a study if services fail to boot).
  • Specification of resources outside of nodes and procs (or tasks) (#76).
    • We're rapidly moving towards ubiquitous availability of compute resources that are not CPUs or cores. GPUs are available on the latest petaflop machines and other hardware is being researched for compute specific engines (machine learning etc.). A flexible way of specifying resources would be helpful both from Maestro ease of use and maintainability of the specification itself.
    • The specificity of resources also increasingly lends itself to designating resources outside of a study specification, in a separate input file passed to the maestro run <args> command line.
  • Definitions of command recipes (#71).
    • The discussion about recipes started and didn't really pick up. It's finding more relevance as there are more outstanding requests to be able to specify different flavors of MPI when scheduling jobs using an adapter (say, if a user wants to schedule with SLURM, they don't necessarily always want scripts generated using srun).
  • Restarting of a study (#95)
    • The restarting of a study at the top level sounds simple enough, but different users and use cases may specify the concept differently. I'd like to have this feature added in the next phase with either features that are general enough to be used by most users or with command options that allow it to be flexibly used.

The next points are things that I would like to achieve that don't necessarily provide direct functionality but set the stage for future improvements and features.

  • Generating metadata for studies launched with Maestro (#93)
    • While metadata may not provide full provenance of a workflow/study, its standardized formatting and output provide useful information for tools built on top of a workflow. Tools that could use metadata include cataloging services, archiving services, and post-processing.
  • Creation of a backend object interface for managing study records.
    • Currently the only way Maestro tracks a study is the use of a pickle file in the study output directory.
      • The benefits to the pickle are that it is platform agnostic, does not depend on securing ports to external servers, and requires no additional packages or tools to make work (pickling is standard in Python).
      • The downsides are that it relies on the user's file system, which means its permissions are set based on where the study is executed, and that the backend interface is not general, making it difficult to port to other technologies in the future.
    • Moving record keeping behind a standard interface and rewriting the current pickle coupling to exercise that interface provides the first step towards a standard interface (while allowing current functionality to serve as a test). The new interface opens up the ability to use other database technologies uniformly within Maestro.
    • The next steps would be to introduce access to a database and possibly a web service back end.
  • Ability to specify cluster information
    • Attempts to implement the LSF adapter have pointed out that the passing of cluster information may be required to achieve higher levels of flexibility. Job launchers like LSF require knowledge of the cluster to calculate resource sets. Additionally, having access to a cluster configuration allows for better validity checking of parameters for parallel jobs and batch submission.

Any thoughts are welcomed and appreciated.
@gonsie @dinatale2 @jsemler

"smarter" git clone

Use git clone --depth=1 or download an archive (rather than cloning with git). Just as a way to decrease the number of files.

Carry other information in yaml description

Maestro currently cannot handle keys with arbitrary names within the "description" key of the main .yaml specification. Adding this ability would allow users to carry other information here, which might be useful for study tracking.

This appears to require a change to the execution graph's add_description method, such that it can accept arbitrary keyword/argument pairs and attach them to the ExDag, or simply ignore keywords other than name and description; a sketch of the first option appears after the example below.

Example:

Modify lulesh_sample1_macosx.yaml to add a "dummy" key in the description:

description:
    name: lulesh_sample1
    description: A sample LULESH study that downloads, builds, and runs a parameter study of varying problem sizes and iterations.
    dummy: Dummy field to throw error.
....

Invoking

maestro run lulesh_sample1_macosx.yaml

produces the following error:

Traceback (most recent call last):
  File "maestrowf/venv/bin/maestro", line 11, in <module>
    load_entry_point('maestrowf', 'console_scripts', 'maestro')()
  File "maestrowf/maestrowf/maestro.py", line 375, in main
    rc = args.func(args)
  File "maestrowf/maestrowf/maestro.py", line 179, in run_study
    path, exec_dag = study.stage()
  File "maestrowf/maestrowf/datastructures/core/study.py", line 799, in stage
    return self._out_path, self._stage_parameterized()
  File "maestrowf/maestrowf/datastructures/core/study.py", line 399, in _stage_parameterized
    dag.add_description(**self.description)
TypeError: add_description() got an unexpected keyword argument 'dummy'
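
A minimal sketch of the kwargs-tolerant option (ExDag here is an illustrative stand-in; Maestro's actual class and signature may differ):

    class ExDag:
        """Hypothetical stand-in for Maestro's execution graph."""

        def __init__(self):
            self._description = {}

        def add_description(self, name, description, **kwargs):
            # Keep the two required keys, then attach any extra metadata
            # instead of raising TypeError on unknown keywords.
            self._description = {"name": name, "description": description}
            self._description.update(kwargs)

    dag = ExDag()
    dag.add_description(name="lulesh_sample1",
                        description="A sample LULESH study.",
                        dummy="No longer throws an error.")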

TypeError given for integer tokens

If a value from the YAML spec is an integer, scanning it for tokens results in a TypeError. Integer values should be handled.

maestrowf/datastructures/core/studyenvironement.py line 93.
if any(token in item.value for token in self._tokens):

Gives error:

TypeError: argument of type 'int' is not iterable
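
One possible fix, sketched with an assumed token set (the real tokens and surrounding code live in the study environment module): coerce the value to a string before the membership test.

    TOKENS = ("$(SPECROOT)", "$(OUTPUT_PATH)")  # assumed tokens, for illustration

    def contains_token(value, tokens=TOKENS):
        # str() makes integer (or other non-string) values safe to scan.
        return any(token in str(value) for token in tokens)

    assert contains_token("$(SPECROOT)/data")
    assert not contains_token(42)  # previously raised TypeError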

Upload sdist to PyPI

For the version 1.0.0 release, you should have both the sdist and bdist versions available.

URL in source license comment

The license comment in the source files is missing a URL. The text reads:
# For details, see <URL describing code and how to download source>.

This appears at least 39 times throughout the repo.

Focus, once split, can't be recombined/pared down

When executing a study step over a list of values, you generate len(list) new .sh files. Subsequent steps in the study also generate len(list) new .sh files each. This can be extremely useful for processing in place, but some way to recombine or pare back down to a single .sh for cleanup, etc. would be handy!

Update the API for SchedulerScriptAdapter.cancel_jobs to return per job status

SchedulerScriptAdapter.cancel_jobs(..) takes a job list and attempts to cancel the specified jobs in one command. The downside to this approach is that if the scheduler's API returns a non-success code, the method is forced to return CancelCode.ERROR for the whole list. The method's API should be modified to return a map of job identifier to cancellation status so that the outcome for each job can be determined.
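
A rough sketch of the proposed per-job API (the cancellation call is stubbed, and the CancelCode values are assumed):

    from enum import Enum

    class CancelCode(Enum):  # assumed to mirror Maestro's CancelCode
        OK = 0
        ERROR = 1

    def _cancel_one(job_id):
        """Stub for a scheduler-specific cancellation call (e.g. scancel <id>)."""
        pass

    def cancel_jobs(joblist):
        # Cancel each job individually and record its status, so a single
        # failure does not mask the outcome of the other cancellations.
        results = {}
        for job_id in joblist:
            try:
                _cancel_one(job_id)
                results[job_id] = CancelCode.OK
            except RuntimeError:
                results[job_id] = CancelCode.ERROR
        return results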

Crash Log Traceback stuck at localscriptadapter.py

When I run a study and a step fails, I get the stderr trace in the step's .txt file, but it is stuck at localscriptadapter.py. Here is a trace from SEM-Images-Stats.txt.

Traceback (most recent call last):
  File "/Users/fyc/anaconda2/envs/demo-2.7/lib/python2.7/logging/__init__.py", line 861, in emit
    msg = self.format(record)
  File "/Users/fyc/anaconda2/envs/demo-2.7/lib/python2.7/logging/__init__.py", line 734, in format
    return fmt.format(record)
  File "/Users/fyc/anaconda2/envs/demo-2.7/lib/python2.7/logging/__init__.py", line 465, in format
    record.message = record.getMessage()
  File "/Users/fyc/anaconda2/envs/demo-2.7/lib/python2.7/logging/__init__.py", line 329, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Logged from file localscriptadapter.py, line 138

(The same traceback is repeated several more times in the log.)

Bug in Variable class verification: 0 value

The _verify method of the Variable class in variable.py checks self.value by calling bool(self.value). However, if self.value is 0, this returns False. A workaround for this case would be to require the value of a variable to be explicitly specified as a string or converted into a string if it is numeric.
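
A small sketch of why the truthiness check fails and one way around it (illustrative only; the real check lives in Variable._verify):

    def verify_value(value):
        # bool(0) is False, so a bare truthiness test wrongly rejects 0.
        # Testing against None (and the empty string) distinguishes
        # "missing" from "falsy".
        return value is not None and str(value) != ""

    assert verify_value(0)        # rejected by the bool(self.value) check
    assert not verify_value("")
    assert not verify_value(None)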

Formalize a PyPI Release Process

We need to formalize a release process to PyPI that will automatically compile all supported versions for both sdist and bdist.

Add return codes to the conductor

The conductor currently terminates normally in all cases where it doesn't throw an exception. While this is useful for knowing that the conductor didn't error out, it gives other scripts no way to tell how a study terminated. This issue intends to standardize a set of return codes that represent not only the normal termination of the conductor but also the termination status of the study.
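
A sketch of what a standardized code set might look like (the names and values here are assumptions, not an agreed-upon set):

    import sys
    from enum import IntEnum

    class ConductorExit(IntEnum):
        """Hypothetical exit codes distinguishing study outcomes."""
        SUCCESS = 0        # conductor and study both finished cleanly
        STUDY_FAILED = 1   # conductor finished, but one or more steps failed
        CANCELLED = 2      # study was cancelled before completion

    def finish(status):
        # Other scripts can now branch on the conductor's exit code.
        sys.exit(int(status))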

Add the ability to map other data sources

We need the ability to map other data sources besides the YAML file to a Maestro workflow to support use cases such as building the spec programmatically or through a workflow builder tool.

Resume from failure point

I have noticed that the execution graph is pickled.
Is there a plan for a way to resume a study execution from its last point of failure?

Add the ability to split allocations

Currently, a workflow step is the only granularity for allocating nodes. There is no syntax to break a single step's allocation into further sub-allocations using a parallelization command. This lack of functionality prevents users from running multiple commands that need to run concurrently; currently, such commands would have to be scheduled independently, limiting the possible flows that can be represented.

Cancellation of a local study claims success but keeps running

The cancellation of a locally executed study reports success but continues running. To reproduce, run the following:

cd <path to maestrowf repo>
maestro run -o -y ./bugtest/lulesh ./samples/lulesh/lulesh_sample1_unix.yaml
cd ./bugtest/lulesh
maestro cancel .
watch maestro status .

Watching the status will show that the study continues executing, but the log will report that the study was cancelled.

Continuous Integration

In order to help you test existing functionality and avoid regressions, a test suite should be introduced. There are testing platforms available for Python (e.g., nose, unittest). Once the test suite runs, the next step is to set up CI. TravisCI, for instance, is free and integrated with GitHub; there are also less integrated alternatives such as buildbot.

Inputs and Source Code Context in the jobs

When running a study, since maestrowf generates scripts and dispatches them with a scheduler, is there a policy for where the data and the code should live?
Do I have to keep the code and the data with the study so they are copied with the spec, or clone/download both before running the code in each step?
I am asking because in some cases we want to keep at least the data outside of the study and reference it by full or relative path.

Fix Typos

There are typos in a couple of places: it says decription where I believe it should say description. I found these errors while making a Docker container, when a warning came up saying:

/usr/local/lib/python3.6/distutils/dist.py:261: UserWarning: Unknown distribution option: 'decription'
  warnings.warn(msg)

I'm not totally sure if these caused the warning, but they should be fixed nonetheless.

decription='A tool and library for specifying and conducting general '

"provided for a valid study decription.")

BaseException.message has been deprecated as of Python 2.6

BaseException.message has been deprecated. Calls such as e.message will result in the following (a portable replacement is sketched after the list):

  • Python 2.7 returns the warning: DeprecationWarning: BaseException.message has been deprecated as of Python 2.6.
  • Python 3 throws an error: AttributeError: 'ValueError' object has no attribute 'message'.
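
The portable replacement is to read str(e) (or e.args) instead of e.message:

    try:
        raise ValueError("bad value")
    except ValueError as e:
        msg = str(e)  # works on both Python 2.7 and Python 3
        # or: msg = e.args[0] if e.args else ""
        print(msg)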

The conductor is not robust when forced to terminate abnormally

Consider the case where the conductor fails during execution due to unexpected issues (a system crash, or file permission errors caused by unexpected file system issues). When the conductor crashes, the user currently must manually restart the study by pointing the conductor at the pkl file in the output directory. If this is done soon enough after a crash, the scheduler most likely still contains the job states and the conductor can resume.

However, in the usual case the conductor crashes without the user's knowledge. The scheduler usually loses the state of the jobs by the time the conductor is restarted, meaning the ExecutionGraph cannot recover because it cannot find the states of jobs that were previously running. The best case is that the jobs are long-running and still being managed by the scheduler. The worst case is that the jobs have finished and their state is no longer kept.

The ExecutionGraph needs a more graceful way to handle these conditions. There are a couple of options here (a rough sketch follows the list):

  • If a step in the graph is detected to have been running and the job isn't detected -- restart the simulation from scratch.
  • If a step was pending -- restart it from scratch.
  • If a step had failed -- either consider it failed (make sure all dependent steps are marked as such) OR attempt to restart the step from scratch.
  • Otherwise, treat the step normally.
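
A compact sketch of that recovery logic (the Step class and state names are invented for illustration):

    class Step:
        """Minimal illustrative step record."""
        def __init__(self, name, state, job_id=None):
            self.name, self.state, self.job_id = name, state, job_id

        def restart(self):
            self.state = "INITIALIZED"

        def mark_dependents_failed(self):
            pass  # placeholder: would propagate failure to dependent steps

    def recover_step(step, known_job_ids):
        # Hypothetical per-step recovery after an abnormal conductor exit.
        if step.state == "RUNNING" and step.job_id not in known_job_ids:
            step.restart()                 # scheduler lost the job: start over
        elif step.state == "PENDING":
            step.restart()                 # never started: start over
        elif step.state == "FAILED":
            step.mark_dependents_failed()  # or attempt a restart from scratch
        # otherwise, treat the step normally

    recover_step(Step("run-lulesh", "RUNNING", job_id="12345"), known_job_ids=set())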

Add the ability to specify services

We should support a way to launch services, where a service is a command that needs to be spun up before executing the steps in a study. These services can be spun down once the study is determined to have completed (in either a failure or success condition).

Rework the CLI to be more user friendly

With the introduction of added functionality like checking study statuses, restarting studies, restarting failures, etc., we need a cleaner interface for users. This ticket is meant to track discussion and ideas for new command line designs.

My current thoughts are as follows:
maestro run <study spec>
maestro status <study path>
maestro add <study path>

Something to the above effect would enable an interface that allows multiple studies to be run, with the ability to add more studies and to construct compilations of studies.
@jsemler, @joe-eklund, @danlaney -- Thoughts?

lock files in /tmp

Just an idea, but maybe the lock files could be stored in /tmp? That way, a few fewer inodes are used on our file systems.
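
A minimal sketch, assuming the lock is a plain file (how Maestro actually implements its locking may differ):

    import os
    import tempfile

    # Place the lock under the system temp dir instead of the study
    # workspace, saving inodes on shared parallel file systems.
    lock_path = os.path.join(tempfile.gettempdir(), "maestro_hello_world.lock")
    with open(lock_path, "w") as lock_file:
        lock_file.write(str(os.getpid()))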

Cancellation status is not reflected when executing locally

There's a bug when updating the statuses of jobs executing locally. Because the local adapter currently executes steps sequentially, one at a time and to completion, there is no window in which a process would be reported cancelled during or after execution. Therefore, when execute_ready_steps in the ExecutionGraph is called again by the conductor, it sees the cancellation lock and aborts, never reporting a cancellation anywhere a user can see it.

Add the ability to specify which MPI launcher to use

@gonsie mentioned this issue previously, but I wanted to make note of it since it came up again when using the Flux-spectrum adapter. This issue addresses the ability to specify an MPI launcher other than a scheduler's default (say, for SLURM, something other than srun).

The FluxSpectrum adapter handles this by requiring the mpi key in its batch parameters. I propose that this key, along with something like mpi_args, become required batch keys. An adapter can then index into a standard set of MPI objects to get the appropriate launcher format (and validate args). The mpi key could even be optional, with each specific scheduling adapter specifying its preferred default.
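
A sketch of the proposed lookup with an invented registry (the launcher strings and the semantics of the mpi key are assumptions for illustration):

    MPI_LAUNCHERS = {
        "srun":   "srun -n {procs} -N {nodes} {cmd}",
        "mpirun": "mpirun -np {procs} {cmd}",
    }

    def format_parallel(cmd, procs, nodes, mpi="srun"):
        # An adapter would index into the registry using the batch block's
        # "mpi" key, falling back to the scheduler's preferred default.
        return MPI_LAUNCHERS[mpi].format(cmd=cmd, procs=procs, nodes=nodes)

    print(format_parallel("lulesh2.0 -i 100", procs=36, nodes=1, mpi="mpirun"))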

Don't Generate Scripts all at once

If I wanted to run 1 billion studies, Maestro would currently generate 1 billion scripts up front. That is just way too many files. Maybe there is a way to incrementally generate the scripts as they are needed (a generator-based sketch follows).

If reproducibility isn't a concern, maybe the old scripts could be cleaned up.

Label: scaling
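
A sketch of lazy generation using a Python generator (parameter handling is heavily simplified):

    def generate_scripts(parameter_sets):
        # Yield one (filename, script body) pair at a time instead of
        # materializing every script up front.
        for i, params in enumerate(parameter_sets):
            yield "run-%d.sh" % i, "echo 'Hello, %s!'\n" % params

    for name, body in generate_scripts(["Mercury", "Venus", "Earth"]):
        # Write and submit each script as it becomes ready; today, all
        # scripts would be written before any submission happens.
        print(name, body.strip())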

YAMLSpecification Requires the batch entry to exist.

The YAMLSpecification object currently has a hard requirement that the batch specification entry exist. The use case of running a workflow locally is also a valid one and should be accounted for. There are two avenues for achieving this:

  • Loosen the requirement that the batch entry must exist.
  • Create a "local" adapter that runs all steps locally.

Add the LULESH specification to TravisCI

As a base test for functionality, jobs should be able to run normally on a local machine. We should add the LULESH specification to the continuous integration tests.

Decide on a merging scheme

The general idea is that a user should be able to specify multiple specifications. How should that look? The obvious options are as follows (a sketch of the block-merge option follows the list):

  • Override full sections with specifications that appear later on the command line. That is, for maestro spec1 spec2, the final specification would take from spec2 any blocks that appear in both spec1 and spec2 (as well as blocks specified only in spec2).
  • Merge blocks. For a dictionary entry in the spec, update the dictionary to include keys that appear later and override matching keys.
  • Denote inheritance of values somehow. This option provides a way to specify skeletons.
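
A sketch of the block-merge option, assuming specifications are plain dictionaries once parsed:

    def merge_specs(base, override):
        # Recursively update matching dictionary blocks; keys unique to
        # either spec survive, and later specs win on conflicts.
        merged = dict(base)
        for key, value in override.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_specs(merged[key], value)
            else:
                merged[key] = value
        return merged

    spec1 = {"description": {"name": "base"}, "study": ["step-a"]}
    spec2 = {"description": {"description": "an override adds a field"}}
    print(merge_specs(spec1, spec2))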

Clean up Makefile

The current default make command installs the full set of development packages. Clean up the Makefile to make it less monolithic in its install process.

Generate metadata about a study

It would be useful for MaestroWF to write metadata about a launched study. This would make it easier for post processing scripts to find output data and other information.

Metadata could include (a sketch of emitting such a record follows the list):

  • The MaestroWF version used to launch the study
  • Paths to workspaces, MaestroWF logs, etc.
  • Git hashes and other information that may not be in the YAML specification
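
A sketch of emitting such a record (the keys and paths are illustrative, not a settled format):

    import json

    # Hypothetical metadata record written alongside a launched study.
    metadata = {
        "maestrowf_version": "1.1.0",                  # assumed version string
        "workspace": "./hello_world_20180101-000000",  # assumed output path
        "log_path": "./logs/hello_world.log",
        "spec_git_hash": None,  # populated when the spec lives in a git repo
    }

    with open("study_metadata.json", "w") as handle:
        json.dump(metadata, handle, indent=2)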

Add the ability to cancel studies

Currently adapters are not required to provide a cancel method. We need to update the abstract ScriptAdapter to include a cancel method. The implementation of this method would also allow users to cancel running studies without having to find and kill the jobs themselves.

documentation

I'd like to start creating a documentation website. I think the docstrings can be easily imported into Read the Docs or something similar.

Assign this one to me.

Workflow Steps with Restart Commands Do Not Restart

It seems that restart scripts are no longer being generated for steps that have restart commands. This bug may be a side effect of reworking some of the other functionality in the ScriptAdapter class. The resulting behavior is as follows:

  • When a _StepRecord is constructed, a step that contains a restart command will have that command stored (verified via ipython for a pkl file that had a workflow with restarts)
  • The same _StepRecord that contains the step (with restart command) reports that the restart script is None (nor is it in the output directory with generated scripts). This implies that the restart script is not being generated and set in the first place.
