slutil

A modern slurm experience.

A command line utility to view Slurm jobs. Written with rich and click.

Usage: slutil [OPTIONS] COMMAND [ARGS]...

Commands:
  report  Get status of multiple jobs
  status  Get status of a slurm job.
  submit  Submit a slurm job.

Contributing

Pushes to main are forbidden; all changes must go through a PR. All tests must pass before a PR can be merged. Code is formatted with black, and the project is built with poetry.

submit

Adds metadata to an sbatch command and stores the job in the database.

Jobs must be submitted with this command to be logged in the database.

Usage: slutil submit [OPTIONS] SBATCH_FILE DESCRIPTION

  Submit a slurm job.

  SBATCH_FILE is a path to the .sbatch file for the job

  DESCRIPTION is a text field describing the job

Options:
  --help  Show this message and exit.

report

View list of recent jobs

  • The -c/--count option sets the number of jobs to display. Defaults to 10.
  • Output is truncated to the screen width by default; pass -v to enable word-wrap.

Usage: slutil report [OPTIONS]

  Get status of multiple jobs

Options:
  -c, --count INTEGER
  -v, --verbose
  --help               Show this message and exit.

status

Displays the data for a specific job.

Usage: slutil status [OPTIONS] SLURM_ID

  Get status of a slurm job.

  SLURM_ID is the id of the job to check.

Options:
  --help  Show this message and exit.

Contributors

eugene-prout, mcleish7


slutil's Issues

Add VCS interface

Is your feature request related to a problem? Please describe.
Tests need a consistent commit tag, and slutil submit fails when it is run outside a git repository.

Describe the solution you'd like
Add an adapter for an abstract version control system and ship with a git implementation.
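One possible shape for such an adapter is sketched here. The names VersionControl, FixedCommit and describe_job are hypothetical and do not come from the slutil codebase; the point is only that tests can inject a fixed commit tag instead of calling git.

```python
from abc import ABC, abstractmethod


class VersionControl(ABC):
    """Hypothetical adapter interface; the name is an assumption."""

    @abstractmethod
    def current_commit(self) -> str:
        """Return an identifier for the current commit."""


class FixedCommit(VersionControl):
    """Test double that always reports the same commit tag."""

    def __init__(self, tag: str) -> None:
        self._tag = tag

    def current_commit(self) -> str:
        return self._tag


def describe_job(description: str, vcs: VersionControl) -> str:
    """Attach the commit reported by the adapter to a job description."""
    return f"{description} (commit {vcs.current_commit()})"
```

A git implementation would subclass VersionControl in the same way and shell out to git inside current_commit.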

Edit command

Is your feature request related to a problem? Please describe.
Once a job has finished, I want to add more information to its description, e.g. why it failed.

Describe the solution you'd like
A command to edit the description of the job

Git adapter commands fail when not in a repo

Describe the bug
When not in a git repo, slutil cannot detect the latest commit and git reports: fatal: not a git repo. It also fails, with a different error, when the repo has no commits. In both cases slutil should fail silently and mark the git commit as unknown for these jobs.
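A minimal sketch of the requested fallback, assuming the commit is read with git rev-parse. The function name current_commit, the UNKNOWN_COMMIT constant and the injectable run parameter are hypothetical, not taken from slutil.

```python
import subprocess

UNKNOWN_COMMIT = "unknown"  # hypothetical marker for jobs without a commit


def current_commit(cwd=None, run=subprocess.run) -> str:
    """Return the short hash of HEAD, or UNKNOWN_COMMIT if git cannot provide one."""
    try:
        result = run(
            ["git", "rev-parse", "--short", "HEAD"],
            cwd=cwd,
            capture_output=True,
            text=True,
            check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        # Non-zero exit (not a repo, or no commits yet) or git is not installed.
        return UNKNOWN_COMMIT
    return result.stdout.strip() or UNKNOWN_COMMIT
```

Passing run as a parameter keeps the function testable without a real repository.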

Allow read-only views

Is your feature request related to a problem? Please describe.
Unable to view data on a machine without slurm as most commands require slurm access.

Describe the solution you'd like
When slutil runs on a system without slurm, it should operate in read-only mode: warn the user and show the timestamp of the last update.
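One way to detect the read-only case is to check whether a slurm executable is on PATH. This is only a sketch under that assumption; slurm_available and read_only_warning are hypothetical names, and the wording of the warning is illustrative.

```python
import shutil
from datetime import datetime
from typing import Callable, Optional


def slurm_available(which: Callable[[str], Optional[str]] = shutil.which) -> bool:
    """Treat slurm as available when the sacct executable is on PATH."""
    return which("sacct") is not None


def read_only_warning(last_update: datetime) -> str:
    """Build the warning shown when slutil falls back to read-only mode."""
    return (
        "slurm not found: read-only mode, "
        f"data last updated {last_update:%Y-%m-%d %H:%M:%S}"
    )
```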

Add "diff" command

Is your feature request related to a problem? Please describe.
slutil report doesn't always show me relevant information. I only want to see jobs which have changed state since I last looked

Describe the solution you'd like
slutil report stores the time it was last read and shows only the jobs whose state has changed since then; unchanged jobs are omitted.
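The selection step could look like the following sketch. It assumes each job records when its state last changed; the JobState class and its changed_at field are hypothetical and may not match how slutil stores jobs.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass
class JobState:
    slurm_id: str
    state: str
    changed_at: datetime  # hypothetical: when the state last changed


def changed_since(jobs: Iterable[JobState], last_read: Optional[datetime]) -> List[JobState]:
    """Return jobs whose state changed after last_read (all jobs on first read)."""
    if last_read is None:
        return list(jobs)
    return [job for job in jobs if job.changed_at > last_read]
```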

Add filter command

Add a command to filter jobs by the following criteria:

  • Description (fuzzy, case-insensitive match)
  • sbatch file
  • commit tag

All supplied filters should be combined with AND.
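A rough sketch of AND-combined filters follows. The Job fields and the filter_jobs signature are assumptions, and a case-insensitive substring test stands in for the fuzzy match, which the issue does not define further.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Job:
    # Hypothetical record layout, for illustration only.
    slurm_id: str
    description: str
    sbatch_file: str
    commit: str


def filter_jobs(
    jobs: Iterable[Job],
    description: Optional[str] = None,
    sbatch_file: Optional[str] = None,
    commit: Optional[str] = None,
) -> List[Job]:
    """Keep jobs that satisfy every filter that was supplied (logical AND)."""

    def matches(job: Job) -> bool:
        if description is not None and description.lower() not in job.description.lower():
            return False
        if sbatch_file is not None and job.sbatch_file != sbatch_file:
            return False
        if commit is not None and not job.commit.startswith(commit):
            return False
        return True

    return [job for job in jobs if matches(job)]
```

Filters left as None are ignored, so calling filter_jobs with no criteria returns every job.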

Update multiple job states with one slurm call

Is your feature request related to a problem? Please describe.
When requesting multiple jobs slutil will iterate over each and make individual calls to sacct. This is inefficient and introduces lots of slurm calls.

Describe the solution you'd like
slutil calls a slurm command which returns the state of many different jobs in one response. This would speed up the code and reduce overhead.
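sacct accepts a comma-separated list of job ids, so one call can cover many jobs. The sketch below builds such a command and parses pipe-delimited output; the function names are hypothetical, and it is not claimed to match how slutil invokes sacct today.

```python
from typing import Dict, Iterable, List


def build_sacct_command(job_ids: Iterable[str]) -> List[str]:
    """Build a single sacct call covering all of the given job ids."""
    return [
        "sacct",
        "--jobs", ",".join(job_ids),
        "--allocations",   # one line per job, no .batch/.extern steps
        "--noheader",
        "--parsable2",     # fields separated by '|', no trailing delimiter
        "--format=JobID,State",
    ]


def parse_states(output: str) -> Dict[str, str]:
    """Map job id to state from 'JobID|State' lines."""
    states: Dict[str, str] = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        job_id, _, state = line.partition("|")
        words = state.split()
        # A state such as 'CANCELLED by 1000' is reduced to its first word.
        states[job_id] = words[0] if words else ""
    return states
```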

Better error handling when loading from an incorrect csv

Is your feature request related to a problem? Please describe.
Currently if the .csv file is incorrectly formatted, the file loading will silently fail and errors will surface when interacting with the commands.

Describe the solution you'd like
An error message should be shown when a job fails to load. The system should check how many entries each csv row has and whether each entry has the correct type. If there are any issues, the message error loading job <jobid> should be displayed to the user, and the system should recover by ignoring that job. Automatic error correction is not in the scope of this feature.
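The validation described above might be sketched as follows. The four-column layout (job id, sbatch file, description, commit) and the integer job id are assumptions made for the example; the real csv schema is not given here.

```python
import csv
from typing import Iterable, List, Tuple

EXPECTED_FIELDS = 4  # hypothetical: job id, sbatch file, description, commit


def load_jobs(lines: Iterable[str]) -> Tuple[List[tuple], List[str]]:
    """Parse csv lines, skipping malformed rows and collecting error messages."""
    jobs: List[tuple] = []
    errors: List[str] = []
    for row in csv.reader(lines):
        if not row:
            continue  # ignore blank lines
        raw_id = row[0]
        if len(row) != EXPECTED_FIELDS:
            errors.append(f"error loading job {raw_id}")
            continue
        try:
            job_id = int(raw_id)
        except ValueError:
            errors.append(f"error loading job {raw_id}")
            continue
        jobs.append((job_id, row[1], row[2], row[3]))
    return jobs, errors
```

The caller can print the collected errors and carry on with the jobs that loaded.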

`delete` command doesn't stop the job in Slurm

Describe the bug
When deleting a job you would expect the job to be removed from Slurm as well. Currently the job continues executing in Slurm.

Expected behavior
When running slutil delete <job id>, the program calls scancel <job id>. This requires slurm access. If there is no slurm access, the application could tell the user so and ask whether they want to remove the job from slutil while leaving it running on the cluster.

Test on multiple Python versions

Is your feature request related to a problem? Please describe.
Currently pytest only runs on one version of Python. We want to target as much of 3.x as possible, so we need to test widely.

Describe the solution you'd like
Use something like tox in the CI pipeline and locally to ensure compatibility with Python features. This may require removing newer language features. Have badges from the CI pipeline showing which versions are supported.

Describe alternatives you've considered
Only supporting one version, which wouldn't be good for wide distribution.

Additional context
Plan:

  • Find tool for testing
  • See which versions of Python are compatible with current codebase
  • Remove modern features if needed
  • Add this to pipelines

Support wider range of slurm ids

Is your feature request related to a problem? Please describe.
When submitting a job array, slurm gives the jobs ids of the form 123456_1, 123456_2, ...

Describe the solution you'd like
Widen the supported job id type from integer to string. The id should still be a unique identifier, stored exactly as slurm outputs it.

Describe alternatives you've considered
Not supporting these jobs

Additional context
May need to rework if the full id is not immediately available. e.g. slutil submit --batch ...
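A small sketch of string-based ids that accepts both plain and array-task forms. parse_job_id is a hypothetical helper, and only the 123456 and 123456_1 shapes from this issue are handled; other formats slurm may print are out of scope.

```python
import re
from typing import Optional, Tuple

# Matches '123456' or '123456_1'.
_JOB_ID = re.compile(r"(?P<job>\d+)(?:_(?P<task>\d+))?")


def parse_job_id(raw: str) -> Tuple[str, Optional[int]]:
    """Split a slurm id into (job id, array task index or None)."""
    match = _JOB_ID.fullmatch(raw.strip())
    if match is None:
        raise ValueError(f"unsupported slurm job id: {raw!r}")
    task = match.group("task")
    return match.group("job"), (int(task) if task is not None else None)
```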

Support extra arguments for `sbatch` in the submit command

Description:
Add the ability to pass command line options to slurm, e.g. sbatch --dependency=afterok:624713 test_4.sh

Solution should:
Have the ability to execute 'sbatch --dependency=afterok:624713 test_4.sh' or other command line options for slurm with slutil.
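Whatever way the extra options reach slutil, they end up placed before the script path in the sbatch call. A minimal sketch, with build_sbatch_command as a hypothetical helper name:

```python
from typing import List, Sequence


def build_sbatch_command(sbatch_file: str, extra_args: Sequence[str] = ()) -> List[str]:
    """Place pass-through options before the script, as sbatch expects."""
    return ["sbatch", *extra_args, sbatch_file]
```

Building the command as a list rather than a single string avoids shell quoting problems when it is handed to subprocess.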

`report` fails when job is recently created

Describe the bug
After submitting a job, slutil report fails as sacct <flags> returns no data. Changing the flags does not add anything to the output.

Fix

Add a delay to the status check method. Example: don't run sacct if the job has been added in the last 5 seconds. Need to root-cause it in the slurm code.
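The suggested delay can be sketched as a simple age check. It assumes the submission time is stored with each job; should_query_slurm and GRACE_PERIOD are hypothetical names, and the 5 seconds come from the example above.

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(seconds=5)  # value taken from the example in this issue


def should_query_slurm(submitted_at: datetime, now: datetime) -> bool:
    """Skip sacct for jobs that were submitted within the grace period."""
    return now - submitted_at >= GRACE_PERIOD
```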

Add ability to remove jobs from database

Is your feature request related to a problem? Please describe.
After a job has finished, sometimes there is no need to keep the entry in the .csv (bad data/config errors etc...)

Describe the solution you'd like
Add a new command, delete, which takes a slurm job id, shows the job and asks the user to confirm the deletion. It should probably be a soft delete: add a deleted flag and filter on that flag when showing results. An undelete command could be added to restore the job.
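The soft-delete idea could be sketched like this. The Job record and the helper names are hypothetical, and the confirmation prompt is left out.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List


@dataclass(frozen=True)
class Job:
    # Hypothetical record; only the deleted flag matters for this sketch.
    slurm_id: str
    description: str
    deleted: bool = False


def set_deleted(jobs: Iterable[Job], slurm_id: str, deleted: bool) -> List[Job]:
    """Return a copy of jobs with the flag changed for the matching job."""
    return [replace(job, deleted=deleted) if job.slurm_id == slurm_id else job for job in jobs]


def visible(jobs: Iterable[Job]) -> List[Job]:
    """Jobs that should appear in reports."""
    return [job for job in jobs if not job.deleted]
```

Calling set_deleted with deleted=False gives the undelete behaviour, since the row is never removed.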

Protect main branch

Prevent people from pushing directly to main. The GitHub Actions bot still needs to be allowed to commit for release tagging.

Writing to CSV is not atomic

Describe the bug
If an exception is raised when the csv repository is writing, there is data loss. The in-memory data is discarded and cannot be recovered. This is a violation of ACID as the transaction should fully complete or fail and not change the csv file.

To Reproduce
Raise exception halfway through writing to csv

Expected behavior
No change is made if the writing fails

Additional context
This could be fixed by changing which component is responsible for serialising the data, so that the write itself is as bug-free as possible: once the write starts, it only writes data and performs no calculations (which is how it currently works).
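A different, commonly used approach is to write to a temporary file and rename it over the original only once the write has succeeded. The sketch below illustrates that technique; it is an alternative to the fix suggested above, not the project's implementation, and write_rows_atomically is a hypothetical name.

```python
import csv
import os
import tempfile
from typing import Iterable, Sequence


def write_rows_atomically(path: str, rows: Iterable[Sequence[str]]) -> None:
    """Write rows to a temp file in the same directory, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as handle:
            csv.writer(handle).writerows(rows)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace swaps the file in one step when both paths share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Any failure leaves the original file untouched; discard the partial copy.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If an exception is raised halfway through, the existing csv keeps its previous contents.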

Warn user on a dirty commit

Is your feature request related to a problem? Please describe.
The commit in the job description may not be accurate, for example, if there are uncommitted changes.

Describe the solution you'd like
When submitting the job, if the commit is dirty, warn the user. If they are aware and wish to continue, slutil should allow it but mark the commit as dirty, maybe like: COMMIT: abc123[d] where [d] marks it as dirty.

Describe alternatives you've considered
Ignoring this and making the user take responsibility.
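A sketch of the dirty check and the proposed [d] marker, assuming dirtiness is detected with git status --porcelain (which also reports untracked files). is_dirty and format_commit are hypothetical names, and the user prompt is omitted.

```python
import subprocess


def is_dirty(run=subprocess.run) -> bool:
    """True when 'git status --porcelain' reports any change in the working tree."""
    result = run(
        ["git", "status", "--porcelain"],
        capture_output=True,
        text=True,
        check=True,
    )
    return bool(result.stdout.strip())


def format_commit(commit: str, dirty: bool) -> str:
    """Append the [d] marker proposed in this issue to a dirty commit."""
    return f"{commit}[d]" if dirty else commit
```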

Upload old jobs

Allow old jobs that were not submitted through slutil to be uploaded to it.

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
484698        test_4.sh        gpu        dcs         16  COMPLETED      0:0 
484698.batch      batch                   dcs         16  COMPLETED      0:0 
484698.exte+     extern                   dcs         16  COMPLETED      0:0 
484784        test_4.sh        gpu        dcs         16 CANCELLED+      0:0 
484784.batch      batch                   dcs         16  CANCELLED     0:15 
484784.exte+     extern                   dcs         16  COMPLETED      0:0 
484972       test_spar+        gpu        dcs         16  COMPLETED      0:0 
484972.batch      batch                   dcs         16  COMPLETED      0:0 
484972.exte+     extern                   dcs         16  COMPLETED      0:0 
485436       test_tran+        gpu        dcs         16    RUNNING      0:0 
485436.batch      batch                   dcs         16    RUNNING      0:0 
485436.exte+     extern                   dcs         16    RUNNING      0:0 
485437       chess_tra+        gpu        dcs         16 CANCELLED+      0:0 
485531        test_5.sh        gpu        dcs         16    PENDING      0:0 
485532       test_5_2.+        gpu        dcs         16    PENDING      0:0 
485533       test_5_2.+        gpu        dcs         16    PENDING      0:0 
485534        test_5.sh        gpu        dcs         16    PENDING      0:0 
485535        test_4.sh        gpu        dcs         16    PENDING      0:0 
485536        test_4.sh        gpu        dcs         16    PENDING      0:0 
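Importing from a listing like the one above could start with a parser along these lines. This is a sketch only: parse_job_table is a hypothetical name, it assumes the seven columns shown, and it keeps only top-level job rows. Names truncated with a trailing + are kept as printed because the full value cannot be recovered from this output.

```python
from typing import Dict, List


def parse_job_table(text: str) -> List[Dict[str, str]]:
    """Extract top-level jobs from a whitespace-aligned, seven-column listing."""
    jobs: List[Dict[str, str]] = []
    for line in text.splitlines():
        fields = line.split()
        # Header, separator and step rows (e.g. 484698.batch) are skipped:
        # they either lack a purely numeric id or have fewer than 7 fields.
        if len(fields) != 7 or not fields[0].isdigit():
            continue
        job_id, name, _partition, _account, _cpus, state, _exit_code = fields
        # A trailing '+' marks a value truncated to the column width.
        jobs.append({"job_id": job_id, "name": name, "state": state.rstrip("+")})
    return jobs
```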

Add step-by-step submit

Is your feature request related to a problem? Please describe.
The options for submitting a job are hard to remember when there are many of them.

Describe the solution you'd like
When a job is submitted with no information (just "slutil submit"), prompt the user for each field value in turn, similar to adding a user on Linux.

Describe alternatives you've considered
Improving existing documentation.

Additional context
Could create a more generic system that takes a dictionary of prompts, allowing it to be used throughout the codebase.
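The generic system might be as small as the sketch below. prompt_for_fields and the field names in the usage example are hypothetical; the ask parameter defaults to the built-in input so that tests can supply canned answers.

```python
from typing import Callable, Dict


def prompt_for_fields(prompts: Dict[str, str], ask: Callable[[str], str] = input) -> Dict[str, str]:
    """Ask for each field in order and return the stripped answers by field name."""
    return {field: ask(f"{text}: ").strip() for field, text in prompts.items()}
```

For example, prompt_for_fields({"sbatch_file": "Path to the .sbatch file", "description": "Description"}) would ask the two questions in that order.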

Add acceptance tests running on "real" slurm

Is your feature request related to a problem? Please describe.
Currently slutil is tested using a fake slurm interface. Testing with a real slurm daemon would give more assurance that the code actually works.

Describe the solution you'd like
Add a layer to the testing pyramid which runs the code with a functional slurm service (but non-timeconsuming jobs) and asserts that the system behaves appropriately. The tests will take longer to run than the unit tests so they should be separated.

Additional context
Use docker-compose to start a slurm cluster and install slutil on that cluster.

Better error output for when subprocess calls fail

Is your feature request related to a problem? Please describe.
Currently when a subprocess fails there is no context to the error, just a message from subprocess saying the exit code was non-zero.

Describe the solution you'd like
slutil should repackage the error to give the user an appropriate message that helps with debugging.
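One way to repackage such failures is to catch the subprocess exceptions and raise a single error type that carries the command, exit code and stderr. CommandError and run_command are hypothetical names used for this sketch.

```python
import subprocess
from typing import Sequence


class CommandError(RuntimeError):
    """Raised when an external command cannot be run or exits with an error."""


def run_command(argv: Sequence[str]) -> str:
    """Run a command and return its stdout, or raise CommandError with context."""
    try:
        result = subprocess.run(list(argv), capture_output=True, text=True, check=True)
    except FileNotFoundError as exc:
        raise CommandError(f"'{argv[0]}' was not found; is it installed and on PATH?") from exc
    except subprocess.CalledProcessError as exc:
        detail = (exc.stderr or "").strip() or "no error output"
        raise CommandError(
            f"'{' '.join(argv)}' exited with code {exc.returncode}: {detail}"
        ) from exc
    return result.stdout
```

The original exception stays attached through exception chaining, so a full traceback is still available when needed.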
