slurm-mail's People

Contributors

danbarke, drhey, hakasapl, hugoch, jcklie, jitkang, langefa, mrgum, neilmunday, sdx23, thgeorgiou

slurm-mail's Issues

Slurm Results E-Mail cannot resolve node name?

Versions

OS version: CentOS 7
Slurm version: 22.05
Slurm Mail version: 4.1 Snapshot

Describe the bug

It looks like some variables cannot be resolved when looking up results files.

Logs

The last 25 lines of /x/y/f1000_transformer/logs/F1000_Training_F1000_Training.31224.%N.log are shown below:

slurm-mail: file /x/y/f1000_transformer/logs/F1000_Training_F1000_Training.31224.%N.log does not exist

Time limit reached e-mail intro needs correcting

With version 4.1 (and earlier), when a job reaches its time limit the text in the job-ended e-mail is as follows:

Your job $JOBID has time limit reached on $CLUSTER.

This should be adjusted to read:

Your job $JOBID has reached its time limit on $CLUSTER

slurm-send-mail log file is only created if it already exists

This block of code in slurm-send-mail.py controls whether a log file is used or not:

    if log_file and log_file.is_file():
        logging.basicConfig(
            format=log_format, datefmt=log_date, level=log_level,
            filename=log_file
        )
    else:
        logging.basicConfig(
            format=log_format, datefmt=log_date, level=log_level
        )

This code means that the log file is only used if the log file already exists!
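
A minimal sketch of a possible fix: logging.basicConfig() creates the file itself when given a filename, so the is_file() check can simply be dropped (assuming log_file is either a path or None):

    # Hypothetical fix: only check that a log file path was configured;
    # logging will create the file on first use.
    if log_file:
        logging.basicConfig(
            format=log_format, datefmt=log_date, level=log_level,
            filename=log_file
        )
    else:
        logging.basicConfig(
            format=log_format, datefmt=log_date, level=log_level
        )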

slurm-spool-mail.py set to debug log level

In slurm-spool-mail.py, the log level is hard coded to debug.

To resolve this, add a common logLevel configuration option to conf.d/slurm-mail.conf that can be used by both slurm-spool-mail.py and slurm-send-mail.py to set the logging level. For slurm-send-mail.py, its --verbose command-line option should override the value in the configuration file.
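
A hedged sketch of how such a shared option might be read (the [common] section name and option layout are assumptions, not the project's actual config):

    # Hypothetical: read a logLevel option shared by both scripts.
    import configparser
    import logging

    config = configparser.ConfigParser()
    config.read("conf.d/slurm-mail.conf")
    level_name = config.get("common", "logLevel", fallback="INFO")
    log_level = getattr(logging, level_name.upper(), logging.INFO)
    # In slurm-send-mail.py, --verbose would then override this:
    # if args.verbose: log_level = logging.DEBUG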

KeyError: 'StdErr' with srun

Versions

OS version: Alma Linux 8.6
Slurm version: 20.11.9
Slurm Mail version: main (ca49444)

Describe the bug

Starting an interactive (but nonetheless mail-requesting) job with srun --mail-type=... won't send mails.

Logs

2022/06/20 17:28:01:ERROR: Failed to process: /var/spool/slurm-mail/67116962_1655738726.7596319.mail
2022/06/20 17:28:01:ERROR: 'StdErr'
Traceback (most recent call last):
  File "/opt/slurm-mail/bin/slurm-send-mail.py", line 827, in <module>
    process_spool_file(f, smtp_conn)
  File "/opt/slurm-mail/bin/slurm-send-mail.py", line 501, in process_spool_file
    job.stderr = scontrol_dict['StdErr']
KeyError: 'StdErr'

That happens because srun jobs don't seem to have StdErr, StdIn and StdOut set:

Command=hostname
WorkDir=/home/example
Power=
MailUser=example

compared to using sbatch:

Command=/home/example/testscript.sh
WorkDir=/home/example/
StdErr=/home/example/slurm/output67116921_1.out
StdIn=/dev/null
StdOut=/home/example/slurm/output67116921_1.out
Power=
MailUser=example
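
One possible hedged fix is to treat the Std* fields as optional when parsing the scontrol output, for example:

    # Hypothetical fix: srun jobs may not define StdErr/StdIn/StdOut,
    # so fall back to a placeholder instead of raising KeyError.
    job.stderr = scontrol_dict.get('StdErr', '?')
    job.stdin = scontrol_dict.get('StdIn', '?')
    job.stdout = scontrol_dict.get('StdOut', '?')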

Incorrect JOB_NAME in subject and body

Versions

OS version: Ubuntu 20.04
Slurm version: 21.08.8-2
Slurm Mail version: HEAD @32f9f31

Describe the bug

JOB_NAME seems to always appear as "allocation" instead of the actual job name.

The email subject was set up as: emailSubject = Job $JOB_NAME ($JOB_ID): $STATE

The resulting email shows "allocation" for the job name in the subject (and body): "Job allocation (2467): Ended"

Logs

2023/03/01 01:14:01:DEBUG: Called with: ['/usr/bin/slurm-spool-mail', '-s', 'Slurm Job_id=2467 Name=tmpmi43oxw_.sh Ended, Run time 00:00:01, COMPLETED, ExitCode 0', '...']
2023/03/01 01:14:01:DEBUG: info str: Slurm Job_id=2467 Name=tmpmi43oxw_.sh Ended

Mail not being sent on Ubuntu 20.04

Versions

OS version: Ubuntu 20.04
Slurm version: slurm 22.05.2
Slurm Mail version: 4.4

Describe the bug

I installed Slurm Mail with the .deb package, edited the config file, and can see the cron job running and emails being added to the spool dir and spool log. However, I never receive the emails, and the send-mail log remains empty. I am able to send mail from the server via sendmail and have confirmed that port 25 is open between our server and the mail server.

Logs

slurmctl.log

[2023-03-28T14:09:06.068] _slurm_rpc_submit_batch_job: JobId=453 InitPrio=4294901687 usec=544
[2023-03-28T14:09:06.458] sched/backfill: _start_job: Started JobId=453 in batch on GPU-SERVER
[2023-03-28T14:09:07.632] _job_complete: JobId=453 WEXITSTATUS 0
[2023-03-28T14:09:07.633] _job_complete: JobId=453 done

cron logs

Mar 28 14:07:01 SLURM-HEAD-NODE CRON[1000350]: (root) CMD (   /usr/bin/slurm-send-mail)
Mar 28 14:07:03 SLURM-HEAD-NODE sSMTP[1000449]: Sent mail for [email protected] (221 2.0.0 PLOPPAGENT02.OUR.RESOURCE.DOMAIN.COM closing connection) uid=0 username=root outbytes=739
Mar 28 14:07:03 SLURM-HEAD-NODE CRON[1000348]: pam_unix(cron:session): session closed for user root

spool-mail.log

2023/03/28 14:09:07:DEBUG: Called with: ['/usr/bin/slurm-spool-mail', '-s', 'Slurm Job_id=453 Name=mail_test Ended, Run time 00:00:01, COMPLETED, ExitCode 0', '[email protected]']
2023/03/28 14:09:07:DEBUG: info str: Slurm Job_id=453 Name=mail_test Ended
2023/03/28 14:09:07:DEBUG: Job ID: 453
2023/03/28 14:09:07:DEBUG: State: Ended
2023/03/28 14:09:07:DEBUG: Array Summary: False
2023/03/28 14:09:07:DEBUG: E-mail to: [email protected]
2023/03/28 14:09:07:INFO: writing file: /var/spool/slurm-mail/453_1680012547.8007495.mail

And, of course, there is nothing in send-mail.log.

Add CPU and memory utilisation to e-mails

As per the suggestion in discussion #28, add CPU and memory utilisation information to e-mails.

sacct can be used to get this information, e.g.

sacct -j $job_id --fields=jobid,totalcpu,elapsed,reqmem,maxrss
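
A hedged sketch of collecting and parsing that output (the job ID and field selection are illustrative):

    # Hypothetical: run sacct with pipe-delimited output and map the
    # requested fields to their values for each job record.
    import subprocess

    job_id = 12345  # example job ID
    fields = ["jobid", "totalcpu", "elapsed", "reqmem", "maxrss"]
    cmd = ["sacct", "-j", str(job_id), "-P", "-n",
           "--fields=" + ",".join(fields)]
    output = subprocess.check_output(cmd, universal_newlines=True)
    for line in output.splitlines():
        usage = dict(zip(fields, line.split("|")))
        print(usage["jobid"], usage["totalcpu"], usage["maxrss"])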

slurm mail Err: cancelled jobs that are pending

Versions

OS version: CentOS 7
Slurm version: 22.05.2
Slurm Mail version: 3.1

Describe the bug

Our site recently upgraded from Slurm 21.08.8 to 22.05.2 and noticed multiple errors within slurm-send-mail.py when users cancel pending jobs. We were able to reproduce the error with versions 3.1 and 3.5 of Slurm Mail.

Have a user submit a job with --mail-type=ALL. While the job is pending, cancel it and check the slurm-send-mail.py logs.

We noticed that prior to 22.05.2, sacct's Start field would show 'Unknown' for a cancelled job; with 22.05.2, it shows 'None'.

Logs

The error we have received within slurm-send-mail.py:

2022/08/25 11:10:02:ERROR: Failed to process: /var/spool/slurm-mail/307964_1661442566.5445702.mail
2022/08/25 11:10:02:ERROR: invalid literal for int() with base 10: 'None'
Traceback (most recent call last):
  File "/opt/slurm-mail/bin/slurm-send-mail.py", line 811, in <module>
    process_spool_file(f)
  File "/opt/slurm-mail/bin/slurm-send-mail.py", line 449, in process_spool_file
    job.start_ts = sacct_dict['Start']
  File "/opt/slurm-mail/bin/slurm-send-mail.py", line 171, in start_ts
    self.__start_ts = int(ts)
ValueError: invalid literal for int() with base 10: 'None'

We upgraded on 8/8. Using sacct, we compared a pending job that a user cancelled with version 21.08.8 to the same pending job cancellation with 22.05.2

21.08.8

Submit | Partition | Start | End | State | JobID
2022-08-04T12:25:43 | general | Unknown | 2022-08-04T12:26:10 | CANCELLED | 292735

22.05.2

Submit | Partition | Start | End | State | JobID
2022-08-08T16:21:33 | general | None | 2022-08-08T16:24:46 | CANCELLED | 299636
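
A hedged sketch of defensive parsing that would cover both spellings (attribute names taken from the traceback above):

    # Hypothetical: treat 'None' (Slurm >= 22.05) and 'Unknown'
    # (earlier releases) as "no start time" rather than passing
    # them to int().
    start = sacct_dict['Start']
    if start in ('None', 'Unknown'):
        job.start_ts = None
    else:
        job.start_ts = start  # the start_ts setter calls int()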

Merge 3.0 branch into main

Once all tasks for version 3.0 have been committed to the 3.0 branch, merge it into main and publish a release.

Improve coverage of unit tests

The unit tests in version 4.1 only cover 29% of the code base.

Extend the unit tests to cover at least 85% of the code base.

slurmctld dependency

When installing slurm-mail with zypper on openSUSE 15.3/15.4, the dependency slurm-slurmctld is not found:

# zypper install slurm-mail
Loading repository data...
Reading installed packages...
Resolving package dependencies...

Problem: nothing provides 'slurm-slurmctld' needed by the to be installed slurm-mail-3.6-1.sl15.noarch
 Solution 1: do not install slurm-mail-3.6-1.sl15.noarch
 Solution 2: break slurm-mail-3.6-1.sl15.noarch by ignoring some of its dependencies

slurmctld is actually provided by the package slurm, see https://documentation.suse.com/sle-hpc/15-SP3/html/hpc-guide/cha-slurm.html. Similarly, it looks like the correct package for Ubuntu is slurmctld.

Standardise Python string format

There is a mixture of f-string and str.format usage. As f-strings require at least Python 3.6, we should standardise on str.format for now.
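
For illustration, the two styles side by side:

    job_id = 123
    # Preferred for now (works on Python < 3.6):
    print("Job {0} ended".format(job_id))
    # To be avoided until Python 3.6 is the minimum:
    # print(f"Job {job_id} ended")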

Variable Node in filename not resolved

Versions

OS version: CentOS 7
Slurm version: 22.05.02
Slurm Mail version: 4.1

Describe the bug

The variable Node in the log filename is not resolved, so that the log is not displayed correctly in the e-mail.

The last 25 lines of <path>_Training_.31224.%N.log are shown below:

slurm-mail: file <path>__Training.31224.%N.log does not exist
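
A hedged sketch of expanding the pattern before reading the file (per the sbatch man page, %N is the name of the first node in the job; the node name could come from sacct's NodeList field):

    # Hypothetical: expand Slurm filename patterns before opening the log.
    def expand_filename_pattern(path: str, node_name: str) -> str:
        return path.replace("%N", node_name)

    log_path = expand_filename_pattern(
        "/x/y/logs/Training.31224.%N.log", "node49"  # illustrative values
    )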

Add support for additional mail types

Allow Slurm Mail to handle these additional e-mail types:

  • FAIL
  • REQUEUE
  • INVALID_DEPEND
  • STAGE_OUT
  • TIME_LIMIT
  • TIME_LIMIT_90
  • TIME_LIMIT_80
  • TIME_LIMIT_50
  • ARRAY_TASKS

(as defined in the sbatch man page for Slurm 21)

Make the format of e-mail subject configurable

Add setting to conf.d/slurm-mail.conf to allow the format of the e-mail subject to be customised.

E.g.

[slurm-send-mail]
subjectFormat = Slurm job $jobId: $jobName

slurm-mail would then substitute the variables above with the corresponding values from the job.
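
A minimal sketch of the substitution, assuming Python's string.Template syntax for the $ variables:

    # Hypothetical: substitute $jobId/$jobName in the configured subject.
    from string import Template

    subject_format = "Slurm job $jobId: $jobName"  # from slurm-mail.conf
    subject = Template(subject_format).safe_substitute(
        jobId=12345, jobName="example_job"
    )
    # -> "Slurm job 12345: example_job"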

Cannot build RPM on CentOS with new version if slurmmail is already installed

Versions

OS version: CentOS 7
Slurm Mail version: 4.1

Describe the bug

I tried building a 4.1 RPM on my Slurm login node, where slurmmail 4.0 is installed. I cloned the repository and built my RPM, but it was built as version 4.0. This was strange, so debugging led me to the following: get-property.py and process-template.py include slurmmail, but because we append to sys.path instead of prepending, the globally installed slurmmail is found first, so it uses VERSION=4.0 and not VERSION=4.1.

This can be tested by having a RHEL node where slurmmail is installed and running get-property.py version.

The fix is easy: one just needs to prepend to sys.path instead of appending. I will send a PR once I am on my PC at home.
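
The described fix, sketched (the repository layout is an assumption):

    import pathlib
    import sys

    repo_root = pathlib.Path(__file__).resolve().parent.parent  # illustrative
    # Prepend so the in-repo slurmmail wins over any globally
    # installed copy:
    sys.path.insert(0, str(repo_root / "src"))
    import slurmmail  # now resolves to the checkout, not the installed package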

Add integration tests for all OSes

At present the integration tests only test Slurm-Mail under Rocky 8.

Add support for the following OSes for the integration tests:

  • CentOS 7
  • Rocky 9
  • SUSE 15
  • Ubuntu 20
  • Ubuntu 22

cron.d files not used in CentOS 7

Versions

OS version: CentOS 7
Slurm version: 22.05.2
Slurm Mail version: 4.0

Describe the bug

The slurm-mail job in cron.d is not picked up. After I add a trailing newline to the file, it works; there are some pages on the internet describing this behaviour. It is a really arcane issue and a nasty bug, as I did not see any log output pointing to the cause.

Lots of slurm-send-mail.py stuck in cron

  • OS: RHEL 8.1
  • Slurm: 20.02.7
  • Python: 3.8.8

Just installed, and made sure to update slurm-mail.conf. Set up slurm.conf to define MailProg=/opt/slurm-mail/bin/slurm-spool-mail.py and set up the crontab.

But, I am not receiving any email. Slurm's default mail, and smail both work. I tested these before trying slurm-mail.

When I do systemctl status crond there are multiple slurm-send-mail.py entries (truncated here):

   CGroup: /system.slice/crond.service
           ├─ 1851 /usr/sbin/CROND -n
           ├─ 1852 python3 /opt/slurm-mail/bin/slurm-send-mail.py
           ├─ 1853 /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t -f root
           ├─ 1854 /cm/shared/apps/slurm/current/bin/sacct -j 841776 -p -n --fields=JobId,Partition,JobName,Start,End,St>
           ├─ 1857 /usr/sbin/postdrop -r
           ├─ 5358 /usr/sbin/CROND -n
           ├─ 5359 python3 /opt/slurm-mail/bin/slurm-send-mail.py
           ├─ 5360 /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t -f root
           ├─ 5362 /cm/shared/apps/slurm/current/bin/sacct -j 841776 -p -n --fields=JobId,Partition,JobName,Start,End,St>
           ├─ 5370 /usr/sbin/postdrop -r
           ├─ 7143 /usr/sbin/crond -n
           ├─ 8319 /usr/sbin/CROND -n
           ├─ 8320 python3 /opt/slurm-mail/bin/slurm-send-mail.py
           ├─ 8321 /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t -f root
           ├─ 8322 /cm/shared/apps/slurm/current/bin/sacct -j 841776 -p -n --fields=JobId,Partition,JobName,Start,End,St>
           ├─ 8325 /usr/sbin/postdrop -r
           ├─11917 /usr/sbin/CROND -n
           ├─11918 python3 /opt/slurm-mail/bin/slurm-send-mail.py
           ├─11919 /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t -f root
           ├─11920 /cm/shared/apps/slurm/current/bin/sacct -j 841776 -p -n --fields=JobId,Partition,JobName,Start,End,St>
           ├─11923 /usr/sbin/postdrop -r
           ├─14280 /usr/sbin/CROND -n
           ├─14281 python3 /opt/slurm-mail/bin/slurm-send-mail.py
           ├─14282 /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t -f root
           ├─14283 /cm/shared/apps/slurm/current/bin/sacct -j 841776 -p -n --fields=JobId,Partition,JobName,Start,End,St>
           ├─14286 /usr/sbin/postdrop -r
           ├─17710 /usr/sbin/CROND -n
           ├─17711 python3 /opt/slurm-mail/bin/slurm-send-mail.py
           ├─17714 /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t -f root
           ├─17715 /cm/shared/apps/slurm/current/bin/sacct -j 841776 -p -n --fields=JobId,Partition,JobName,Start,End,St>
           ├─17718 /usr/sbin/postdrop -r
           ├─19948 /usr/sbin/CROND -n
           ├─19949 python3 /opt/slurm-mail/bin/slurm-send-mail.py
           ├─19950 /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t -f root
           ├─19951 /cm/shared/apps/slurm/current/bin/sacct -j 841776 -p -n --fields=JobId,Partition,JobName,Start,End,St>
           ├─19954 /usr/sbin/postdrop -r
           ├─23375 /usr/sbin/CROND -n
           ├─23376 python3 /opt/slurm-mail/bin/slurm-send-mail.py
           ├─23380 /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t -f root
           ├─23381 /cm/shared/apps/slurm/current/bin/sacct -j 841776 -p -n --fields=JobId,Partition,JobName,Start,End,St>
           ├─23384 /usr/sbin/postdrop -r
           ├─25885 /usr/sbin/CROND -n
           ├─25886 python3 /opt/slurm-mail/bin/slurm-send-mail.py
           ├─25887 /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t -f root
           ├─25888 /cm/shared/apps/slurm/current/bin/sacct -j 841776 -p -n --fields=JobId,Partition,JobName,Start,End,St>
           ├─25891 /usr/sbin/postdrop -r
           ├─26883 /usr/sbin/CROND -n
           ├─26886 python3 /opt/slurm-mail/bin/slurm-send-mail.py
           ├─26887 /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t -f root
           ├─26888 /cm/shared/apps/slurm/current/bin/sacct -j 841776 -p -n --fields=JobId,Partition,JobName,Start,End,St>
           ├─26889 /usr/sbin/postdrop -r
           ├─29252 /usr/sbin/CROND -n
           ├─29253 python3 /opt/slurm-mail/bin/slurm-send-mail.py
           ├─29257 /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t -f root
           ├─29258 /cm/shared/apps/slurm/current/bin/sacct -j 841776 -p -n --fields=JobId,Partition,JobName,Start,End,St>
           ├─29261 /usr/sbin/postdrop -r
           ├─30508 /usr/sbin/CROND -n
           ├─30510 python3 /opt/slurm-mail/bin/slurm-send-mail.py
           ├─30519 /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t -f root
           ├─30520 /cm/shared/apps/slurm/current/bin/sacct -j 841776 -p -n --fields=JobId,Partition,JobName,Start,End,St>
           ├─30523 /usr/sbin/postdrop -r
           ├─31536 /usr/sbin/CROND -n

slurm-spool-mail.log shows entries like:

2021/06/30 12:03:33:DEBUG: Called with: ['/opt/slurm-mail/bin/slurm-spool-mail.py', '-s', 'Slurm Job_id=841785 Name=tstfoo_4node.sh Began, Queued time 00:00:03', 'dwc62']
2021/06/30 12:03:33:DEBUG: info str: Slurm Job_id=841785 Name=tstfoo_4node.sh Began
2021/06/30 12:03:33:DEBUG: Job ID: 841785
2021/06/30 12:03:33:DEBUG: Action: Began
2021/06/30 12:03:33:DEBUG: User: dwc62
2021/06/30 12:03:33:DEBUG: Job ID match, writing file /var/spool/slurm-mail/841785.Began.mail
2021/06/30 12:04:13:DEBUG: Called with: ['/opt/slurm-mail/bin/slurm-spool-mail.py', '-s', 'Slurm Job_id=841785 Name=tstfoo_4node.sh Ended, Run time 00:00:40, COMPLETED, ExitCode 0', 'dwc62']
2021/06/30 12:04:13:DEBUG: info str: Slurm Job_id=841785 Name=tstfoo_4node.sh Ended
2021/06/30 12:04:13:DEBUG: Job ID: 841785
2021/06/30 12:04:13:DEBUG: Action: Ended
2021/06/30 12:04:13:DEBUG: User: dwc62
2021/06/30 12:04:13:DEBUG: Job ID match, writing file /var/spool/slurm-mail/841785.Ended.mail
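
One common mitigation for overlapping cron runs (an assumption here, not necessarily the project's fix) is to serialise the job with flock, so a new run exits immediately if the previous one is still going:

    # /etc/cron.d/slurm-mail (hypothetical entry)
    * * * * * root flock -n /var/lock/slurm-send-mail.lock /opt/slurm-mail/bin/slurm-send-mail.py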

Cannot make slurm-mail send emails

If I execute /usr/bin/python /opt/slurm-mail/bin/slurm-send-mail.py, nothing happens. Any ideas? After launching a Slurm job there is no log activity and no emails.

apt remove failure

Versions

OS version: Ubuntu 20.04
Slurm version: 21.08.8-2
Slurm Mail version: slurm-mail_4.3ub20-ubuntu1_all.deb

Describe the bug

Running "apt remove slurm-mail" fails. Workaround is to manually rm -r /var/spool/slurm-mail and rerun apt remove.

Logs

# apt remove slurm-mail
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following packages will be REMOVED:
  slurm-mail
0 upgraded, 0 newly installed, 1 to remove and 51 not upgraded.
After this operation, 97.3 kB disk space will be freed.
Do you want to continue? [Y/n] y
(Reading database ... 219180 files and directories currently installed.)
Removing slurm-mail (4.3ub20-ubuntu1) ...
rm: cannot remove '/var/spool/slurm-mail': Is a directory
dpkg: error processing package slurm-mail (--remove):
 installed slurm-mail package post-removal script subprocess returned error exit status 1
dpkg: too many errors, stopping
Errors were encountered while processing:
 slurm-mail
Processing was halted because there were too many errors.
E: Sub-process /usr/bin/dpkg returned an error code (1)

Include some of user's job output in job completion e-mails

Add a config option to allow the last N lines of output to be included in job completion e-mails. This will only work where the file system containing the user's job output files is available to the host running slurm-mail.
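
A hedged sketch of collecting those lines:

    # Hypothetical: return the last num_lines lines of a job output file.
    from collections import deque

    def tail_file(path: str, num_lines: int) -> str:
        with open(path, "r", errors="replace") as output_file:
            return "".join(deque(output_file, maxlen=num_lines))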

get_kbytes_from_str: unknown unit 'C' for value '0c'

Hello,

I'm running version 3.1 of slurm-mail with verbose=true set in slurm-mail.conf.

Does anybody know what the ERROR message (from slurm-send-mail.log) reported below means?

Thank you!

2022/05/18 12:35:01:ERROR: get_kbytes_from_str: unknown unit 'C' for value '0c'
2022/05/18 12:35:01:DEBUG: Running /usr/local/bin/scontrol -o show job=33837
2022/05/18 12:35:01:DEBUG: Creating template for job 33837
2022/05/18 12:35:01:DEBUG: Creating e-mail signature template
2022/05/18 12:35:01:INFO: Sending e-mail to: XXXX using [email protected] for job 33837 (Ended) via SMTP server smtps.XXXX.XX:587
2022/05/18 12:35:01:INFO: Deleting: /var/spool/slurm-mail/33837_1652870050.0510437.mail
2022/05/18 12:36:01:INFO: processing: /var/spool/slurm-mail/33838_1652870160.1111429.mail
2022/05/18 12:36:01:DEBUG: Running /usr/local/bin/sacct -j 33838 -P -n --fields=JobId,User,Group,Partition,Start,End,State,ReqMem,MaxRSS,NCPUS,TotalCPU,NNodes,WorkDir,Elapsed,ExitCode,Comment,Cluster,NodeList,TimeLimit,TimelimitRaw,JobIdRaw,JobName
2022/05/18 12:36:01:DEBUG: 33838|XXXX|XXXXXXXXXX|1652870039|1652870160|COMPLETED|0c||1|00:00.018|1|/cluster/home/staff/XXXX/unix/tmp|00:02:01|0:0||cluster|node49|10:10:00|610|33838|TEST
33838.batch||||1652870039|1652870160|COMPLETED|0c|472K|1|00:00.018|1||00:02:01|0:0||cluster|node49|||33838.batch|batch
33838.extern||||1652870039|1652870160|COMPLETED|0c|0|1|00:00:00|1||00:02:01|0:0||cluster|node49|||33838.extern|extern
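
For context, sacct's ReqMem values carry a trailing 'c' (per core) or 'n' (per node), so '0c' is a zero per-core request. A hedged sketch of handling it (the project's real get_kbytes_from_str may differ):

    # Hypothetical: strip the per-core/per-node suffix before parsing
    # the size unit; '0c' then parses as plain zero.
    def get_kbytes_from_str(value: str) -> int:
        value = value.lower().rstrip("cn")
        multipliers = {"k": 1, "m": 1024, "g": 1024 ** 2, "t": 1024 ** 3}
        if value and value[-1] in multipliers:
            return int(float(value[:-1]) * multipliers[value[-1]])
        return int(float(value or 0))  # bare number: assumed to be KiB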

Computing "elapsed" breaks when start or end is none

Versions

OS version: CentOS 7
Slurm version: 22.05
Slurm Mail version: 4.1 Snapshot

Describe the bug

TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'. I think this is related to #58.

Logs

2022/11/02 12:42:03:ERROR: unsupported operand type(s) for -: 'int' and 'NoneType'
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/slurmmail/cli.py", line 587, in send_mail_main
    __process_spool_file(f, smtp_conn, options)
  File "/usr/lib/python3.6/site-packages/slurmmail/cli.py", line 275, in __process_spool_file
    job.save()
  File "/usr/lib/python3.6/site-packages/slurmmail/slurm.py", line 203, in save
    self.elapsed = (self.__end_ts - self.__start_ts)
TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'
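
A hedged sketch of a guard in Job.save() (method and attribute names taken from the traceback):

    # Hypothetical: a job cancelled while pending has no start time,
    # so only compute elapsed when both timestamps are known.
    if self.__start_ts is not None and self.__end_ts is not None:
        self.elapsed = self.__end_ts - self.__start_ts
    else:
        self.elapsed = None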

Add package removal to integration tests

The existing integration tests check that the packages install correctly, but given issue #78 the tests should also check that the packages uninstall correctly on the various operating systems supported by Slurm-Mail.

Deb package installs fail on Ubuntu

Versions

OS version: Ubuntu 20.04
Slurm version: 21.08.8-2
Slurm Mail version: HEAD @ 32f9f31

Describe the bug

The custom-built deb package fails to install cleanly. This is due to slurm-mail.postinst specifying #!/usr/bin/bash, which doesn't exist on Ubuntu 20.04. Could this be changed to #!/usr/bin/env bash for portability?

Logs

Setting up slurm-mail (4.2-ubuntu1) ...
dpkg (subprocess): unable to execute installed slurm-mail package post-installation script (/var/lib/dpkg/info/slurm-mail.postinst): No such file or directory
dpkg: error processing package slurm-mail (--configure):
installed slurm-mail package post-installation script subprocess returned error exit status 2
Errors were encountered while processing:
slurm-mail

Using scontrol for jobs that were canceled yields "invalid job id"

Versions

OS version: CentOS 7
Slurm version: 22.05.02
Slurm Mail version: 4.1 Snapshot

Describe the bug

Using scontrol for jobs that were canceled yields "invalid job id"

Logs

2022/11/02 12:51:03:INFO: processing: /var/spool/slurm-mail/31178_1667381825.054817.mail
2022/11/02 12:51:03:WARNING: job 31178: could not parse 'None' for job start timestamp
2022/11/02 12:51:03:ERROR: Failed to run: /usr/bin/scontrol -o show job=31178
2022/11/02 12:51:03:ERROR:
2022/11/02 12:51:03:ERROR: slurm_load_jobs error: Invalid job id specified

The output of the commands is

$ /usr/bin/scontrol -o show job=31178
slurm_load_jobs error: Invalid job id specified

$ sacct -j 31178 -P -n --fields=JobId,User,Group,Partition,Start,End,State,ReqMem,MaxRSS,NCPUS,TotalCPU,NNodes,WorkDir,Elapsed,ExitCode,Comment,Cluster,NodeList,TimeLimit,TimelimitRaw,JobIdRaw,JobName

31178|censored_user_name|domänen-benutzer|ukp|None|2022-11-02T10:37:04|CANCELLED by 1060117793|36G||4|00:00:00|1|censored_path|00:00:00|0:0||ukp-cluster|None assigned|3-00:00:00|4320|31178|vada
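
A hedged sketch of working around this: scontrol only knows about jobs still in slurmctld's memory, so fall back to sacct (which reads the accounting database) when it fails:

    # Hypothetical fallback from scontrol to sacct for finished jobs.
    import subprocess

    def get_job_info(job_id: int) -> str:
        try:
            return subprocess.check_output(
                ["scontrol", "-o", "show", "job={0}".format(job_id)],
                stderr=subprocess.STDOUT, universal_newlines=True,
            )
        except subprocess.CalledProcessError:
            return subprocess.check_output(
                ["sacct", "-j", str(job_id), "-P", "-n",
                 "--fields=JobId,State,Start,End"],
                universal_newlines=True,
            )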

Slurm-mail sends contradictory emails (subject says 'cancelled', body 'begin')

Versions

OS version: CentOS 7
Slurm version: 22.05
Slurm Mail version: 4.1 Snapshot

Describe the bug

Slurm-mail sends contradictory emails (subject says 'cancelled', body 'begin'):

From: UKP Slurm <[email protected]>
Subject: UKP Slurm - Job 31928: **cancelled**
Date: 7. November 2022 at 2:52:05 PM CET
To: <xyz>

Dear X,
Your job 31928 has **started** on ukp-cluster.
Details about the job can be found in the table below:

Logs

The slurm-mail log says the job started. I think the job waited a bit in the queue before it got to run.

slurm_load_jobs error: Invalid job id specified

The script used to work just fine, but it stopped about 10 days ago. Looking through /var/log/slurm-mail/slurm-send-mail.log I noticed the following errors:

2020/12/03 11:21:03:INFO: processing: /var/spool/slurm-mail/702517.Ended.mail
2020/12/03 11:21:04:ERROR: failed to run: /usr/local/bin/scontrol -o show job=702517
2020/12/03 11:21:04:ERROR:
2020/12/03 11:21:04:ERROR: slurm_load_jobs error: Invalid job id specified

2020/12/03 11:21:04:INFO: sending e-mail to: using xxx for job 702517 (Ended) via SMTP server smtp.gmail.com:587
2020/12/03 11:21:04:ERROR: failed to process: /var/spool/slurm-mail/702517.Ended.mail
2020/12/03 11:21:04:ERROR: Connection unexpectedly closed
Traceback (most recent call last):
  File "/opt/slurm-mail/bin/slurm-send-mail.py", line 546, in <module>
    s.login(smtpUserName, smtpPassword)
  File "/usr/lib/python3.6/smtplib.py", line 721, in login
    initial_response_ok=initial_response_ok)
  File "/usr/lib/python3.6/smtplib.py", line 631, in auth
    (code, resp) = self.docmd("AUTH", mechanism + " " + response)
  File "/usr/lib/python3.6/smtplib.py", line 421, in docmd
    return self.getreply()
  File "/usr/lib/python3.6/smtplib.py", line 394, in getreply
    raise SMTPServerDisconnected("Connection unexpectedly closed")
smtplib.SMTPServerDisconnected: Connection unexpectedly closed

The user's email address is correctly captured in the file "/var/spool/slurm-mail/702517.Ended.mail"

It seems that the command to get the job details (/usr/local/bin/scontrol -o show job=702517) fails for jobs that finished, but it works fine for jobs that are still running:

/usr/local/bin/scontrol -o show job=702588

slurm_load_jobs error: Invalid job id specified

/usr/local/bin/scontrol -o show job=702514

JobId=702514 JobName=python UserId=XXX(10526) GroupId=XXX(10541) MCS_label=N/A Priority=2004244 Nice=0 Account=XXX QOS=normal JobState=RUNNING Reason=None Dependency=(null) Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0 RunTime=14:23:57 TimeLimit=UNLIMITED TimeMin=N/A SubmitTime=2020-12-02T20:59:27 EligibleTime=2020-12-02T20:59:27 AccrueTime=Unknown StartTime=2020-12-02T20:59:27 EndTime=Unknown Deadline=N/A PreemptTime=None SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-12-02T20:59:27 Partition=cpu AllocNode:Sid=q:1004 ReqNodeList=(null) ExcNodeList=(null) NodeList=cpu1 BatchHost=cpu1 NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:: TRES=cpu=4,mem=16G,node=1,billing=7 Socks/Node=* NtasksPerN:B:S:C=0:0:: CoreSpec=* MinCPUsNode=4 MinMemoryNode=16G MinTmpDiskNode=0 Features=(null) DelayBoot=00:00:00 OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null) Command=python WorkDir=XXX Power=

We are running Slurm 18.08 on this cluster, but we have the same issue on our other cluster, where we run Slurm 19.05.3-2.

Convert username to email

The default email address for Slurm is just the username, if I'm not mistaken. I can go from username to email address via Python LDAP.

Is this something that can be done? Where exactly in the existing code should I add such a feature?
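
A hedged sketch using the third-party ldap3 package (the server URL, base DN and attribute name are all assumptions that would need to be site-configurable):

    # Hypothetical username -> e-mail lookup via LDAP.
    from ldap3 import ALL, Connection, Server

    def username_to_email(username: str) -> str:
        server = Server("ldap://ldap.example.com", get_info=ALL)
        with Connection(server, auto_bind=True) as conn:
            conn.search(
                "ou=people,dc=example,dc=com",       # assumed base DN
                "(uid={0})".format(username),
                attributes=["mail"],                 # assumed attribute
            )
            if conn.entries:
                return str(conn.entries[0].mail)
        return username  # fall back to the bare username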

Slurm does not send email at any state (Failed or BEGIN)

Versions

OS version: 18.04.6 LTS
Slurm version: 22.05.2
Slurm Mail version: 3.5

Describe the bug

I reconfigured slurm.conf to point to the location of the mail program; however, the logs do not show that the Python script is being called at any state. I tried running python3 slurm-spool-mail.py directly and it reported an invalid number of arguments, and the error was logged to the log file. Hence I realised that Slurm is not calling the Python script. Is there a test command line I can use to check whether the script itself is working? Thanks

Add error handling for SMTP connection

At the time of writing, any exceptions generated by this line:

smtp_conn.sendmail(email_from_address, user_email.split(","), msg.as_string())

are not trapped. Therefore, the errors are not logged.

Evidence of this can be found in issue #44
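
A minimal hedged sketch of trapping and logging the failure (variable names taken from the line above; assumes smtplib and logging are already imported):

    # Hypothetical: log send failures instead of letting them propagate.
    try:
        smtp_conn.sendmail(
            email_from_address, user_email.split(","), msg.as_string()
        )
    except (smtplib.SMTPException, OSError) as e:
        logging.error("Failed to send e-mail to %s: %s", user_email, e)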

Job Name not Parsed Properly

Greetings,

We just started implementing this on our HPC cluster, and everything works fine so far. The one minor issue we have is with the parsing of the job data.

The job name in the email is shown as follows, with a | character at the end of the job name:

Name: mail_test|

After digging into the script, we found that one simple change removes the | character: change the parameter -p to -P in slurm-send-mail.py's process_spool_file function:

cmd = (f"{sacct_exe} -j {first_job_id} -P -n --fields={field_str}")

This removes the trailing pipe from the output (sacct's -p/--parsable mode ends every field list with a '|', whereas -P/--parsable2 omits it) and thus eliminates the | character in the emails sent.

Tasks for version 4.0

The following tasks need to be completed for version 4.0:

  • Create setup.py to install Slurm-Mail
  • Install to standard OS location, e.g. /usr/bin, /etc, /usr/lib/python... etc.
  • Add unit tests
  • Update README

Start time and raw time limit cannot be parsed if they are "Partition_Limit" or None

Versions

OS version: CentOS 7
Slurm version: 22.05.02
Slurm Mail version: 4.1

Describe the bug

I see the following in my slurm mail log:

2022/11/01 14:14:01:INFO: processing: /var/spool/slurm-mail/31114_1667303920.1738784.mail
2022/11/01 14:14:01:WARNING: job 31114: could not parse 'None' for job start timestamp
2022/11/01 14:14:01:ERROR: Failed to process: /var/spool/slurm-mail/31114_1667303920.1738784.mail
2022/11/01 14:14:01:ERROR: invalid literal for int() with base 10: 'Partition_Limit'
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/slurmmail/cli.py", line 587, in send_mail_main
    __process_spool_file(f, smtp_conn, options)
  File "/usr/lib/python3.6/site-packages/slurmmail/cli.py", line 227, in __process_spool_file
    job.wallclock = int(sacct_dict['TimelimitRaw']) * 60
ValueError: invalid literal for int() with base 10: 'Partition_Limit'

Looking at the sacct output, I see

[root@wormulon slurm-mail]# sacct -j 31114 -P -n --fields=JobId,User,Group,Partition,Start,End,State,ReqMem,MaxRSS,NCPUS,TotalCPU,NNodes,WorkDir,Elapsed,ExitCode,Comment,Cluster,NodeList,TimeLimit,TimelimitRaw,JobIdRaw,JobName

31114|XXX|domänen-benutzer|yolo|None|2022-11-01T12:58:39|CANCELLED by 1060117917|32G||2|00:00:00|1|/mnt/beegfs/work/XXX/HGN/XYZ/HGN|00:00:00|0:0||ukp-cluster|None assigned|Partition_Limit|Partition_Limit|31114|multimodal_gated

It looks like this is an edge case for a job cancelled by the user: the start timestamp and time limit are not guaranteed to be integers, it seems.
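
A hedged sketch of parsing the time limit defensively (mirroring the code in the traceback):

    # Hypothetical: TimelimitRaw may be 'Partition_Limit' (and Start may
    # be 'None') for jobs cancelled while pending, so only convert
    # numeric values.
    raw_limit = sacct_dict['TimelimitRaw']
    if raw_limit.isdigit():
        job.wallclock = int(raw_limit) * 60
    else:
        job.wallclock = None  # e.g. look up the partition limit instead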

Issues installing and using slurm-mail cron on Ubuntu 20.04

Versions

OS version: Ubuntu 20.04
Slurm version: 21.08.8
Slurm Mail version: HEAD @ fd79873

Describe the bug

apt install with the deb package doesn't write /etc/cron.d/slurm-mail/slurm-mail or /etc/logrotate.d/slurm-mail/slurm-mail.
However, installing with dpkg --force-all works.

# apt install /tmp/slurm-mail_4.2-ubuntu1_all.deb
...
Setting up slurm-mail (4.2-ubuntu1)
W: Repository is broken: slurm-mail:amd64 (= 4.2-ubuntu1) has no Size information

# echo $?
0

Not sure if the repository size warning has anything to do with the issue, but the install reported success. Yet now I see this:

# ls /etc/cron.d/slurm-mail/slurm-mail
ls: cannot access '/etc/cron.d/slurm-mail/slurm-mail': No such file or directory
# ls /etc/logrotate.d/slurm-mail/slurm-mail
ls: cannot access '/etc/logrotate.d/slurm-mail/slurm-mail': No such file or directory

The following works around the issue:

dpkg --force-all -i  slurm-mail_4.2-ubuntu1_all.deb
...
# ls /etc/cron.d/slurm-mail/slurm-mail
/etc/cron.d/slurm-mail/slurm-mail

Additionally, /etc/cron.d/slurm-mail/slurm-mail seems to be an invalid location on Ubuntu. Once installed, the cron job is never triggered. According to https://superuser.com/questions/452085/is-it-possible-to-add-directories-under-cron-d#answer-452250, it is not possible to nest sub-folders under /etc/cron.d/. Moving the slurm-mail file out of the sub-folder fixes the issue.

Handle mail errors without infinite resend

Thanks for this awesome project! It was really easy to set up using your detailed documentation and I enjoy the much richer information now. I ran into a small issue: I send mails via an internal relay that only accepts mail to addresses of the form @my-university.com. This causes slurm-mail to die when sending to valid but external addresses, and it then tries to resend every minute. I thought of the following fixes:

  1. Expose the mail regex so it is user configurable (just solves my issue)
  2. Make deleting job spools on (smtp) errors configurable

I do not know what the best solution is; I think the first one is better. If you agree, I can send a PR. A sketch of option 1 is below.
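
A sketch of option 1, with the pattern exposed as a (hypothetical) config value:

    # Hypothetical: a site-configurable address pattern; anything that
    # does not match is rejected up front rather than retried forever.
    import re

    mail_regex = r".+@my-university\.com$"  # would come from slurm-mail.conf
    user_email = "someone@example.org"      # example external address

    if not re.match(mail_regex, user_email):
        print("address rejected by site policy; deleting spool file")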
