
docker-centos7-slurm's Introduction

Slurm on CentOS 7 Docker Image


This is an all-in-one Slurm installation. This container runs the following processes:

  • slurmd (The compute node daemon for Slurm)
  • slurmctld (The central management daemon of Slurm)
  • slurmdbd (Slurm database daemon)
  • slurmrestd (Slurm REST API daemon)
  • munged (Authentication service for creating and validating credentials)
  • mariadb (MySQL-compatible database)
  • supervisord (A process control system)

It also has the following Python versions installed using pyenv:

  • 3.6.15
  • 3.7.12
  • 3.8.12
  • 3.9.9
  • 3.10.0
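
The usual pyenv commands switch between them; a quick sketch (assuming pyenv is initialized in the container's shell):

pyenv versions        # list the interpreters baked into the image
pyenv global 3.9.9    # use 3.9.9 for subsequent python invocations
python --version      # Python 3.9.9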

Usage

There are multiple tags available. To use the latest available image, run:

docker pull giovtorres/docker-centos7-slurm:latest
docker run -it -h slurmctl --cap-add sys_admin giovtorres/docker-centos7-slurm:latest

The above command will drop you into a bash shell inside the container. Tini is responsible for init, and supervisord is the process control system. To view the status of all the processes, run:

[root@slurmctl /]# supervisorctl status
munged                           RUNNING   pid 23, uptime 0:02:35
mysqld                           RUNNING   pid 24, uptime 0:02:35
slurmctld                        RUNNING   pid 25, uptime 0:02:35
slurmd                           RUNNING   pid 22, uptime 0:02:35
slurmdbd                         RUNNING   pid 26, uptime 0:02:35
slurmrestd                       RUNNING   pid 456, uptime 0:02:00

In slurm.conf, the ControlMachine hostname is set to slurmctl. Since this is an all-in-one installation, the container's hostname must match ControlMachine. Therefore, you must pass -h slurmctl to docker at run time so that the hostnames match.
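
The relevant line in slurm.conf looks like this (an illustrative excerpt; the shipped file contains many more settings):

ControlMachine=slurmctl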

You can run the usual Slurm commands:

[root@slurmctl /]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 5-00:00:00      5   idle c[1-5]
debug        up 5-00:00:00      5   idle c[6-10]
[root@slurmctl /]# scontrol show partition normal
PartitionName=normal
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=5-00:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=5-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=c[1-5]
   PriorityJobFactor=50 PriorityTier=50 RootOnly=NO ReqResv=NO OverSubscribe=NO PreemptMode=OFF
   State=UP TotalCPUs=5 TotalNodes=5 SelectTypeParameters=NONE
   DefMemPerCPU=500 MaxMemPerNode=UNLIMITED

Building

Using Existing Tags

There are multiple versions of Slurm available, each with its own tag. To build a specific version of Slurm, check out the tag that matches that version and build the Dockerfile:

git clone https://github.com/giovtorres/docker-centos7-slurm
cd docker-centos7-slurm
git checkout <tag>
docker build -t docker-centos7-slurm .

Using Build Args

You can use docker's --build-arg option to customize the version of Slurm and the version(s) of Python at build time.

To specify the version of Slurm, assign a valid Slurm tag to the SLURM_TAG build argument:

docker build --build-arg SLURM_TAG="slurm-19-05-1-2" -t docker-centos7-slurm:19.05.1-2 .

To specify the version(s) of Python to include in the container, specify a space-delimited string of Python versions using the PYTHON_VERSIONS build argument:

docker build --build-arg PYTHON_VERSIONS="3.6 3.7" -t docker-centos7-slurm:py3 .
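
The two build arguments can also be combined in a single build (the tag name here is just an example):

docker build --build-arg SLURM_TAG="slurm-19-05-1-2" --build-arg PYTHON_VERSIONS="3.6 3.7" -t docker-centos7-slurm:custom .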

Using docker-compose

The included docker-compose file runs the cluster container in the background and uses data volumes to persist Slurm state between container runs. To start the cluster container, run:

docker-compose up -d
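
For reference, such a file has roughly the following shape (an illustrative sketch with hypothetical volume names, not the repository's exact contents; consult the docker-compose.yml in the repo):

version: "2.4"
services:
  slurm:
    image: giovtorres/docker-centos7-slurm:latest
    hostname: slurmctl
    init: true                        # run tini as PID 1
    stdin_open: true
    tty: true
    volumes:
      - slurm_state:/var/lib/slurmd   # hypothetical named volume
      - mysql_data:/var/lib/mysql     # hypothetical named volume
volumes:
  slurm_state:
  mysql_data: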

To execute commands in the container, use docker exec:

docker exec dockercentos7slurm_slurm_1 sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 5-00:00:00      5   idle c[1-5]
debug        up 5-00:00:00      5   idle c[6-10]

docker exec dockercentos7slurm_slurm_1 sbatch --wrap="sleep 10"
Submitted batch job 27

docker exec dockercentos7slurm_slurm_1 squeue
            JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               27    normal     wrap     root  R       0:07      1 c1

To attach to the bash shell inside the running container, run:

docker attach dockercentos7slurm_slurm_1

Press Ctrl-p, Ctrl-q to detach from the container without killing the bash process or stopping the container.

To stop the cluster container, run:

docker-compose down

Testing Locally

Testinfra is used to build and run a Docker container test fixture. Run the tests with pytest:

pytest -v
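
For illustration, a test in this style might look like the following (a minimal sketch assuming Testinfra's standard host fixture; the repository's actual tests live under tests/):

def test_slurmctld_is_running(host):
    # supervisord manages the daemons, so query it via Testinfra's supervisor module
    assert host.supervisor("slurmctld").is_running

def test_sinfo_lists_partitions(host):
    result = host.run("sinfo --noheader --format=%P")
    assert result.rc == 0
    assert "normal*" in result.stdout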

docker-centos7-slurm's People

Contributors

asmacdo, dependabot[bot], giovtorres, percyfal, stefan-k, tazend, uddmorningsun


docker-centos7-slurm's Issues

sacct not producing output due to missing mysql tables

I'm using the slurm container for various tests and would like to monitor the status of jobs using the sacct command. I fire up the container:

docker run -it -h ernie giovtorres/docker-centos7-slurm:latest

and submit a simple job:

[root@ernie /]# sbatch --wrap "sleep 60"

Submitted batch job 2

[root@ernie /]# squeue -l               
Fri Dec  8 09:41:47 2017
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
                 2    normal     wrap     root  RUNNING       0:08 5-00:00:00      1 c1

scontrol works fine:

[root@ernie /]# scontrol show job 2
JobId=2 JobName=wrap
   UserId=root(0) GroupId=root(0) MCS_label=N/A
   Priority=4294901759 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:12 TimeLimit=5-00:00:00 TimeMin=N/A
   SubmitTime=2017-12-08T09:41:39 EligibleTime=2017-12-08T09:41:39
   StartTime=2017-12-08T09:41:39 EndTime=2017-12-13T09:41:39 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2017-12-08T09:41:39
   Partition=normal AllocNode:Sid=ernie:1
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=c1
   BatchHost=localhost
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=500M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=500M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/
   StdErr=//slurm-2.out
   StdIn=/dev/null
   StdOut=//slurm-2.out
   Power=

However, sacct fails since the table 'slurm_acct_db.linux_job_table' doesn't exist:

[root@ernie /]# sacct         
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
[root@ernie /]# cat /var/log/slurm/slurmdbd.log |tail
[2017-12-08T09:42:11.526] debug4: This could happen often and is expected.
mysql_query failed: 1146 Table 'slurm_acct_db.linux_job_table' doesn't exist
insert into "linux_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, `partition`, node_inx, array_task_str, array_task_pending, tres_alloc, tres_req, work_dir) values (2, UNIX_TIMESTAMP(), 0, 4294967294, 0, 4294967294, 0, 1, 0, 0, 'c1', 0, 7200, 1512726099, 1512726099, 1512726099, 'wrap', 0, 1, 4294901759, 1, 1, 9223372036854776308, 'normal', '0', NULL, 0, '1=1,2=500,3=18446744073709551614,4=1,5=1', '1=1,2=500,4=1', '/') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_assoc=0, id_user=0, id_group=0, nodelist='c1', id_resv=0, timelimit=7200, time_submit=1512726099, time_eligible=1512726099, time_start=1512726099, mod_time=UNIX_TIMESTAMP(), job_name='wrap', track_steps=0, id_qos=1, state=greatest(state, 1), priority=4294901759, cpus_req=1, nodes_alloc=1, mem_req=9223372036854776308, id_array_job=0, id_array_task=4294967294, pack_job_id=0, pack_job_offset=4294967294, `partition`='normal', node_inx='0', array_task_str=NULL, array_task_pending=0, tres_alloc='1=1,2=500,3=18446744073709551614,4=1,5=1', tres_req='1=1,2=500,4=1', work_dir='/'
[2017-12-08T09:42:11.526] error: We should have gotten a new id: Table 'slurm_acct_db.linux_job_table' doesn't exist
[2017-12-08T09:42:11.526] error: It looks like the storage has gone away trying to reconnect
[the same failed insert is retried and logged again verbatim]
[2017-12-08T09:42:11.526] DBD_JOB_START: cluster not registered

I cloned the repo and modified some settings in slurm.conf, to no avail. I have little experience setting up Slurm, so I'm unsure what changes need to be applied.

The issue has been reported before (e.g. http://thread.gmane.org/gmane.comp.distributed.slurm.devel/6333 and https://bugs.schedmd.com/show_bug.cgi?id=1943), and one proposed solution is registering the cluster with sacctmgr, which creates the missing tables:

sacctmgr add cluster linux

sacctmgr add account none,test Cluster=linux \
  Description="none" Organization="none"

sacctmgr add user da DefaultAccount=test

However, the first command hangs in the container.
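
A detail that may be relevant: sacctmgr asks for confirmation before committing changes, which can look like a hang when the prompt goes unanswered. Its -i/--immediate flag skips the prompt (a guess, not a verified fix for this container):

sacctmgr -i add cluster linux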

Do you have any idea for a solution?

Cheers,

Per

signal handling

I think it might cause a problem that the /bin/bash process is PID 1 inside the container. When the container is requested to stop, it takes 10 seconds to shut down because bash does not propagate signals as it should, i.e., docker kills PID 1 inside the container.

Running supervisord directly as process 1 does not seem to mitigate the problem. The container still needs a proper init. See the citation below.

Supervisor is a client/server system that allows its users to monitor and control a number of processes on UNIX-like operating systems.

It shares some of the same goals of programs like launchd, daemontools, and runit. Unlike some of these programs, it is not meant to be run as a substitute for init as “process id 1”. Instead it is meant to be used to control processes related to a project or a customer, and is meant to start like any other program at boot time.

Suggestion:

Use the docker-compose 2.4 format with init: true, or include tini explicitly in the Dockerfile, prepended to the ENTRYPOINT.
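
The second option would look roughly like this in the Dockerfile (a sketch; tini's install path and the entrypoint script name are hypothetical):

ENTRYPOINT ["/usr/bin/tini", "--", "/usr/local/bin/docker-entrypoint.sh"]

The first option is just init: true on the service in a version "2.4" compose file, which makes Docker run tini as PID 1 for the container.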

How can I run multiple jobs at the same time?

I have submitted 2 jobs, but they don't run at the same time:

[root@slurmctl /]# sbatch -n1 --wrap="sleep 10"
Submitted batch job 1
[root@slurmctl /]# sbatch -n1 --wrap="sleep 10"
Submitted batch job 2
[root@slurmctl /]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2    normal     wrap     root PD       0:00      1 (Resources)
                 1    normal     wrap     root  R       0:09      1 c1

Job 2 waits until job 1 completes. Is there any way to let them run at the same time? Since there are 4 free nodes, I thought we could run 4 jobs at the same time.

cgroup.conf

I am wondering if this setup could be used to simulate a Slurm configuration using the task/affinity or task/cgroup mode for TaskPlugin. How do you feel, @giovtorres?

installing MPI

I would love to test some issues we see regarding the interplay of MPI and Slurm. I am a bit new to Docker, so I wonder how I would install, say, OpenMPI from the CentOS repos on the nodes that are spawned from the Dockerfile in this repo?
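
One plausible approach (a sketch, not project-endorsed) is to extend the image with a small Dockerfile; since all the "nodes" run inside the same container, installing the packages once covers every node:

FROM giovtorres/docker-centos7-slurm:latest
RUN yum install -y openmpi openmpi-devel && yum clean all
# CentOS 7 installs OpenMPI under /usr/lib64/openmpi; it is typically
# activated with `module load mpi/openmpi-x86_64`.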

jobs in debug partition run indefinitely

Hi,

this probably relates back to #38, which addressed issue #37. The partitions are up, but if you submit the example job to the debug partition, it runs indefinitely, e.g.:

docker exec dockercentos7slurm_slurm_1 sbatch --wrap="sleep 10" --partition debug

I added the following to tests/test_slurm.py:test_job_can_run:

time.sleep(2)
res = host.run(f'sacct -o State --parsable --noheader -j {jobid}')
assert "COMPLETE" in res.stdout

to verify that jobs complete; see the GitHub Actions output on my fork. Unfortunately, I have no immediate solution, but I thought I'd let you know. I make use of the debug partition in a CI test, so I will see if I can find a fix.

Cheers,

Per

MariaDB did not start

I tried to run the container with the following docker-compose.yml:

version: '3'

services:
  slurm:
    image: giovtorres/docker-centos7-slurm:latest
    build: .
    hostname: ernie
    stdin_open: true
    tty: true
    volumes:
      - ./volumes/lib:/var/lib/slurmd
      - ./volumes/spool:/var/spool/slurmd
      - ./volumes/log:/var/log/slurm
      - ./volumes/db:/var/lib/mysql

but it fails to start MariaDB:

$ docker-compose up
Creating docker-centos7-slurm_slurm_1 ... done
Attaching to docker-centos7-slurm_slurm_1
slurm_1  | - Initializing database
slurm_1  | - Database initialized
slurm_1  | - Updating MySQL directory permissions
slurm_1  | - Starting MariaDB to create Slurm account database
slurm_1  | 191017 08:41:46 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
slurm_1  | 191017 08:41:46 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
slurm_1  | - Starting MariaDB to create Slurm account database
slurm_1  | [the line above repeats roughly 30 times while the startup script waits for MariaDB]
slurm_1  | MariaDB did not start
docker-centos7-slurm_slurm_1 exited with code 1

Did I miss something?

Hi,

Hi,
Thank you for this container; it's very useful. I would like to compose it with my Python application image (which has a lot of tool dependencies) that needs Slurm and MySQL.
How can I do that? I probably need to modify the docker-compose.yml to add my application as a service?
Any advice is welcome. Best,
Véronique

slurmctld not running

Hi,

Using the command

docker run -it -h ernie giovtorres/docker-centos7-slurm:17.02.9

the program 'slurmctld' is not running:

[root@ernie /]# supervisorctl status
munged                           RUNNING   pid 263, uptime 0:00:17
mysqld                           RUNNING   pid 491, uptime 0:00:13
slurmctld                        EXITED    Nov 13 12:50 PM
slurmd                           RUNNING   pid 262, uptime 0:00:17
slurmdbd                         RUNNING   pid 266, uptime 0:00:17

I am using Docker 17.05.0-ce on Ubuntu 16.04.

Best regards,
Bernd

Can't connect to local MySQL server through socket '/run/mysqld/mysqld.sock'

Thank you for your work. I tried to convert the Dockerfile to Ubuntu format.

But I keep getting the following error from mysql:

$ /usr/bin/mysqld_safe
220422 10:39:58 mysqld_safe Logging to syslog.
220422 10:39:58 mysqld_safe Starting mariadbd daemon with databases from /var/lib/mysql
$ mysql
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/run/mysqld/mysqld.sock' (2)

Do you have any idea how could this be solved?

Update dockerhub to match README

Using the Docker Hub README, I couldn't get the cluster to start.

README (Works great, thanks!):
docker run -it -h slurmctl --cap-add sys_admin giovtorres/docker-centos7-slurm:latest

Docker Hub (failure output below):
docker run -it -h ernie giovtorres/docker-centos7-slurm:latest

This fails; I suppose the hostname must be slurmctl.

- Initializing database
- Database initialized
- Starting MariaDB to create Slurm account database
231031 13:57:20 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
231031 13:57:20 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
- Starting MariaDB to create Slurm account database
- Creating Slurm acct database
- Slurm acct database created. Stopping MariaDB
- Starting supervisord process manager
- Starting munged
munged: started
- munged is in the RUNNING state.
- Starting mysqld
mysqld: started
- mysqld is in the RUNNING state.
- Starting slurmdbd
slurmdbd: started
- slurmdbd is in the RUNNING state.
- Starting slurmctld
slurmctld: ERROR (spawn error)
- slurmctld is in the BACKOFF state.
- slurmctld is in the STARTING state.
- slurmctld is in the BACKOFF state.
- slurmctld is in the BACKOFF state.
- slurmctld is in the STARTING state.
- slurmctld is in the FATAL state.
- slurmctld is in the FATAL state.
- slurmctld is in the FATAL state.
- slurmctld is in the FATAL state.
- slurmctld is in the FATAL state.
- slurmctld is in the FATAL state.
- Starting slurmd
slurmd: ERROR (spawn error)
- slurmd is in the BACKOFF state.
- slurmd is in the STARTING state.
- slurmd is in the BACKOFF state.
- slurmd is in the BACKOFF state.
- slurmd is in the STARTING state.
- slurmd is in the FATAL state.
- slurmd is in the FATAL state.
- slurmd is in the FATAL state.
- slurmd is in the FATAL state.
- slurmd is in the FATAL state.
- slurmd is in the FATAL state.
- Port 6817 is not listening
[the line above repeats while the startup script polls port 6817]
- Port 6818 is not listening
[the line above repeats while the startup script polls port 6818]
- Port 6819 is listening
- Waiting for the cluster to become available
sinfo: error: get_addr_info: getaddrinfo() failed: Name or service not known
sinfo: error: slurm_set_addr: Unable to resolve "slurmctl"
sinfo: error: Unable to establish control machine address
slurm_load_partitions: No such file or directory
[the four lines above repeat as the script retries sinfo]

Slurm partitions failed to start successfully.

(auto)update docker hub images?

At the moment, the tagged images are 2 years old, although recent PRs were merged. It would be nice to get them updated. I see two possible ways to set up auto-updating (to avoid manual pains).

WDYT, @giovtorres?

Connecting to pyslurm

Hello Giovanni!

I find this Docker container really useful, and I am interested in using it for development with pyslurm as well. I am not sure how to connect the two, though. I cannot install pyslurm with "pip install pyslurm" inside this container. Do you have any suggestions about how to make it work with this (the Slurm headers and libraries) and pyslurm?

Have a nice day!
Jakob
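
A hedged pointer for readers hitting the same wall: pyslurm releases track specific Slurm versions, so building from a source checkout whose branch matches the container's Slurm version may work where a plain pip install fails. The branch name below is a placeholder:

git clone -b <branch-matching-slurm-version> https://github.com/PySlurm/pyslurm
cd pyslurm
pip install .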

Node states are unknown

Hi,

I'm using docker-centos7-slurm to test a workflow manager. It has been a while since I last updated, but when trying out the most recent version, I noticed that only one node (c1) is up in the container. I am currently testing this in my fork (see PR #1). Briefly, I parametrized test_job_can_run to pass the partition to the --partition option. The normal partition works as expected, but debug fails.

If one enters the latest image with

docker run -it -h slurmctl giovtorres/docker-centos7-slurm:latest

running sinfo yields

NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
c1             1   normal*        idle 1       1:1:1   1000        0      1   (null) none                
c2             1   normal*    unknown* 1       1:1:1   1000        0      1   (null) none                
c3             1     debug    unknown* 1       1:1:1   1000        0      1   (null) none                
c4             1     debug    unknown* 1       1:1:1   1000        0      1   (null) none                

See the GitHub Actions results, where I added some print statements to see what was going on (never mind that the test actually passed; I was simply looking at the erroneous Slurm output file). I consistently get the feedback that the required nodes are not available; it would seem node c1 is the only node available to sbatch.

Are you able to reproduce this?

Cheers,

Per

srun: job 1 queued and waiting for resources

Everything is OK, and I try to place a job with the srun command: srun -n 32 slurmctl
But the container displays "srun: job 1 queued and waiting for resources" forever. What's the reason?

Lmod / module support

I just added Lmod support and an example hello modulefile that adds a hello-world script that does the obvious.

While this isn't exactly Slurm, I guess many Slurm setups use this, so would you be interested in taking it upstream?

It can be found here: https://github.com/AaltoSciComp/docker-centos7-slurm/ (on master currently; no direct link to the commit since I am likely to rebase)
