Comments (40)

rohitramu avatar rohitramu commented on August 28, 2024

Hi,

Could you please share the blueprint you used to deploy the cluster?

Kind regards,
Rohit


sharif-cameco avatar sharif-cameco commented on August 28, 2024

Hi

I did not understand. Did you mean the Terraform plan?

Regards
Sharif


sharif-cameco avatar sharif-cameco commented on August 28, 2024

SLURM1.20-Plan1.txt
SLURM1.20-Plan2.txt

Hi

I have attached two plans because the first attempt failed with an error: deploying VMs with public IPs is prohibited by our organization policy.

So I set disable_public_ip = true and deployed again.

Regards
Sharif


cboneti avatar cboneti commented on August 28, 2024

Hi, I believe Rohit meant the YAML input file to ghpc.


sharif-cameco avatar sharif-cameco commented on August 28, 2024

pre-commit-config.yaml.txt
Hi

Did you mean this file?

Regards
Sharif


cboneti avatar cboneti commented on August 28, 2024

When you deployed your environment, you did run:
ghpc create <blueprint.yaml> and then ghpc deploy <deployment folder>, right?
What we meant here is the <blueprint.yaml> file.
If you did not follow these steps, how did you deploy your environment?


sharif-cameco avatar sharif-cameco commented on August 28, 2024

Hi

I followed this documentation:
https://cloud.google.com/hpc-toolkit/docs/quickstarts/slurm-cluster

There is nothing in it about blueprint.yaml, so I just used the make command; I did not use the create command.

HPC-Deployment-Instruction

Regards
Sharif


cboneti avatar cboneti commented on August 28, 2024

After that step, there is a ./ghpc create examples/hpc-slurm.yaml -l ERROR --vars project_id=PROJECT_ID, followed by ./ghpc deploy hpc-small.
I am assuming you executed these as well.

In that case, I would recommend looking at the logs on the controller node (not on the login node), e.g. /var/log/slurm/resume.log, to see why the nodes are not coming up. This can be a quota issue, a network issue, etc.
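For reference, a minimal sketch of pulling those logs (the instance name is from this deployment; the log file names are assumed from the slurm-gcp defaults, so adjust to whatever is actually present under /var/log/slurm):

gcloud compute ssh hpcsmall-controller --zone <ZONE> --tunnel-through-iap

# On the controller:
sudo tail -n 100 /var/log/slurm/resume.log      # node power-up / bulkInsert attempts
sudo tail -n 100 /var/log/slurm/slurmctld.log   # controller daemon
sudo tail -n 100 /slurm/scripts/setup.log       # cluster setup / startup script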


sharif-cameco avatar sharif-cameco commented on August 28, 2024

Actually, we have an organization policy and a VPC Service Controls perimeter that don't allow us to deploy anything from just any project. So I copied the primary folder from hpc-small and ran it through our GitLab pipeline.


cboneti avatar cboneti commented on August 28, 2024

Sorry, I am not sure I am following.
The hpc-small/primary folder can be deployed directly through Terraform (the instructions are available under the hpc-small folder). Did the GitLab pipeline execute the equivalent commands (terraform init and terraform apply)? If so, that should still work, and you should be able to look at the logs on the controller.
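For example, under those assumptions a pipeline would boil down to something like (run from the generated deployment folder):

cd hpc-small/primary
terraform init
terraform validate
terraform apply   # add -auto-approve for a non-interactive pipeline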

That being said, with strict organization policies many things can happen, so the controller may fail to install due to a lack of service account permissions or the inability to connect to the internet (or at least to GCS buckets)...


sharif-cameco avatar sharif-cameco commented on August 28, 2024

Hi cboneti

Yes, the GitLab pipeline has all the steps to deploy with Terraform, i.e. terraform init and terraform apply.

All the necessary permissions are granted to the service accounts, and the VPC-SC ingress policy was also updated to allow deploying the VM image from outside the organization and to allow access to the bucket.

All the parameters from the YAML were changed in the Terraform code according to our organization's setup. One more parameter should be exposed in the YAML: the Filestore connect_mode, set to "PRIVATE_SERVICE_ACCESS". With the default, "DIRECT_PEERING", deployment to the shared network failed, and there is no option to set it in the YAML or in main.tf.

Do I need to change the Slurm config file to set the controller IP?

Regards
Sharif


sharif-cameco avatar sharif-cameco commented on August 28, 2024

Hi, I did not find any slurm.conf file in /etc/slurm.


sharif-cameco avatar sharif-cameco commented on August 28, 2024

[root@hpcsmall-controller slurm]# systemctl status -l slurmd
● slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2023-07-20 18:27:16 UTC; 3s ago
Process: 20902 ExecStart=/usr/local/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
Main PID: 20902 (code=exited, status=1/FAILURE)

Jul 20 18:27:16 hpcsmall-controller systemd[1]: Started Slurm node daemon.
Jul 20 18:27:16 hpcsmall-controller slurmd[20902]: slurmd: fatal: Unable to determine this slurmd's NodeName
Jul 20 18:27:16 hpcsmall-controller systemd[1]: slurmd.service: main process exited, code=exited, status=1/FAILURE
Jul 20 18:27:16 hpcsmall-controller systemd[1]: Unit slurmd.service entered failed state.
Jul 20 18:27:16 hpcsmall-controller systemd[1]: slurmd.service failed.
[root@hpcsmall-controller slurm]#


sharif-cameco avatar sharif-cameco commented on August 28, 2024

# slurm.conf
# https://slurm.schedmd.com/slurm.conf.html
# https://slurm.schedmd.com/configurator.html

ProctrackType=proctrack/cgroup
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmdPidFile=/var/run/slurm/slurmd.pid
TaskPlugin=task/affinity,task/cgroup
MaxNodeCount=64000

# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

# LOGGING AND ACCOUNTING
AccountingStoreFlags=job_comment
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmdDebug=info
DebugFlags=Power

# TIMERS
MessageTimeout=60

################################################################################
# vvvvv WARNING: DO NOT MODIFY SECTION BELOW vvvvv
################################################################################

SlurmctldHost=hpcsmall-controller(hpcsmall-controller.c.prj-n-005-cloudops-618d.internal)

AuthType=auth/munge
AuthInfo=cred_expire=120
AuthAltTypes=auth/jwt
CredType=cred/munge
MpiDefault=none
ReturnToService=2
SlurmctldPort=6820-6830
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm

# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=hpcsmall-controller
ClusterName=hpcsmall
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd-%n.log

# GENERATED CLOUD CONFIGURATIONS
include cloud.conf

################################################################################
# ^^^^^ WARNING: DO NOT MODIFY SECTION ABOVE ^^^^^
################################################################################


sharif-cameco avatar sharif-cameco commented on August 28, 2024

I set the value of SlurmctldHost to the VM IP, localhost, and 127.0.0.1, but nothing worked.


rohitramu avatar rohitramu commented on August 28, 2024

From reading the discussion above, this is my understanding of the steps you've taken so far:

  1. Followed the tutorial at https://cloud.google.com/hpc-toolkit/docs/quickstarts/slurm-cluster.
  2. Ran ./ghpc create examples/hpc-slurm.yaml --vars project_id=<PROJECT_ID>, where you replaced <PROJECT_ID> with your own project ID.
  3. In your GitLab pipeline, you ran terraform init and terraform apply on the folder (named "primary"), which was generated by the ghpc create command in the previous step.

Please correct me if I misunderstood any of these steps.

Did you make changes/edits to either the examples/hpc-slurm.yaml file (this is what we refer to as the "blueprint" file), or the generated "primary" folder? If so, what exactly were those changes?


sharif-cameco avatar sharif-cameco commented on August 28, 2024

Hi Rohit

All the points you mentioned are correct except number 2.

I did not change examples/hpc-slurm.yaml; instead I made changes to the Terraform code in the following files.

  1. variables.tf
    variables.tf.txt

  2. terraform.tfvars
    terraform.tfvars.txt

  3. main.tf
    main.tf.txt

The files are attached herewith. A few other small changes are as follows.

a. modules/embedded/modules/compute/vm-instance/variables.tf

variable "disable_public_ips" {
  description = "If set to true, instances will not have public IPs"
  type        = bool
  default     = true
}

b. modules/embedded/modules/file-system/filestore/variables.tf

variable "connect_mode" {
  description = "Used to select mode - supported values DIRECT_PEERING and PRIVATE_SERVICE_ACCESS."
  type        = string
  default     = "PRIVATE_SERVICE_ACCESS"
}

Regards
Sharif


rohitramu avatar rohitramu commented on August 28, 2024

Could you please share the logs on the controller VM in the /var/log/slurm folder?


sharif-cameco avatar sharif-cameco commented on August 28, 2024

Hi Rohit

The log files that have content are attached herewith.

resume.log.txt
slurmctld.log..txt

Regards

Sharif


rohitramu avatar rohitramu commented on August 28, 2024

I see this error in the file resume.log.txt:

bulkInsert API failures:

Followed by:

Request is prohibited by organization's policy.

Could you please double-check that the API serving bulkInsert (the Compute Engine API) is enabled in your project and that your service account has the necessary permissions to call it?
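A couple of hedged checks for that (project and service-account values below are placeholders):

gcloud services list --enabled --project <PROJECT_ID> \
  --filter="config.name:compute.googleapis.com"

gcloud projects get-iam-policy <PROJECT_ID> \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:<CONTROLLER_SA_EMAIL>" \
  --format="table(bindings.role)"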


tpdownes avatar tpdownes commented on August 28, 2024

@sharif-cameco another possible explanation for the organization policy rejection is that you are attempting to create a resource (probably the VM itself) in a region you're not allowed to create resources in.
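If useful, the effective location restriction on the project can be inspected via the standard gcp.resourceLocations constraint (only a sketch; it requires permission to read org policies):

gcloud resource-manager org-policies describe gcp.resourceLocations \
  --project <PROJECT_ID> --effective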

Have you made any progress since your last report?


sharif-cameco avatar sharif-cameco commented on August 28, 2024

egressViolations: [
  0: {
    servicePerimeter: "accessPolicies/994499912398/servicePerimeters/cameco_primary_perimeter"
    source: "projects/1093909091653"
    sourceType: "Resource"
    targetResource: "projects/pd-standard"
  }
]
resourceNames: [
  0: "projects/schedmd-slurm-public/global/images/family/slurm-gcp-5-7-hpc-centos-7"
  1: "pd-standard"
  2: "https://www.googleapis.com/compute/v1/projects/prj-n-shared-restricted-8521/regions/northamerica-northeast1/subnetworks/sb-n-shared-restricted-northamerica-northeast1"
  3: "https://www.googleapis.com/compute/v1/projects/prj-n-005-cloudops-618d/global/instanceTemplates/hpcsmall-compute-debug-ghpc-20230718182832730600000003"
]

Can you please provide me with the project number of the "schedmd-slurm-public" project? Otherwise I cannot allow the image source in the VPC-SC egress policy.


sharif-cameco avatar sharif-cameco commented on August 28, 2024

Hi Tom

I am actually creating the VMs in the same region where the controller and login nodes were created, so that is not the issue. I did not find any error in the GCP logs about an API failure on the controller, only the VPC-SC error about accessing the image.

Another issue might arise if the APIs use ports other than 80 and 443, because all other egress ports are blocked in the firewall.

Please suggest.

Regards
Sharif


tpdownes avatar tpdownes commented on August 28, 2024

Your error message (image denial) is either the explanation or part of it. I think you should be able to add the project as trusted with just the ID:

https://cloud.google.com/compute/docs/images/restricting-image-access
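As a rough sketch (not a drop-in, and it needs org-policy admin rights in your hierarchy), allowing the image project via the compute.trustedImageProjects constraint could look like:

cat > trusted-images-policy.yaml <<'EOF'
constraint: constraints/compute.trustedImageProjects
listPolicy:
  allowedValues:
    - projects/schedmd-slurm-public
  inheritFromParent: true
EOF

gcloud resource-manager org-policies set-policy trusted-images-policy.yaml --project <PROJECT_ID>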

If that doesn't work, I will have to get the project number from SchedMD and communicate it privately to you.

An alternative (which might also be denied by policy) would be to copy the image into your own project:

gcloud compute images create my-new-image --source-image-family=slurm-gcp-6-0-hpc-rocky-linux-8 --source-image-project=schedmd-slurm-public

Then specify:

instance_image:
  family: slurm-gcp-6-0-hpc-rocky-linux-8
  project: your-project

Keep in mind that the Slurm image is coupled to the module release version, so you would have to repeat this.
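If the copy is allowed, something like this should confirm the image and its family landed in your project (names are placeholders matching the command above):

gcloud compute images list --project your-project \
  --filter="family=slurm-gcp-6-0-hpc-rocky-linux-8" \
  --format="table(name,family,status)"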


sharif-cameco avatar sharif-cameco commented on August 28, 2024

Hi Tom

Where do I change the instance_image info?

modules/embedded/modules/compute/vm-instance/main.tf?

Everywhere in the code they use:

variable "instance_image" {
  description = "Instance Image"
  type = object({
    family  = string,
    project = string
  })
  default = {
    family  = "hpc-centos-7"
    project = "cloud-hpc-image-public"
  }
}

So should I change it everywhere in the modules and community folders?

Regards
Sharif


tpdownes avatar tpdownes commented on August 28, 2024

An option is to add the value to vars as shown here:

vars:
  ... (existing values)
  instance_image:
    family: slurm-gcp-6-0-hpc-rocky-linux-8
    project: your-project

Then it will automatically be applied to any module that has an instance_image input. Were you able to copy the image to your project?


sharif-cameco avatar sharif-cameco commented on August 28, 2024

Hi Tom

I created the image locally and was able to deploy with it. But there were a few deployment errors, even though the login and controller nodes were deployed.

Error: local-exec provisioner error

│ with module.slurm_controller.module.slurm_controller_instance.module.reconfigure_critical[0].null_resource.destroy_nodes_on_create[0],
│ on .terraform/modules/slurm_controller.slurm_controller_instance/terraform/slurm_cluster/modules/slurm_destroy_nodes/main.tf line 52, in resource "null_resource" "destroy_nodes_on_create":
│ 52: provisioner "local-exec" {

│ Error running command
│ '/builds/CamecoCorporation/cloud-operations/terraform-services/terraform-gcp-005-cloudops/nonproduction/.terraform/modules/slurm_controller.slurm_controller_instance/scripts/destroy_nodes.py

│ "--project_id=prj-n-005-cloudops-618d"


│ 'hpccameco'
│ ': exit status 127. Output: env: can't execute 'python3': No such file or
│ directory



│ Error: local-exec provisioner error

│ with module.slurm_controller.module.slurm_controller_instance.module.reconfigure_notify[0].null_resource.notify_cluster,
│ on .terraform/modules/slurm_controller.slurm_controller_instance/terraform/slurm_cluster/modules/slurm_notify_cluster/main.tf line 52, in resource "null_resource" "notify_cluster":
│ 52: provisioner "local-exec" {

│ Error running command
│ '/builds/CamecoCorporation/cloud-operations/terraform-services/terraform-gcp-005-cloudops/nonproduction/.terraform/modules/slurm_controller.slurm_controller_instance/scripts/notify_cluster.py
│ --type='reconfig' --project_id='prj-n-005-cloudops-618d'
│ 'hpccameco-slurm-events-fSUHfyb8'': exit status 127. Output: env: can't
│ execute 'python3': No such file or directory


sharif-cameco avatar sharif-cameco commented on August 28, 2024

When I SSHed to the controller VM, I got the following error message:
"*** Slurm setup failed! Please view log: /slurm/scripts/setup.log ***"

Here is the content of setup.log:

2023-07-31 18:35:24,036 INFO: Setting up controller
2023-07-31 18:35:24,039 DEBUG: get_metadata: metadata not found (http://metadata.google.internal/computeMetadata/v1/instance/attributes/slurm_bucket_path)
2023-07-31 18:35:24,039 ERROR: failed to get_metadata from http://metadata.google.internal/computeMetadata/v1/instance/attributes/slurm_bucket_path
Traceback (most recent call last):
  File "/slurm/scripts/util.py", line 696, in get_metadata
    resp.raise_for_status()
  File "/usr/local/lib/python3.6/site-packages/requests/models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://metadata.google.internal/computeMetadata/v1/instance/attributes/slurm_bucket_path

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/slurm/scripts/setup.py", line 874, in <module>
    main(args)
  File "/slurm/scripts/setup.py", line 850, in main
    setup(args)
  File "/slurm/scripts/setup.py", line 692, in setup_controller
    install_custom_scripts()
  File "/slurm/scripts/setup.py", line 158, in install_custom_scripts
    blobs = list(chain.from_iterable(blob_list(prefix=p) for p in prefixes))
  File "/slurm/scripts/setup.py", line 158, in <genexpr>
    blobs = list(chain.from_iterable(blob_list(prefix=p) for p in prefixes))
  File "/slurm/scripts/util.py", line 286, in blob_list
    uri = instance_metadata("attributes/slurm_bucket_path")
  File "/slurm/scripts/util.py", line 706, in instance_metadata
    return get_metadata(path, root=f"{ROOT_URL}/instance")
  File "/slurm/scripts/util.py", line 700, in get_metadata
    raise Exception(f"failed to get_metadata from {url}")
Exception: failed to get_metadata from http://metadata.google.internal/computeMetadata/v1/instance/attributes/slurm_bucket_path
2023-07-31 18:35:24,041 ERROR: Aborting setup...
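A quick way to confirm whether that metadata attribute exists on the VM is a standard GCE metadata-server query (a 404 here matches the error above):

curl -s -o /dev/null -w "%{http_code}\n" -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/attributes/slurm_bucket_path"

# List the attributes the instance actually has:
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/attributes/"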


rohitramu avatar rohitramu commented on August 28, 2024

Hi Sharif,

I'm assuming you got the 'python3': No such file or directory error on the machine where you ran terraform apply. Could you please double-check that python3 is installed and available on that machine (i.e. run python3 --version)?
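For example, a hedged snippet for the pipeline job (assuming a Debian/Ubuntu-based runner image; adjust the package manager to your base image):

# Install python3 if the runner image does not already have it.
if ! command -v python3 >/dev/null 2>&1; then
  apt-get update && apt-get install -y python3 python3-pip
fi
python3 --version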

I think one thing that will help us to debug faster is if we use HPC Toolkit blueprint (yaml) files rather than the generated Terraform directly - it will be easier for me to understand and reproduce your setup. It looks like you started off with the hpc-slurm blueprint file from the HPC Toolkit repo, and then needed to modify it to make it work in your environment. If I understood correctly, these are the changes you needed to make to the hpc-slurm.yaml blueprint file:

  • Use a copy of the slurm-gcp-6-0-hpc-rocky-linux-8 VM image which exists in your own project.
  • For the filestore, change the connect_mode to "PRIVATE_SERVICE_ACCESS".
  • Set disable_controller_public_ips and disable_login_public_ips to true (this is the default).
  • On the compute nodes, set disable_public_ips to true (this is the default).

I think it might look something like this (make sure to update the TODO: values and save it as hpc-slurm.yaml):

blueprint_name: hpc-slurm

vars:
  project_id:  ## TODO: Set GCP Project ID Here ##
  deployment_name: hpc-small
  region: us-west4 # TODO: Update region
  zone: us-west4-c # TODO: Update zone
  instance_image:
    family: slurm-gcp-6-0-hpc-rocky-linux-8
    project: schedmd-slurm-public ## TODO: Set GCP Project ID Here ##

# Documentation for each of the modules used below can be found at
# https://github.com/GoogleCloudPlatform/hpc-toolkit/blob/main/modules/README.md

deployment_groups:
- group: primary
  modules:
  # Source is an embedded resource, denoted by "resources/*" without ./, ../, /
  # as a prefix. To refer to a local resource, prefix with ./, ../ or /
  # Example - ./resources/network/vpc
  - id: network1
    source: modules/network/vpc

  - id: homefs
    source: modules/file-system/filestore
    use: [network1]
    settings:
      local_mount: /home
      connect_mode: "PRIVATE_SERVICE_ACCESS"

  - id: debug_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      node_count_dynamic_max: 4
      machine_type: n2-standard-2

  - id: debug_partition
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use:
    - network1
    - homefs
    - debug_node_group
    settings:
      partition_name: debug
      exclusive: false # allows nodes to stay up after jobs are done
      enable_placement: false # the default is: true
      is_default: true

  - id: compute_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      node_count_dynamic_max: 20

  - id: compute_partition
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use:
    - network1
    - homefs
    - compute_node_group
    settings:
      partition_name: compute

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
    use:
    - network1
    - debug_partition
    - compute_partition
    - homefs
    settings:

  - id: slurm_login
    source: community/modules/scheduler/schedmd-slurm-gcp-v5-login
    use:
    - network1
    - slurm_controller
    settings:
      machine_type: n2-standard-4

In your GitLab pipeline, you can deploy it like this (you can save this as a shell script and run it in the pipeline):

git clone https://github.com/GoogleCloudPlatform/hpc-toolkit.git
cd hpc-toolkit
git fetch --all --tags
git checkout tags/v1.21.0 -b working_branch
make ghpc
cd ..
cp hpc-toolkit/ghpc .
./ghpc create hpc-slurm.yaml -w
./ghpc deploy hpc-small/ --auto-approve

Let me know how it goes!

Kind regards,
Rohit


sharif-cameco avatar sharif-cameco commented on August 28, 2024

Hi Rohit

I used the blueprint attached herewith. Please check and let me know if there is anything wrong.

Regards
Sharif
hpc-enterprise-slurm.yaml.txt


rohitramu avatar rohitramu commented on August 28, 2024

Thanks for sending that, Sharif. Is that the blueprint which gave you the can't execute 'python3': No such file or directory error?

Were you able to confirm that python3 is installed and available on the machine in your GitLab pipeline?


sharif-cameco avatar sharif-cameco commented on August 28, 2024

Hi

I installed Python and the required packages in the script section of gitlab-ci.yaml, and used the blueprint I sent you earlier.
Everything is OK with the deployment now.

But I am having an issue with the cluster configuration. Please see the message below.

*** Slurm setup failed! Please view log: /slurm/scripts/setup.log ***

Creating directory '/home/issharif_c_cameco_com'.
[issharif_c_cameco_com@hpccameco-controller ~]$ srun -N 3 hostname
srun: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
srun: error: fetch_config: DNS SRV lookup failed
srun: error: _establish_config_source: failed to fetch config
srun: fatal: Could not establish a configuration source
[issharif_c_cameco_com@hpccameco-controller ~]$
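For context, in configless mode srun/slurmd discover the controller via a DNS SRV record (_slurmctld._tcp in the node's search domain), so a hedged check from the node would be something like (requires dig from bind-utils/dnsutils):

dig +short "_slurmctld._tcp.$(hostname -d)" SRV

# Compare with what the OS resolver is actually configured to use:
cat /etc/resolv.conf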


sharif-cameco avatar sharif-cameco commented on August 28, 2024

Hi Rohit

Also note that we don't use GCP DNS but our own instead, so the GCP-provided DNS names will not work here. We need to use the controller's IP in the config file.

Regards
Sharif


sharif-cameco avatar sharif-cameco commented on August 28, 2024

[issharif_c_cameco_com@hpccameco-controller ~]$ sudo systemctl enable slurmd
Created symlink /etc/systemd/system/multi-user.target.wants/slurmd.service → /usr/lib/systemd/system/slurmd.service.
[issharif_c_cameco_com@hpccameco-controller ~]$ sudo systemctl start slurmd
[issharif_c_cameco_com@hpccameco-controller ~]$ sudo systemctl status slurmd
● slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Wed 2023-08-02 18:07:59 UTC; 4s ago
Process: 1591 ExecStart=/usr/local/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
Main PID: 1591 (code=exited, status=1/FAILURE)

Aug 02 18:07:59 hpccameco-controller slurmd[1591]: slurmd: error: resolve_ctls_from_dns_srv: res_nsearch error>
Aug 02 18:07:59 hpccameco-controller slurmd[1591]: slurmd: error: fetch_config: DNS SRV lookup failed
Aug 02 18:07:59 hpccameco-controller slurmd[1591]: slurmd: error: _establish_configuration: failed to load con>
Aug 02 18:07:59 hpccameco-controller slurmd[1591]: slurmd: error: slurmd initialization failed
Aug 02 18:07:59 hpccameco-controller slurmd[1591]: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknow>
Aug 02 18:07:59 hpccameco-controller slurmd[1591]: error: fetch_config: DNS SRV lookup failed
Aug 02 18:07:59 hpccameco-controller systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FA>
Aug 02 18:07:59 hpccameco-controller slurmd[1591]: error: _establish_configuration: failed to load configs
Aug 02 18:07:59 hpccameco-controller systemd[1]: slurmd.service: Failed with result 'exit-code'.
Aug 02 18:07:59 hpccameco-controller slurmd[1591]: error: slurmd initialization failed
lines 1-16/16 (END)


sharif-cameco avatar sharif-cameco commented on August 28, 2024

[issharif_c_cameco_com@hpccameco-controller etc]$ sudo systemctl status munge
● munge.service - MUNGE authentication service
Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Wed 2023-08-02 17:49:45 UTC; 26min ago
Docs: man:munged(8)

Aug 02 17:49:45 slurm-gcp-dev-hpc-rocky-linux-8-1689802763 systemd[1]: Starting MUNGE authentication service...
Aug 02 17:49:45 slurm-gcp-dev-hpc-rocky-linux-8-1689802763 munged[768]: munged: Error: Failed to check keyfile>
Aug 02 17:49:45 slurm-gcp-dev-hpc-rocky-linux-8-1689802763 systemd[1]: munge.service: Control process exited, >
Aug 02 17:49:45 slurm-gcp-dev-hpc-rocky-linux-8-1689802763 systemd[1]: munge.service: Failed with result 'exit>
Aug 02 17:49:45 slurm-gcp-dev-hpc-rocky-linux-8-1689802763 systemd[1]: Failed to start MUNGE authentication se>
lines 1-10/10 (END)


sharif-cameco avatar sharif-cameco commented on August 28, 2024

Hi Rohit

The munge key file /etc/munge/munge.key is unavailable and I have attached the output of journalctl -xe.
journalctl.txt

There is a certificate error in that output, and I see the same error in the GCP logs. I have attached the GCP log as well.

gcp-log-certificate-error.txt

Regards
Sharif


cboneti avatar cboneti commented on August 28, 2024

Hi Sharif,

I have not seen this problem before.
I am hoping that someone from GCP will reach out to you so we can plan a session together.

Thanks,

Carlos


cboneti avatar cboneti commented on August 28, 2024

Hi,

Another idea in the meantime: have you tried deploying the simpler hpc-slurm.yaml blueprint? That would rule out problems that come from the service account setup.


sharif-cameco avatar sharif-cameco commented on August 28, 2024

Hi Carlos

I redeployed the HPC cluster with the new image.
There is no startup error message and the MUNGE service is also running, but slurmd failed.

[issharif_c_cameco_com@hpccameco-controller ~]$ sudo systemctl status slurmd -l
● slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Wed 2023-08-09 22:11:24 UTC; 3min 52s ago
Main PID: 3136 (code=exited, status=1/FAILURE)

Aug 09 22:11:24 hpccameco-controller systemd[1]: Started Slurm node daemon.
Aug 09 22:11:24 hpccameco-controller slurmd[3136]: slurmd: fatal: Unable to determine this slurmd's NodeName
Aug 09 22:11:24 hpccameco-controller systemd[1]: slurmd.service: main process exited, code=exited, status=1/FAILURE
Aug 09 22:11:24 hpccameco-controller systemd[1]: Unit slurmd.service entered failed state.
Aug 09 22:11:24 hpccameco-controller systemd[1]: slurmd.service failed.
[issharif_c_cameco_com@hpccameco-controller ~]$

Regards
Sharif


cboneti avatar cboneti commented on August 28, 2024

Hi Sharif,

I will close this issue as we have moved past the initial problems you had.
Please open another issue if the new setup doesn't work for you.

Thanks!
Carlos

