License: Apache License 2.0

hpc-toolkit's Introduction

Google HPC-Toolkit

Description

HPC Toolkit is open-source software offered by Google Cloud that makes it easy for customers to deploy HPC environments on Google Cloud.

HPC Toolkit allows customers to deploy turnkey HPC environments (compute, networking, storage, etc.) following Google Cloud best practices, in a repeatable manner. The HPC Toolkit is designed to be highly customizable and extensible, and it aims to address the HPC deployment needs of a broad range of customers.

Detailed documentation and examples

The Toolkit comes with tutorials, examples, and full documentation for a suite of modules that have been designed for HPC use cases. More information can be found in the Google Cloud Docs.

Quickstart

Running through the quickstart tutorial is the recommended path to get started with the HPC Toolkit.


If a self-directed path is preferred, you can use the following commands to build the ghpc binary:

git clone https://github.com/GoogleCloudPlatform/hpc-toolkit
cd hpc-toolkit
make
./ghpc --version
./ghpc --help

NOTE: You may need to install dependencies first.

HPC Toolkit Components

Learn about the components that make up the HPC Toolkit and more about how it works in the Google Cloud Docs Product Overview.

GCP Credentials

Supplying cloud credentials to Terraform

Terraform can discover credentials for authenticating to Google Cloud Platform in several ways. Below we summarize Terraform's documentation for using gcloud from your workstation and for automatically finding credentials in cloud environments. We do not recommend following HashiCorp's instructions for downloading service account keys.

Cloud credentials on your workstation

You can generate cloud credentials associated with your Google Cloud account using the following command:

gcloud auth application-default login

You will be prompted to open your web browser, authenticate to Google Cloud, and make your account accessible from the command line. Once this command completes, Terraform will automatically use your "Application Default Credentials."

If you receive failure messages containing "quota project", you should change the quota project associated with your Application Default Credentials with the following command, providing your current project ID as the argument:

gcloud auth application-default set-quota-project ${PROJECT_ID}

Cloud credentials in virtualized cloud environments

In virtualized settings, cloud credentials can be attached directly to the execution environment; for example, a VM or a container can have a service account attached to it. Google Cloud Shell is an interactive command-line environment that inherits the credentials of the user logged in to the Google Cloud Console.

Many of the above examples are easily executed within a Cloud Shell environment. Be aware that Cloud Shell has several limitations, in particular an inactivity timeout that will close running shells after 20 minutes. Please consider it only for blueprints that are quickly deployed.

VM Image Support

Standard Images

The HPC Toolkit officially supports the following VM images:

  • HPC CentOS 7
  • HPC Rocky Linux 8
  • Debian 11
  • Ubuntu 20.04 LTS

For more information on these and other images, see docs/vm-images.md.

Slurm Images

Warning: Slurm Terraform modules cannot be directly used on the standard OS images. They must be used in combination with images built for the versioned release of the Terraform module.

The HPC Toolkit provides modules and examples for implementing pre-built and custom Slurm VM images; see Slurm on GCP.

Blueprint Validation

The Toolkit contains "validator" functions that perform basic tests of the blueprint to ensure that deployment variables are valid and that the HPC environment can be provisioned in your Google Cloud project. Further information can be found in dedicated documentation.

Enable GCP APIs

In a new GCP project there are several APIs that must be enabled before you can deploy your HPC cluster. Missing APIs will be caught when you perform terraform apply, but you can save time by enabling them upfront.

See Google Cloud Docs for instructions.
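As a sketch of what that looks like from the command line, commonly required services can be enabled with gcloud services enable; the exact set depends on the modules in your blueprint, so treat the service names below as illustrative.

# Illustrative list of services; add or remove entries to match your blueprint's modules.
gcloud services enable \
  compute.googleapis.com \
  file.googleapis.com \
  storage.googleapis.com \
  serviceusage.googleapis.com \
  --project <PROJECT_ID>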

GCP Quotas

You may need to request additional quota to be able to deploy and use your HPC cluster.

See Google Cloud Docs for more information.
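As a quick command-line check (a sketch; substitute your own region and project), regional quota limits and current usage are listed under the quotas field in the output of gcloud compute regions describe:

# Inspect regional quotas (CPUs, disks, IP addresses, ...) for the region you plan to deploy into.
gcloud compute regions describe us-central1 --project <PROJECT_ID>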

Billing Reports

You can view the billing reports for your HPC cluster on the Cloud Billing Reports page. To view the Cloud Billing reports for your Cloud Billing account, including the cost information for all of the Cloud projects that are linked to the account, you need a role that includes the billing.accounts.getSpendingInformation permission on your Cloud Billing account.

To view the Cloud Billing reports for your Cloud Billing account:

  1. In the Google Cloud Console, go to Navigation Menu > Billing.
  2. At the prompt, choose the Cloud Billing account for which you'd like to view reports. The Billing Overview page opens for the selected billing account.
  3. In the Billing navigation menu, select Reports.

On the right side, expand the Filters view and then filter by label, specifying the key ghpc_deployment (or ghpc_blueprint) and the desired value.

Troubleshooting

Authentication

Confirm that you have properly set up your Google Cloud credentials.
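A minimal sketch of such a check from the command line: list the active account and confirm that Application Default Credentials can be issued.

# Show the active gcloud account and the configured project.
gcloud auth list
gcloud config get-value project
# Confirm Application Default Credentials work (prints an access token on success).
gcloud auth application-default print-access-token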

Slurm Clusters

Please see the dedicated troubleshooting guide for Slurm.

Terraform Deployment

When terraform apply fails, Terraform generally provides a useful error message. Here are some common reasons for the deployment to fail:

  • GCP Access: The credentials being used to call terraform apply do not have access to the GCP project. This can be fixed by granting access in IAM & Admin (a sketched gcloud example follows this list).
  • Disabled APIs: The GCP project must have the proper APIs enabled. See Enable GCP APIs.
  • Insufficient Quota: The GCP project does not have enough quota to provision the requested resources. See GCP Quotas.
  • Filestore resource limit: When regularly deploying Filestore instances with a new VPC you may see an error during deployment such as: System limit for internal resources has been reached. See this doc for the solution.
  • Required permission not found:
    • Example: Required 'compute.projects.get' permission for 'projects/... forbidden
    • Credentials may not be set, or are not set correctly. Please follow instructions at Cloud credentials on your workstation.
    • Ensure proper permissions are set in the cloud console IAM section.
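For the GCP Access case above, the following is a sketch of granting project access from the command line; the member and role are placeholders, and you should choose the least-privileged role that fits your situation rather than copying this one.

# Hypothetical example: grant a user a role on the project.
gcloud projects add-iam-policy-binding <PROJECT_ID> \
  --member="user:<USER_EMAIL>" \
  --role="roles/editor"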

Failure to Destroy VPC Network

If terraform destroy fails with an error such as the following:

│ Error: Error when reading or editing Subnetwork: googleapi: Error 400: The subnetwork resource 'projects/<project_name>/regions/<region>/subnetworks/<subnetwork_name>' is already being used by 'projects/<project_name>/zones/<zone>/instances/<instance_name>', resourceInUseByAnotherResource

or

│ Error: Error waiting for Deleting Network: The network resource 'projects/<project_name>/global/networks/<vpc_network_name>' is already being used by 'projects/<project_name>/global/firewalls/<firewall_rule_name>'

These errors indicate that the VPC network cannot be destroyed because resources were added outside of Terraform and that those resources depend upon the network. These resources should be deleted manually. The first message indicates that a new VM has been added to a subnetwork within the VPC network. The second message indicates that a new firewall rule has been added to the VPC network. If your error message does not look like these, examine it carefully to identify the type of resource to delete and its unique name. In the two messages above, the resource names appear toward the end of the error message. The following links will take you directly to the areas within the Cloud Console for managing VMs and Firewall rules. Make certain that your project ID is selected in the drop-down menu at the top-left.
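If you prefer the command line over the Cloud Console, the sketch below shows one way to locate and remove the offending resources before re-running terraform destroy; resource names and the filter expression are placeholders.

# List firewall rules and VM instances that may still reference the network.
gcloud compute firewall-rules list --filter="network:<vpc_network_name>"
gcloud compute instances list
# Delete the offending resources, then re-run terraform destroy.
gcloud compute firewall-rules delete <firewall_rule_name>
gcloud compute instances delete <instance_name> --zone <zone>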

Inspecting the Deployment

The deployment will be created with the following directory structure:

<<OUTPUT_PATH>>/<<DEPLOYMENT_NAME>>/{<<DEPLOYMENT_GROUPS>>}/

If an output directory is provided with the --output/-o flag, the deployment directory will be created in the output directory, represented as <<OUTPUT_PATH>> here. If not provided, <<OUTPUT_PATH>> will default to the current working directory.

The deployment directory is created in <<OUTPUT_PATH>> as a directory matching the provided deployment_name deployment variable (vars) in the blueprint.

Within the deployment directory are directories representing each deployment group in the blueprint, each named after the group field of the corresponding element in deployment_groups.

Each deployment group directory contains all of the configuration scripts and modules needed to deploy that group. The modules live in a subdirectory named modules, each in a directory named after its source module; for example, the vpc module is in a directory named vpc.

A hidden directory containing meta information and backups is also created and named .ghpc.

From the hpc-slurm.yaml example, we get the following deployment directory:

hpc-slurm/
  primary/
    main.tf
    modules/
    providers.tf
    terraform.tfvars
    variables.tf
    versions.tf
  .ghpc/

Dependencies

See Cloud Docs on Installing Dependencies.

Notes on Packer

The Toolkit supports Packer templates in the contemporary HCL2 file format and not in the legacy JSON file format. We require the use of Packer 1.7.9 or above, and recommend using the latest release.

The Toolkit's Packer template module documentation describes input variables and their behavior. An image-building example and usage instructions are provided. The example integrates Packer, Terraform and startup-script runners to demonstrate the power of customizing images using the same scripts that can be applied at boot-time.
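For orientation, once ghpc has written a deployment group containing a Packer module, image building follows the standard Packer HCL2 workflow; the sketch below uses placeholder directory names.

# Placeholders: substitute your deployment directory, Packer group, and module names.
cd <<OUTPUT_PATH>>/<<DEPLOYMENT_NAME>>/<packer_group>/<custom_image_module>
packer init .
packer validate .
packer build .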

Development

The following setup is in addition to the dependencies needed to build and run HPC-Toolkit.

Please use the pre-commit hooks configured in this repository to ensure that all changes are validated, tested, and properly documented before pushing code changes. The pre-commit hooks configured in the HPC Toolkit have a set of dependencies that must be installed before the hooks will pass.

Follow these steps to install and set up pre-commit in your cloned repository:

  1. Install pre-commit using the instructions from the pre-commit website.

  2. Install TFLint using the instructions from the TFLint documentation.

    NOTE: The version of TFLint must be compatible with the Google plugin version identified in tflint.hcl. Versions of the plugin >=0.20.0 should use tflint>=0.40.0. These versions are readily available via GitHub or package managers. Please review the TFLint Ruleset for Google Release Notes for up-to-date requirements.

  3. Install ShellCheck using the instructions from the ShellCheck documentation.

  4. The other dev dependencies can be installed by running the following command in the project root directory:

    make install-dev-deps

  5. Pre-commit is enabled on a repo-by-repo basis by running the following command in the project root directory:

    pre-commit install

Now pre-commit is configured to automatically run before you commit.
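Optionally, you can run all of the configured hooks once against the entire repository to verify your setup:

# Run every configured pre-commit hook against all files in the repository.
pre-commit run --all-files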

Development on macOS

While macOS is a supported environment for building and executing the Toolkit, it is not supported for Toolkit development due to GNU-specific shell scripts.

If you are developing on a Mac, a workaround is to install GNU tooling such as coreutils and findutils from a package manager like Homebrew or Conda.
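For example, with Homebrew the GNU tooling mentioned above can be installed as follows (Conda users would install the equivalent packages):

# Install GNU coreutils and findutils via Homebrew.
brew install coreutils findutils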

Contributing

Please refer to the contributing file in our GitHub repository, or to Google’s Open Source documentation.


hpc-toolkit's Issues

Unable to destroy a cluster created using examples/hpc-cluster-small.yaml

Describe the bug

When destroying a cluster created using examples/hpc-cluster-small.yaml

terraform -chdir=<dir> destroy

everything works except that the network cannot be destroyed; Terraform complains that a firewall rule is still using the network.

module.network1.module.vpc.module.vpc.google_compute_network.network: Destroying... [id=projects/climate-modeling-285614/global/networks/lopez-net]
module.network1.module.vpc.module.vpc.google_compute_network.network: Still destroying... [id=projects/climate-modeling-285614/global/networks/lopez-net, 10s elapsed]
╷
│ Error: Error waiting for Deleting Network: The network resource 'projects/climate-modeling-285614/global/networks/lopez-net' is already being used by 'projects/climate-modeling-285614/global/firewalls/lopez-net-iyvrmjzsszuzmmbd5kdofnu3-4'

Steps to reproduce

Steps to reproduce the behavior:

  1. Create a cluster
  2. Destroy the cluster

Expected behavior

Everything in the cluster should be destroyed and the destroy command should return without an error.

Actual behavior

As described above, I get an error that the firewall is still using the network. Deleting the network manually from the console works.

Version (ghpc --version)

[main] hpc-toolkit: ./ghpc --version
ghpc version v1.0.0

Blueprint

If applicable, attach or paste the blueprint YAML used to produce the bug.

blueprint_name: hpc-cluster-small

vars:
  project_id: climate-modeling-285614
  deployment_name: lopez
  region: us-central1
  zone: us-central1-c

deployment_groups:
- group: primary
  modules:
  # Source is an embedded module, denoted by "modules/*" without ./, ../, /
  # as a prefix. To refer to a local or community module, prefix with ./, ../ or /
  # Example - ./modules/network/vpc
  - source: modules/network/vpc
    kind: terraform
    id: network1

  - source: modules/file-system/filestore
    kind: terraform
    id: homefs
    use: [network1]
    settings:
      local_mount: /home

  - source: modules/compute/vm-instance
    kind: terraform
    id: compute
    use:
    - network1
    - homefs
    settings:
      instance_count: 8
      name_prefix: lopez-compute
      machine_type: c2-standard-60

Expanded Blueprint

If applicable, please attach or paste the expanded blueprint. The expanded blueprint can be obtained by running ghpc expand your-blueprint.yaml.

Disregard if the bug occurs when running ghpc expand ... as well.

blueprint_name: hpc-cluster-small
validators:
- validator: test_project_exists
  inputs:
    project_id: ((var.project_id))
- validator: test_region_exists
  inputs:
    project_id: ((var.project_id))
    region: ((var.region))
- validator: test_zone_exists
  inputs:
    project_id: ((var.project_id))
    zone: ((var.zone))
- validator: test_zone_in_region
  inputs:
    project_id: ((var.project_id))
    region: ((var.region))
    zone: ((var.zone))
validation_level: 1
vars:
  deployment_name: lopez
  labels:
    ghpc_blueprint: hpc-cluster-small
    ghpc_deployment: lopez
  project_id: climate-modeling-285614
  region: us-central1
  zone: us-central1-c
deployment_groups:
- group: primary
  terraform_backend:
    type: ""
    configuration: {}
  modules:
  - source: modules/network/vpc
    kind: terraform
    id: network1
    modulename: ""
    use: []
    wrapsettingswith: {}
    settings:
      deployment_name: ((var.deployment_name))
      project_id: ((var.project_id))
      region: ((var.region))
  - source: modules/file-system/filestore
    kind: terraform
    id: homefs
    modulename: ""
    use:
    - network1
    wrapsettingswith: {}
    settings:
      deployment_name: ((var.deployment_name))
      labels:
        ghpc_role: file-system
      local_mount: /home
      network_name: ((module.network1.network_name))
      project_id: ((var.project_id))
      region: ((var.region))
      zone: ((var.zone))
  - source: modules/compute/vm-instance
    kind: terraform
    id: compute
    modulename: ""
    use:
    - network1
    - homefs
    wrapsettingswith:
      network_storage:
      - flatten(
      - )
    settings:
      deployment_name: ((var.deployment_name))
      instance_count: 8
      labels:
        ghpc_role: compute
      machine_type: c2-standard-60
      name_prefix: lopez-compute
      network_self_link: ((module.network1.network_self_link))
      network_storage:
      - ((module.homefs.network_storage))
      project_id: ((var.project_id))
      subnetwork_self_link: ((module.network1.subnetwork_self_link))
      zone: ((var.zone))
  kind: terraform
terraform_backend_defaults:
  type: ""
  configuration: {}

Execution environment

  • OS: macOS
  • Shell: bash
  • go version: go version go1.18.3 darwin/arm64


Adding partition causes the entire cluster to fail due to failures in `/slurm/scripts/setup.py`

Adding a partition kills existing nodes/renders them inoperable. This issue is specific to using ghpc with static nodes (pre-allocated fixed machines with long-lived jobs) rather than autoscaling nodes.

The repro is below, but what's happening is that when I add a new partition, /slurm/scripts/setup.py is executed again on the existing nodes, and it fails because it performs actions such as mounting /home that must only be done once.

In turn, munged appears to end up in a bad state, as seen from the logs below, so the whole cluster ends up inoperable. (Good pointer from @nick-stroud to look at the munged logs.)

@cboneti

Here's a simple test case: create and launch a tiny cluster with a couple of CPU nodes using this blueprint (yaml).

Now, I create a new partition (new yaml) and reconfigure.

This kills the existing jobs running on the previous "tiny" partition, and all nodes go into "IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING".

Munged logs

(base) yaroslav_contextual_ai@dcluster1-tiny-ghpc-0:/var/log$ sudo cat munge/munged.log
2023-07-06 00:35:08 +0000 Notice:    Running on "dcluster1-tiny-ghpc-0.c.contextual-research-common.internal" (10.0.0.95)
2023-07-06 00:35:08 +0000 Info:      PRNG seeded with 1024 bytes from "/dev/urandom"
2023-07-06 00:35:08 +0000 Error:     Failed to check keyfile "/etc/munge/munge.key": No such file or directory
2023-07-06 00:35:26 +0000 Notice:    Running on "dcluster1-tiny-ghpc-0.c.contextual-research-common.internal" (10.0.0.95)
2023-07-06 00:35:26 +0000 Info:      PRNG seeded with 1024 bytes from "/dev/urandom"
2023-07-06 00:35:26 +0000 Info:      Updating supplementary group mapping every 3600 seconds
2023-07-06 00:35:26 +0000 Info:      Enabled supplementary group mtime check of "/etc/group"
2023-07-06 00:35:26 +0000 Notice:    Starting munge-0.5.11 daemon (pid 1930)
2023-07-06 00:35:26 +0000 Info:      Created 10 work threads
2023-07-06 00:35:26 +0000 Info:      Found 1 user with supplementary groups in 0.004 seconds
2023-07-06 00:35:27 +0000 Info:      Unauthorized credential for client UID=0 GID=0
2023-07-06 00:35:27 +0000 Info:      Unauthorized credential for client UID=0 GID=0
2023-07-06 00:40:46 +0000 Info:      Unauthorized credential for client UID=0 GID=0
2023-07-06 00:41:09 +0000 Info:      Unauthorized credential for client UID=0 GID=0
2023-07-07 03:49:53 +0000 Notice:    Exiting on signal=15
2023-07-07 03:49:53 +0000 Info:      Wrote 1024 bytes to PRNG seed "/var/lib/munge/munge.seed"
2023-07-07 03:49:53 +0000 Notice:    Stopping munge-0.5.11 daemon (pid 1930)
2023-07-07 03:50:10 +0000 Notice:    Running on "dcluster1-tiny-ghpc-0.c.contextual-research-common.internal" (10.0.0.95)
2023-07-07 03:50:10 +0000 Info:      PRNG seeded with 1024 bytes from "/var/lib/munge/munge.seed"
2023-07-07 03:50:10 +0000 Info:      Updating supplementary group mapping every 3600 seconds
2023-07-07 03:50:10 +0000 Info:      Enabled supplementary group mtime check of "/etc/group"
2023-07-07 03:50:10 +0000 Notice:    Starting munge-0.5.11 daemon (pid 1194)
2023-07-07 03:50:10 +0000 Info:      Created 10 work threads
2023-07-07 03:50:10 +0000 Info:      Found 1 user with supplementary groups in 0.005 seconds
2023-07-07 03:50:12 +0000 Info:      Unauthorized credential for client UID=0 GID=0
2023-07-07 03:50:14 +0000 Info:      Unauthorized credential for client UID=0 GID=0
2023-07-07 03:55:49 +0000 Notice:    Exiting on signal=15
2023-07-07 03:55:49 +0000 Info:      Wrote 1024 bytes to PRNG seed "/var/lib/munge/munge.seed"
2023-07-07 03:55:49 +0000 Notice:    Stopping munge-0.5.11 daemon (pid 1194)
2023-07-07 03:56:06 +0000 Notice:    Running on "dcluster1-tiny-ghpc-0.c.contextual-research-common.internal" (10.0.0.95)
2023-07-07 03:56:06 +0000 Info:      PRNG seeded with 1024 bytes from "/var/lib/munge/munge.seed"
2023-07-07 03:56:06 +0000 Info:      Updating supplementary group mapping every 3600 seconds
2023-07-07 03:56:06 +0000 Info:      Enabled supplementary group mtime check of "/etc/group"
2023-07-07 03:56:06 +0000 Notice:    Starting munge-0.5.11 daemon (pid 1166)
2023-07-07 03:56:06 +0000 Info:      Created 10 work threads
2023-07-07 03:56:06 +0000 Info:      Found 1 user with supplementary groups in 0.013 seconds
2023-07-07 03:56:07 +0000 Info:      Unauthorized credential for client UID=0 GID=0
2023-07-07 03:56:09 +0000 Info:      Unauthorized credential for client UID=0 GID=0
2023-07-08 03:16:35 +0000 Info:      Unauthorized credential for client UID=0 GID=0
2023-07-08 03:25:57 +0000 Info:      Unauthorized credential for client UID=0 GID=0
2023-07-08 03:25:57 +0000 Info:      Unauthorized credential for client UID=0 GID=0
2023-07-08 03:51:48 +0000 Info:      Unauthorized credential for client UID=0 GID=0
2023-07-08 03:51:48 +0000 Info:      Unauthorized credential for client UID=0 GID=0
2023-07-08 04:02:17 +0000 Info:      Unauthorized credential for client UID=0 GID=0
2023-07-08 04:02:17 +0000 Info:      Unauthorized credential for client UID=0 GID=0

/var/log/messages

on one of the failing nodes in /var/log/messages I see

Starting google-cloud-metrics-agent...#011{"Version": "latest", "NumCPU": 1}                        [747/1977]
Jul  7 03:56:23 dcluster1-tiny-ghpc-0 otelopscol: 2023-07-07T03:56:23.758Z#011info#011extensions/extensions.go:
41#011Starting extensions...
Jul  7 03:56:23 dcluster1-tiny-ghpc-0 otelopscol: 2023-07-07T03:56:23.759Z#011info#011internal/resourcedetection.go:136#011began detecting resource information
Jul  7 03:56:23 dcluster1-tiny-ghpc-0 otelopscol: 2023-07-07T03:56:23.835Z#011info#011internal/resourcedetectio
n.go:150#011detected resource information#011{"resource": {"cloud.account.id":"contextual-research-common","clo
ud.availability_zone":"asia-southeast1-c","cloud.platform":"gcp_compute_engine","cloud.provider":"gcp","cloud.r
egion":"asia-southeast1","host.id":"6204437015061836493","host.name":"dcluster1-tiny-ghpc-0","host.type":"proje
cts/1048622315445/machineTypes/n2-standard-2"}}Jul  7 03:56:23 dcluster1-tiny-ghpc-0 otelopscol: 2023-07-07T03:56:23.859Z#011info#[email protected].
0/metrics_receiver.go:255#011Starting discovery manager
Jul  7 03:56:23 dcluster1-tiny-ghpc-0 otelopscol: 2023-07-07T03:56:23.868Z#011info#[email protected].
0/metrics_receiver.go:243#011Scrape job added#011{"jobName": "logging-collector"}
Jul  7 03:56:23 dcluster1-tiny-ghpc-0 otelopscol: 2023-07-07T03:56:23.869Z#011info#[email protected].
0/metrics_receiver.go:243#011Scrape job added#011{"jobName": "otel-collector"}
Jul  7 03:56:23 dcluster1-tiny-ghpc-0 otelopscol: 2023-07-07T03:56:23.869Z#011info#011service/service.go:145#01
1Everything is ready. Begin running and processing data.
Jul  7 03:56:23 dcluster1-tiny-ghpc-0 otelopscol: 2023-07-07T03:56:23.869Z#011info#[email protected].
0/metrics_receiver.go:289#011Starting scrape manager
Jul  7 03:56:23 dcluster1-tiny-ghpc-0 otelopscol: 2023-07-07T03:56:23.869Z#011info#[email protected].
0/metrics_receiver.go:255#011Starting discovery manager
Jul  7 03:56:23 dcluster1-tiny-ghpc-0 otelopscol: 2023-07-07T03:56:23.869Z#011info#[email protected].
0/metrics_receiver.go:289#011Starting scrape managerJul  7 03:56:24 dcluster1-tiny-ghpc-0 systemd: Created slice User Slice of yaroslav_contextual_ai.
Jul  7 03:56:24 dcluster1-tiny-ghpc-0 systemd: Started Session 1 of user yaroslav_contextual_ai.
Jul  7 03:56:24 dcluster1-tiny-ghpc-0 systemd-logind: New session 1 of user yaroslav_contextual_ai.
Jul  7 03:56:24 dcluster1-tiny-ghpc-0 slurmeventd.py: INFO: Listening for messages on 'projects/contextual-rese
arch-common/subscriptions/dcluster1-tiny-ghpc-0'...
Jul  7 03:56:26 dcluster1-tiny-ghpc-0 dkms: Deprecated feature: REMAKE_INITRD (/var/lib/dkms/lustre-client/2.12
.9/source/dkms.conf)Jul  7 03:56:27 dcluster1-tiny-ghpc-0 dkms: nvidia.ko.xz:
Jul  7 03:56:27 dcluster1-tiny-ghpc-0 dkms: Running module version sanity check.
Jul  7 03:56:27 dcluster1-tiny-ghpc-0 dkms: Module version 530.30.02 for nvidia.ko.xz
Jul  7 03:56:27 dcluster1-tiny-ghpc-0 dkms: exactly matches what is already found in kernel 3.10.0-1160.90.1.el
7.x86_64.Jul  7 03:56:27 dcluster1-tiny-ghpc-0 dkms: DKMS will not replace this module.
Jul  7 03:56:27 dcluster1-tiny-ghpc-0 dkms: You may override by specifying --force.
Jul  7 03:56:27 dcluster1-tiny-ghpc-0 dkms: nvidia-uvm.ko.xz:
Jul  7 03:56:27 dcluster1-tiny-ghpc-0 dkms: Running module version sanity check.
Jul  7 03:56:28 dcluster1-tiny-ghpc-0 dkms: Module version 530.30.02 for nvidia-uvm.ko.xz
Jul  7 03:56:28 dcluster1-tiny-ghpc-0 dkms: exactly matches what is already found in kernel 3.10.0-1160.90.1.el
7.x86_64.
Jul  7 03:56:28 dcluster1-tiny-ghpc-0 dkms: DKMS will not replace this module.
Jul  7 03:56:28 dcluster1-tiny-ghpc-0 dkms: You may override by specifying --force.
Jul  7 03:56:28 dcluster1-tiny-ghpc-0 dkms: nvidia-modeset.ko.xz:
Jul  7 03:56:28 dcluster1-tiny-ghpc-0 dkms: Running module version sanity check.
Jul  7 03:56:28 dcluster1-tiny-ghpc-0 dkms: Module version 530.30.02 for nvidia-modeset.ko.xz
Jul  7 03:56:28 dcluster1-tiny-ghpc-0 dkms: exactly matches what is already found in kernel 3.10.0-1160.90.1.el
7.x86_64.
Jul  7 03:56:28 dcluster1-tiny-ghpc-0 dkms: DKMS will not replace this module.Jul  7 03:56:28 dcluster1-tiny-ghpc-0 dkms: You may override by specifying --force.
Jul  7 03:56:28 dcluster1-tiny-ghpc-0 dkms: nvidia-peermem.ko.xz:
Jul  7 03:56:28 dcluster1-tiny-ghpc-0 dkms: Running module version sanity check.
Jul  7 03:56:28 dcluster1-tiny-ghpc-0 dkms: Module version 530.30.02 for nvidia-peermem.ko.xzJul  7 03:56:28 dcluster1-tiny-ghpc-0 dkms: exactly matches what is already found in kernel 3.10.0-1160.90.1.el
7.x86_64.
Jul  7 03:56:28 dcluster1-tiny-ghpc-0 dkms: DKMS will not replace this module.
Jul  7 03:56:28 dcluster1-tiny-ghpc-0 dkms: You may override by specifying --force.
Jul  7 03:56:28 dcluster1-tiny-ghpc-0 dkms: Error! Installation aborted.
Jul  7 03:56:28 dcluster1-tiny-ghpc-0 dkms: dkms autoinstall on 3.10.0-1160.90.1.el7.x86_64/x86_64 succeeded fo
r gve lustre-client
Jul  7 03:56:28 dcluster1-tiny-ghpc-0 dkms: dkms autoinstall on 3.10.0-1160.90.1.el7.x86_64/x86_64 failed for n
vidia(6)
Jul  7 03:56:28 dcluster1-tiny-ghpc-0 dkms: Error! One or more modules failed to install during autoinstall.
Jul  7 03:56:28 dcluster1-tiny-ghpc-0 dkms: Refer to previous errors for more information.
Jul  7 03:56:28 dcluster1-tiny-ghpc-0 systemd: dkms.service: main process exited, code=exited, status=11/n/aJul  7 03:56:28 dcluster1-tiny-ghpc-0 systemd: Failed to start Builds and install new kernel modules through DK
MS.
Jul  7 03:56:28 dcluster1-tiny-ghpc-0 systemd: Unit dkms.service entered failed state.
Jul  7 03:56:28 dcluster1-tiny-ghpc-0 systemd: dkms.service failed.Jul  7 03:56:28 dcluster1-tiny-ghpc-0 systemd: Reached target Multi-User System.
Jul  7 03:56:28 dcluster1-tiny-ghpc-0 systemd: Starting Update UTMP about System Runlevel Changes...
Jul  7 03:56:28 dcluster1-tiny-ghpc-0 systemd: Started Update UTMP about System Runlevel Changes.
Jul  7 03:56:28 dcluster1-tiny-ghpc-0 systemd: Startup finished in 706ms (kernel) + 1.666s (initrd) + 31.271s (
userspace) = 33.643s.
Jul  7 03:56:54 dcluster1-tiny-ghpc-0 wall[2282]: wall: user root broadcasted 1 lines (64 chars)
Jul  7 03:58:01 dcluster1-tiny-ghpc-0 systemd: Created slice User Slice of root.Jul  7 03:58:01 dcluster1-tiny-ghpc-0 systemd: Started Session 2 of user root.Jul  7 03:58:01 dcluster1-tiny-ghpc-0 systemd: Removed slice User Slice of root.
Jul  7 03:58:31 dcluster1-tiny-ghpc-0 wall[2393]: wall: user root broadcasted 1 lines (64 chars)
Jul  7 04:00:01 dcluster1-tiny-ghpc-0 systemd: Created slice User Slice of root.
Jul  7 04:00:01 dcluster1-tiny-ghpc-0 systemd: Started Session 3 of user root.
Jul  7 04:00:01 dcluster1-tiny-ghpc-0 systemd: Removed slice User Slice of root.
Jul  7 04:00:42 dcluster1-tiny-ghpc-0 systemd-logind: Removed session 1.

.....

(base) yaroslav_contextual_ai@dcluster1-tiny-ghpc-0:/var/log$ sudo tail -50 messages
Jul  8 04:10:01 dcluster1-tiny-ghpc-0 systemd: Removed slice User Slice of root.
Jul  8 04:11:05 dcluster1-tiny-ghpc-0 systemd: Starting Cleanup of Temporary Directories...
Jul  8 04:11:05 dcluster1-tiny-ghpc-0 systemd: Started Cleanup of Temporary Directories.
Jul  8 04:12:01 dcluster1-tiny-ghpc-0 systemd: Created slice User Slice of root.
Jul  8 04:12:01 dcluster1-tiny-ghpc-0 systemd: Started Session 754 of user root.
Jul  8 04:12:01 dcluster1-tiny-ghpc-0 systemd: Removed slice User Slice of root.
Jul  8 04:14:01 dcluster1-tiny-ghpc-0 systemd: Created slice User Slice of root.
Jul  8 04:14:01 dcluster1-tiny-ghpc-0 systemd: Started Session 755 of user root.
Jul  8 04:14:01 dcluster1-tiny-ghpc-0 systemd: Removed slice User Slice of root.
Jul  8 04:16:01 dcluster1-tiny-ghpc-0 systemd: Created slice User Slice of root.
Jul  8 04:16:01 dcluster1-tiny-ghpc-0 systemd: Started Session 756 of user root.
Jul  8 04:16:01 dcluster1-tiny-ghpc-0 systemd: Removed slice User Slice of root.
Jul  8 04:17:05 dcluster1-tiny-ghpc-0 systemd: Starting GCE Workload Certificate refresh...
Jul  8 04:17:05 dcluster1-tiny-ghpc-0 gce_workload_cert_refresh: 2023/07/08 04:17:05: Error getting config status, workload certificates may not be configured: HTTP 404
Jul  8 04:17:05 dcluster1-tiny-ghpc-0 gce_workload_cert_refresh: 2023/07/08 04:17:05: DoneJul  8 04:17:05 dcluster1-tiny-ghpc-0 systemd: Started GCE Workload Certificate refresh.
Jul  8 04:18:01 dcluster1-tiny-ghpc-0 systemd: Created slice User Slice of root.
Jul  8 04:18:01 dcluster1-tiny-ghpc-0 systemd: Started Session 757 of user root.
Jul  8 04:18:01 dcluster1-tiny-ghpc-0 systemd: Removed slice User Slice of root.
Jul  8 04:19:58 dcluster1-tiny-ghpc-0 systemd: Created slice User Slice of yaroslav_contextual_ai.
Jul  8 04:19:58 dcluster1-tiny-ghpc-0 systemd: Started Session 758 of user yaroslav_contextual_ai.
Jul  8 04:19:58 dcluster1-tiny-ghpc-0 systemd-logind: New session 758 of user yaroslav_contextual_ai.
Jul  8 04:20:01 dcluster1-tiny-ghpc-0 systemd: Created slice User Slice of root.
Jul  8 04:20:01 dcluster1-tiny-ghpc-0 systemd: Started Session 759 of user root.
Jul  8 04:20:01 dcluster1-tiny-ghpc-0 systemd: Removed slice User Slice of root.
Jul  8 04:22:01 dcluster1-tiny-ghpc-0 systemd: Created slice User Slice of root.
Jul  8 04:22:01 dcluster1-tiny-ghpc-0 systemd: Started Session 760 of user root.
Jul  8 04:22:01 dcluster1-tiny-ghpc-0 systemd: Removed slice User Slice of root.
Jul  8 04:24:01 dcluster1-tiny-ghpc-0 systemd: Created slice User Slice of root.
Jul  8 04:24:01 dcluster1-tiny-ghpc-0 systemd: Started Session 761 of user root.
Jul  8 04:24:01 dcluster1-tiny-ghpc-0 systemd: Removed slice User Slice of root.
Jul  8 04:26:01 dcluster1-tiny-ghpc-0 systemd: Created slice User Slice of root.
Jul  8 04:26:01 dcluster1-tiny-ghpc-0 systemd: Started Session 762 of user root.
Jul  8 04:26:01 dcluster1-tiny-ghpc-0 systemd: Removed slice User Slice of root.
Jul  8 04:28:01 dcluster1-tiny-ghpc-0 systemd: Starting GCE Workload Certificate refresh...
Jul  8 04:28:01 dcluster1-tiny-ghpc-0 systemd: Created slice User Slice of root.
Jul  8 04:28:01 dcluster1-tiny-ghpc-0 systemd: Started Session 763 of user root.
Jul  8 04:28:01 dcluster1-tiny-ghpc-0 systemd: Removed slice User Slice of root.
Jul  8 04:28:01 dcluster1-tiny-ghpc-0 gce_workload_cert_refresh: 2023/07/08 04:28:01: Error getting config status, workload certificates may not be configured: HTTP 404
Jul  8 04:28:01 dcluster1-tiny-ghpc-0 gce_workload_cert_refresh: 2023/07/08 04:28:01: Done
Jul  8 04:28:01 dcluster1-tiny-ghpc-0 systemd: Started GCE Workload Certificate refresh.
Jul  8 04:30:01 dcluster1-tiny-ghpc-0 systemd: Created slice User Slice of root.
Jul  8 04:30:01 dcluster1-tiny-ghpc-0 systemd: Started Session 764 of user root.
Jul  8 04:30:01 dcluster1-tiny-ghpc-0 systemd: Removed slice User Slice of root.
Jul  8 04:32:02 dcluster1-tiny-ghpc-0 systemd: Created slice User Slice of root.
Jul  8 04:32:02 dcluster1-tiny-ghpc-0 systemd: Started Session 765 of user root.
Jul  8 04:32:02 dcluster1-tiny-ghpc-0 systemd: Removed slice User Slice of root.
Jul  8 04:34:01 dcluster1-tiny-ghpc-0 systemd: Created slice User Slice of root.
Jul  8 04:34:01 dcluster1-tiny-ghpc-0 systemd: Started Session 766 of user root.
Jul  8 04:34:01 dcluster1-tiny-ghpc-0 systemd: Removed slice User Slice of root.

Slurm ctld logs

[2023-07-08T03:51:46.081] Slurmd shutdown completing
[2023-07-08T03:51:48.106] CPU frequency setting not configured for this node
[2023-07-08T03:51:48.109] slurmd version 22.05.9 started
[2023-07-08T03:51:48.111] slurmd started on Sat, 08 Jul 2023 03:51:48 +0000
[2023-07-08T03:51:48.112] CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=7818 TmpDisk=122668 Uptime=86154 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2023-07-08T04:02:14.344] Slurmd shutdown completing
[2023-07-08T04:02:17.370] CPU frequency setting not configured for this node
[2023-07-08T04:02:17.373] slurmd version 22.05.9 started
[2023-07-08T04:02:17.375] slurmd started on Sat, 08 Jul 2023 04:02:17 +0000
[2023-07-08T04:02:17.376] CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=7818 TmpDisk=122668 Uptime=86783 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2023-07-08T04:09:36.069] Slurmd shutdown completing

slurm setup.log

2023-07-06 00:35:27,549 INFO: Done setting up compute2023-07-07 03:56:54,245 DEBUG: get_metadata: metadata not found (http://metadata.google.internal/computeMetadata/v1/project/attributes/dcluster1-slurm-devel)
2023-07-07 03:56:54,246 DEBUG: fetch_devel_scripts: scripts not found in project metadata, devel mode not enabled2023-07-07 03:56:54,251 INFO: Setting up compute
2023-07-07 03:56:54,257 INFO: installing custom scripts: compute.d/ghpc_startup.sh,partition.d/tiny/ghpc_startup.sh2023-07-07 03:56:54,259 DEBUG: install_custom_scripts: compute.d/ghpc_startup.sh
2023-07-07 03:56:54,272 DEBUG: install_custom_scripts: partition.d/tiny/ghpc_startup.sh2023-07-07 03:56:54,322 INFO: Set up network storage
2023-07-07 03:56:54,332 INFO: Setting up mount (nfs) 10.36.95.66:/nfsshare to /home
2023-07-07 03:56:54,333 INFO: Setting up mount (nfs) dcluster1-controller:/opt/apps to /opt/apps2023-07-07 03:56:54,833 INFO: Waiting for '/home' to be mounted...
2023-07-07 03:56:54,837 INFO: Waiting for '/opt/apps' to be mounted...
2023-07-07 03:56:54,925 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-07-07 03:56:54,948 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '
['mount', '/home']' returned non-zero exit status 32.2023-07-07 03:56:55,926 INFO: Waiting for '/opt/apps' to be mounted...
2023-07-07 03:56:55,944 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Comma
nd '['mount', '/opt/apps']' returned non-zero exit status 32.2023-07-07 03:56:55,950 INFO: Waiting for '/home' to be mounted...
2023-07-07 03:56:56,042 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '
['mount', '/home']' returned non-zero exit status 32.2023-07-07 03:56:57,545 INFO: Waiting for '/opt/apps' to be mounted...
2023-07-07 03:56:57,561 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Comma
nd '['mount', '/opt/apps']' returned non-zero exit status 32.2023-07-07 03:56:57,644 INFO: Waiting for '/home' to be mounted...
2023-07-07 03:56:57,732 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '
['mount', '/home']' returned non-zero exit status 32.2023-07-07 03:57:00,122 INFO: Waiting for '/opt/apps' to be mounted...
2023-07-07 03:57:00,138 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Comma
2023-07-07 03:57:00,138 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.2023-07-07 03:57:00,293 INFO: Waiting for '/home' to be mounted...
2023-07-07 03:57:00,381 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.2023-07-07 03:57:04,236 INFO: Waiting for '/opt/apps' to be mounted...
2023-07-07 03:57:04,253 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-07-07 03:57:04,479 INFO: Waiting for '/home' to be mounted...2023-07-07 03:57:04,563 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-07-07 03:57:10,808 INFO: Waiting for '/opt/apps' to be mounted...2023-07-07 03:57:10,825 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-07-07 03:57:11,118 INFO: Waiting for '/home' to be mounted...2023-07-07 03:57:11,227 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-07-07 03:57:21,313 INFO: Waiting for '/opt/apps' to be mounted...
2023-07-07 03:57:21,332 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-07-07 03:57:21,714 INFO: Waiting for '/home' to be mounted...2023-07-07 03:57:21,812 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.2023-07-07 03:57:37,333 INFO: Waiting for '/opt/apps' to be mounted...2023-07-07 03:57:37,352 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.2023-07-07 03:57:37,813 INFO: Waiting for '/home' to be mounted...
2023-07-07 03:57:37,900 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.2023-07-07 03:58:31,411 DEBUG: get_metadata: metadata not found (http://metadata.google.internal/computeMetadata/v1/project/attributes/dcluster1-slurm-devel)
2023-07-07 03:58:31,412 DEBUG: fetch_devel_scripts: scripts not found in project metadata, devel mode not enabled
2023-07-07 03:58:31,416 INFO: Setting up compute
2023-07-07 03:58:31,421 INFO: installing custom scripts: compute.d/ghpc_startup.sh,partition.d/tiny/ghpc_startup.sh
2023-07-07 03:58:31,422 DEBUG: install_custom_scripts: compute.d/ghpc_startup.sh
2023-07-07 03:58:31,426 DEBUG: install_custom_scripts: partition.d/tiny/ghpc_startup.sh2023-07-07 03:58:31,438 INFO: Set up network storage
2023-07-07 03:58:31,451 INFO: Setting up mount (nfs) 10.36.95.66:/nfsshare to /home
2023-07-07 03:58:31,451 INFO: Setting up mount (nfs) dcluster1-controller:/opt/apps to /opt/apps2023-07-07 03:58:31,505 INFO: Waiting for '/home' to be mounted...
2023-07-07 03:58:31,508 INFO: Waiting for '/opt/apps' to be mounted...
2023-07-07 03:58:31,542 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-07-07 03:58:31,606 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.2023-07-07 03:58:32,544 INFO: Waiting for '/opt/apps' to be mounted...
2023-07-07 03:58:32,561 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.2023-07-07 03:58:32,607 INFO: Waiting for '/home' to be mounted...
2023-07-07 03:58:32,702 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.2023-07-07 03:58:34,162 INFO: Waiting for '/opt/apps' to be mounted...
2023-07-07 03:58:34,180 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.2023-07-07 03:58:34,304 INFO: Waiting for '/home' to be mounted...
2023-07-07 03:58:34,387 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.2023-07-07 03:58:36,741 INFO: Waiting for '/opt/apps' to be mounted...
2023-07-07 03:58:36,758 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-07-07 03:58:36,949 INFO: Waiting for '/home' to be mounted...
2023-07-07 03:58:37,063 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-07-07 03:58:40,855 INFO: Waiting for '/opt/apps' to be mounted...
2023-07-07 03:58:40,872 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-07-07 03:58:41,161 INFO: Waiting for '/home' to be mounted...
2023-07-07 03:58:41,270 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-07-07 03:58:47,427 INFO: Waiting for '/opt/apps' to be mounted...
2023-07-07 03:58:47,445 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-07-07 03:58:47,825 INFO: Waiting for '/home' to be mounted...
2023-07-07 03:58:47,920 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-07-07 03:58:57,932 INFO: Waiting for '/opt/apps' to be mounted...
2023-07-07 03:58:57,950 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-07-07 03:58:58,408 INFO: Waiting for '/home' to be mounted...
2023-07-07 03:58:58,537 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.

Packer custom image is missing storage-location option.

I am building a custom image for my HPC cluster deployment. Creating resources outside certain zones is restricted in my project. Therefore, I need to be able to choose the location where the image is stored.

In the GCP console GUI, I can specify it as in the figure below.

[screenshot: selecting the image storage location in the Cloud Console]

But if I create an image using the modules/packer/custom-image module, it defaults to Multi-regional and there is no way to choose a specific region. Therefore, I run into the following error:

==> project-cluster-images.googlecompute.toolkit_image: Startup script, if any, has finished running.
==> project-cluster-images.googlecompute.toolkit_image: Deleting instance...
    project-cluster-images.googlecompute.toolkit_image: Instance has been deleted!
==> project-cluster-images.googlecompute.toolkit_image: Creating image...
==> project-cluster-images.googlecompute.toolkit_image: Error waiting for image: googleapi: Error 412: Location us violates constraint constraints/gcp.resourceLocations on the resource projects/project/global/images/project-compute-image-20230405t110838z., conditionNotMet
==> project-cluster-images.googlecompute.toolkit_image: Deleting disk...
    project-cluster-images.googlecompute.toolkit_image: Disk has been deleted!
==> project-cluster-images.googlecompute.toolkit_image: Provisioning step had errors: Running the cleanup provisioner, if present...
Build 'project-cluster-images.googlecompute.toolkit_image' errored after 10 minutes 54 seconds: Error waiting for image: googleapi: Error 412: Location us violates constraint constraints/gcp.resourceLocations on the resource projects/project/global/images/project-compute-image-20230405t110838z., conditionNotMet

Snippet from my blueprint:

- group: packer-compute
  modules:
  - id: custom-compute-image
    source: modules/packer/custom-image
    kind: packer
    settings:
      source_image_project_id: [schedmd-slurm-public]
      source_image_family: schedmd-v5-slurm-22-05-8-hpc-centos-7
      disk_size: 50
      image_family: $(vars.compute_image)
      state_timeout: 15m
      zone: $(vars.zone)
      subnetwork_name: $(vars.subnetwork_name)

CLI gcloud example:

gcloud beta compute machine-images create image --project=project --source-instance=vm --source-instance-zone=northamerica-northeast1-c --storage-location=northamerica-northeast1

Would it be possible to add an equivalent of --storage-location to the modules/packer/custom-image module?
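For comparison, the equivalent capability for regular (non machine) images already exists as a flag on the gcloud CLI, which is what the request above asks the module to expose; the example below is a sketch with placeholder names.

# Placeholder names: create an image from a disk while pinning its storage location to a single region.
gcloud compute images create <image_name> \
  --project=<project> \
  --source-disk=<disk_name> \
  --source-disk-zone=northamerica-northeast1-c \
  --storage-location=northamerica-northeast1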

Ubuntu Slurm v5 deploys with CentOS

Describe the bug

This blueprint deploys with CentOS instead of Ubuntu 20.04:
https://github.com/GoogleCloudPlatform/hpc-toolkit/blob/v1.9.0/community/examples/slurm-gcp-v5-ubuntu2004.yaml

I'm trying to set up an HPC cluster with Slurm on Ubuntu and DDN Exascaler. Whenever I deploy my blueprint, a CentOS image is used instead of schedmd-v5-slurm-22-05-4-ubuntu-2004-lts. I verified that I cannot even deploy the example Ubuntu Slurm cluster in a brand new GCP project and with a newly built ghpc binary.

Steps to reproduce

Steps to reproduce the behavior:

  1. Make the ghpc binary:
git clone git@github.com:GoogleCloudPlatform/hpc-toolkit.git
cd hpc-toolkit
make
./ghpc --version
./ghpc --help
  2. Download this blueprint:
curl https://raw.githubusercontent.com/GoogleCloudPlatform/hpc-toolkit/main/community/examples/slurm-gcp-v5-ubuntu2004.yaml > slurm-gcp-v5-ubuntu2004.yaml
  3. Deploy the blueprint:
./ghpc create slurm-gcp-v5-ubuntu2004.yaml --vars project_id=<PROJECT_ID>

Expected behavior

The resulting Slurm cluster will use the image schedmd-v5-slurm-22-05-4-ubuntu-2004-lts and will function properly.

Actual behavior

A CentOS image is used; I believe it is the default schedmd-v5-slurm-22-05-2-hpc-centos-7, but I have already deleted the infrastructure. The Slurm configuration is also stuck in a failed state.

Version (ghpc --version): 1.9.0

Execution environment

  • OS: macOS
  • Shell: bash
  • go version: go version go1.19.3 darwin/amd64

Creating a router and NAT in a pre-existing VPC

Hi

I am trying to deploy Slurm in a pre-existing VPC, and my vpc module looks like the following.

module "network1" {
source = "./modules/embedded/modules/network/vpc"
deployment_name = var.deployment_name
project_id = var.subnetwork_project
region = var.region
network_name = var.network_name
subnetwork_name = var.subnetwork_name
}

Our GCP network setup is already in place, and we are using BGP and NAT. But this module creates a router and NAT, which will conflict with the existing infrastructure.

How can I prevent the deployment from creating any network resources?

Regards

Private static IP address won't set on the Slurm controller node.

Hello,

I am building a Slurm cluster and experiencing an issue when trying to set a static private IP address for my controller node. I have a static private IP address reserved in my VPC subnet (172.17.140.40). In my HPC Toolkit blueprint, I assign this reserved IP address by defining the network_ip setting as in the example below.

  - source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
    kind: terraform
    id: slurm_controller
    settings:
      machine_type: e2-standard-8
      instance_image:
        family: $(vars.controller_image)
        project: $(vars.project_id)
      enable_oslogin: False
      disk_type: pd-standard
      disk_size_gb: 50
      subnetwork_self_link: $(vars.subnet_name)
      network_ip: 172.17.140.40
      tags: ["iap"]
      service_account:
        email: $(vars.sa_id)
        scopes:
        - https://www.googleapis.com/auth/cloud-platform
        - https://www.googleapis.com/auth/monitoring.write
        - https://www.googleapis.com/auth/logging.write
        - https://www.googleapis.com/auth/devstorage.read_write
    use:
    - hpc_network
    - partition_0
    - homefs

If I look at the generated Terraform code, I can see the IP address reflected in main.tf:

module "slurm_controller" {
  source                    = "./modules/embedded/community/modules/scheduler/schedmd-slurm-gcp-v5-controller"
  deployment_name           = var.deployment_name
  disk_size_gb              = 50
  disk_type                 = "pd-standard"
  enable_oslogin            = false
  instance_image = {
    family  = var.controller_image
    project = var.project_id
  }
  labels            = merge(var.labels, { ghpc_role = "scheduler", })
  machine_type      = "e2-standard-8"
  network_ip        = "172.17.140.40"
  network_self_link = module.hpc_network.network_self_link
  network_storage   = flatten([module.homefs.network_storage, module.optfs.network_storage])
  partition         = flatten([module.partition_0.partition, module.partition_1.partition, module.partition_2.partition])
  project_id        = var.project_id
  region            = var.region
  service_account = {
    email  = var.sa_id
    scopes = ["https://www.googleapis.com/auth/cloud-platform", "https://www.googleapis.com/auth/monitoring.write", "https://www.googleapis.com/auth/logging.write", "https://www.googleapis.com/auth/devstorage.read_
write"]
  }
  subnetwork_self_link = var.subnet_name
  tags                 = ["iap"]
  zone                 = var.zone
}

Unfortunately, after the node boots, it does not get the static IP address assigned to it. Instead, it gets an automatically assigned IP.
I think the problem is that https://github.com/SchedMD/slurm-gcp/tree/5.6.2/terraform/slurm_cluster/modules/slurm_controller_instance assigns a static IP address to the node template here https://github.com/SchedMD/slurm-gcp/blob/5.6.2/terraform/slurm_cluster/modules/slurm_instance_template/main.tf#L90 instead of the node itself.

Forgive me if I am using the module incorrectly and this is expected behaviour; if not, is this a bug in the module? It effectively makes it impossible to set a static IP address for the controller node. Would you be able to have a look at this?

In the gcloud compute instances create I would use argument such as --network-interface=network-tier=PREMIUM,private-network-ip=172.17.140.40,stack-type=IPV4_ONLY,subnet=subnet

Thank you!
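
For comparison, the stock Terraform google provider pattern for pinning a reserved internal address to a single instance (rather than to an instance template) looks roughly like the sketch below. All names, the image, and the subnetwork path are placeholders, and this is not the slurm-gcp module's code; it only illustrates where network_ip would need to land for the controller to pick up the reserved address.

resource "google_compute_address" "controller_internal" {
  name         = "controller-internal-ip"
  address_type = "INTERNAL"
  address      = "172.17.140.40"
  region       = "us-central1"
  subnetwork   = "projects/my-project/regions/us-central1/subnetworks/my-subnet"
}

resource "google_compute_instance" "controller" {
  name         = "slurm-controller"
  machine_type = "e2-standard-8"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11" # placeholder image
    }
  }

  network_interface {
    subnetwork = google_compute_address.controller_internal.subnetwork
    # Assigns the reserved address to this specific instance, not to a template.
    network_ip = google_compute_address.controller_internal.address
  }
}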

Feature Request: Enable exposing certain Terraform values as outputs

It would be nice if one could specify in the YAML certain values to be included in root-level Terraform outputs.tf.

I'm thinking of cases where one stands up a Slurm cluster with a controller and one or more login nodes. Having the public (and private) IP addresses of those nodes exposed as Terraform outputs would let a user find that information quickly, rather than having to dig it out of the Terraform state file or know the instance names to query from Google.

One way that this could potentially be done is to introduce a new top-level YAML stanza of outputs, and list various resource outputs there.

The hpc-cluster-small.yaml example could then have this in it (It would require the Slurm resources to add a new output variable called host_info):

outputs:
- output: $(slurm_controller.host_info)
- output: $(slurm_login.host_info)

The top-level Terraform outputs.tf file could then include something like:

output "slurm_controller" {
    value = slurm_controller.network_interface[0].access_config[0].nat_ip
}
output "slurm_login" {
    value = slurm_login.network_interface[0].access_config[0].nat_ip
}
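
If the Slurm modules exposed the proposed host_info output, the generated root module could re-export it with ordinary Terraform output blocks that reference the modules, along these lines (host_info is the hypothetical output named above):

output "slurm_controller_host_info" {
  description = "Host details exposed by the controller module (hypothetical host_info output)"
  value       = module.slurm_controller.host_info
}

output "slurm_login_host_info" {
  description = "Host details exposed by the login module (hypothetical host_info output)"
  value       = module.slurm_login.host_info
}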

TKFE Deployed Cluster fails to initialize slurm

Describe the bug

Slurm fails to initialize on cluster deployed with TKFE

Steps to reproduce

Steps to reproduce the behavior:

  1. Deploy the frontend following instructions in the admin guide.
  2. Configure the frontend according to the instructions
  3. Create a cluster via the web frontend

Expected behavior

I would expect the cluster to be fully configured the same way that the clusters created with blueprints would be.

Actual behavior

Terraform deploys correctly and the login and controller nodes are created, but Slurm fails to initialize. "By default each cluster creates two shared filesystems: one at /opt/cluster to hold installed applications and one at /home to hold job files for individual users." These default Filestore instances are not created, and Slurm fails during initialization.

Version (ghpc --version)

Blueprint

If applicable, attach or paste the blueprint YAML used to produce the bug.

N/a

Expanded Blueprint

If applicable, please attach or paste the expanded blueprint. The expanded blueprint can be obtained by running ghpc expand your-blueprint.yaml.

Disregard if the bug occurs when running ghpc expand ... as well.

N/a

Output and logs

Slurm log /slurm/scripts/setup.log on login node

2023-04-17 22:39:48,078 DEBUG: fetch_devel_scripts: scripts not found in project metadata, devel mode not enabled
2023-04-17 22:39:48,082 INFO: Setting up login
2023-04-17 22:39:48,088 INFO: installing custom scripts: login_4nx80qw2.d/ghpc_startup.sh
2023-04-17 22:39:48,089 DEBUG: install_custom_scripts: login_4nx80qw2.d/ghpc_startup.sh
2023-04-17 22:39:48,092 INFO: Set up network storage
2023-04-17 22:39:48,129 INFO: Setting up mount (nfs) alpha9ee24-controller:/home to /home
2023-04-17 22:39:48,130 INFO: Setting up mount (nfs) alpha9ee24-controller:/opt/apps to /opt/apps
2023-04-17 22:39:48,130 INFO: Setting up mount (nfs) alpha9ee24-controller:/opt/cluster to /opt/cluster
2023-04-17 22:39:48,172 DEBUG: <module>: disabling prometheus support
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/more_executors/_impl/metrics/__init__.py", line 18, in <module>
    from .prometheus import PrometheusMetrics
  File "/usr/local/lib/python3.6/site-packages/more_executors/_impl/metrics/prometheus.py", line 3, in <module>
    import prometheus_client  # pylint: disable=import-error
ModuleNotFoundError: No module named 'prometheus_client'
2023-04-17 22:39:48,383 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:39:48,386 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:39:48,392 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:39:48,602 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:39:48,605 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:39:48,607 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:39:49,603 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:39:49,607 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:39:49,611 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:39:49,630 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:39:49,636 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:39:49,643 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:39:51,231 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:39:51,237 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:39:51,243 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:39:51,256 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:39:51,262 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:39:51,268 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:39:53,817 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:39:53,822 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:39:53,828 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:39:53,842 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:39:53,849 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:39:53,855 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:39:57,939 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:39:57,946 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:39:57,952 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:39:57,965 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:39:58,973 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:39:58,980 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:40:04,519 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:40:04,538 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:40:05,528 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:40:05,535 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:40:05,549 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:40:05,555 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:40:15,025 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:40:15,043 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:40:16,035 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:40:16,041 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:40:16,055 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:40:16,062 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:40:31,044 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:40:31,066 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:40:32,056 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:40:32,063 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:40:32,076 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:40:32,083 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:40:47,067 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:40:47,086 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:40:48,077 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:40:48,083 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:40:48,097 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:40:48,104 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:41:03,087 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:41:03,108 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:41:04,098 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:41:04,104 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:41:04,122 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:41:04,129 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:41:19,109 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:41:19,130 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:41:20,123 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:41:20,130 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:41:20,143 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:41:20,150 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:41:35,130 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:41:35,151 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:41:36,143 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:41:36,150 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:41:36,165 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:41:36,172 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:41:51,152 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:41:51,171 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:41:52,166 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:41:52,172 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:41:52,188 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:41:52,195 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:42:07,172 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:42:07,193 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:42:08,189 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:42:08,196 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:42:08,211 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:42:08,217 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:42:23,194 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:42:23,214 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:42:24,212 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:42:24,218 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:42:24,233 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:42:24,239 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:42:39,215 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:42:39,234 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:42:40,234 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:42:40,240 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:42:40,255 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:42:40,262 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:42:55,235 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:42:55,257 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:42:56,256 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:42:56,263 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:42:56,279 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:42:56,287 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:43:11,258 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:43:11,282 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:43:12,280 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:43:12,288 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:43:12,300 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:43:12,308 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:43:27,283 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:43:27,305 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:43:28,302 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:43:28,308 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:43:28,322 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:43:28,329 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:43:43,306 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:43:43,327 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:43:44,323 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:43:44,329 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:43:44,345 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:43:44,352 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:43:59,328 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:43:59,348 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:44:00,346 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:44:00,353 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:44:00,365 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:44:00,371 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:44:15,349 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:44:15,368 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:44:16,366 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:44:16,372 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:44:16,387 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:44:16,394 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:44:31,369 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:44:31,389 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:44:32,387 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:44:32,395 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:44:32,409 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:44:32,416 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:44:47,390 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:44:47,410 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:44:48,410 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:44:48,416 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:44:48,435 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:44:48,442 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:45:03,411 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:45:03,431 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:45:04,436 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:45:04,443 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:45:04,458 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:45:04,465 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:45:19,432 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:45:19,454 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:45:20,459 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:45:20,466 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:45:20,487 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:45:20,494 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:45:35,455 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:45:35,475 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:45:36,488 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:45:36,494 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:45:36,508 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:45:36,515 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:45:51,476 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:45:51,500 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:45:52,509 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:45:52,515 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:45:52,531 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:45:52,538 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:46:07,501 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:46:07,525 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:46:08,532 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:46:08,539 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:46:08,551 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:46:08,557 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:46:23,526 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:46:23,547 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:46:24,552 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:46:24,558 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:46:24,576 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:46:24,583 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:46:39,548 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:46:39,569 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:46:40,577 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:46:40,583 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:46:40,598 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:46:40,605 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:46:55,570 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:46:55,592 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:46:56,599 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:46:56,605 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:46:56,621 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:46:56,627 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:47:11,593 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:47:11,613 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:47:12,622 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:47:12,627 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:47:12,644 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:47:12,651 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:47:27,613 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:47:27,635 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:47:28,645 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:47:28,652 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:47:28,667 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:47:28,675 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:47:43,636 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:47:43,658 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:47:44,668 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:47:44,676 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:47:44,688 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:47:44,695 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:47:59,659 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:47:59,680 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:48:00,689 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:48:00,695 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:48:00,711 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:48:00,717 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:48:15,681 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:48:15,701 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:48:16,712 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:48:16,718 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:48:16,734 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:48:16,739 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:48:31,702 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:48:31,722 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:48:32,735 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:48:32,740 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:48:32,756 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:48:32,764 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:48:47,723 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:48:47,743 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:48:48,757 INFO: Waiting for '/home' to be mounted...
2023-04-17 22:48:48,765 INFO: Waiting for '/opt/apps' to be mounted...
2023-04-17 22:48:48,777 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-04-17 22:48:48,784 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
2023-04-17 22:49:03,744 INFO: Waiting for '/opt/cluster' to be mounted...
2023-04-17 22:49:03,764 ERROR: mount of path '/opt/cluster' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/cluster']' returned non-zero exit status 32.
2023-04-17 22:49:03,766 ERROR: CalledProcessError:
    command=['mount', '/opt/cluster']
    returncode=32
    stdout:

    stderr:
mount.nfs: access denied by server while mounting alpha9ee24-controller:/opt/cluster

2023-04-17 22:49:03,767 ERROR: Aborting setup...

Screenshots


Execution environment

TKFE Frontend GUI

Additional context

Terraform log attached.
terraformlog.txt

Spack broken on Slurm v5 due to hardcoded 300s custom script timeout

Describe the bug

When deploying a blueprint that includes long-running builds, SlurmGCP v5 has a hardcoded 300-second timeout for custom scripts and the deployment fails.
https://github.com/SchedMD/slurm-gcp/blob/v5.1.0/scripts/setup.py#L564

Please find the blueprint at the end of this report.

The resulting error log follows (taken from /slurm/scripts/setup.log):

2022-09-09 21:35:17,466 DEBUG: run_custom_scripts: custom scripts to run: /slurm/custom_scripts/(login_6onjpjm6.d/ghpc_startup.sh)
2022-09-09 21:35:17,466 INFO: running script ghpc_startup.sh
2022-09-09 21:40:17,488 ERROR: TimeoutExpired:
    command=/slurm/custom_scripts/login_6onjpjm6.d/ghpc_startup.sh
    timeout=300
    stdout:

Steps to reproduce

Steps to reproduce the behavior:

  1. ghpc create <yaml blueprint file>
  2. terraform -chdir=internal-hpc/primary init
  3. terraform -chdir=internal-hpc/primary validate
  4. terraform -chdir=internal-hpc/primary apply

Expected behavior

Just as in Slurm v4 I was able to build software that takes a long time to compile (like GROMACS and GCC), I should be able to achieve the same in Slurm v5.

Actual behavior

I cannot have a fully working setup that includes packages that take a long time to compile.

Version (ghpc --version)

ghpc version v1.4.1

Blueprint

---
blueprint_name: hpc-cluster
vars:
  project_id: internal-hpc
  deployment_name: internal-hpc
  region: europe-west4
  zone: europe-west4-b

# Documentation for each of the modules used below can be found at
# https://github.com/GoogleCloudPlatform/hpc-toolkit/blob/main/modules/README.md

deployment_groups:
- group: primary
  modules:
  - source: modules/network/vpc
    kind: terraform
    settings:
      mtu: 8896
    id: network1

  ## Filesystems
  - source: modules/file-system/filestore
    kind: terraform
    id: appsfs
    use: [network1]
    settings:
      local_mount: /sw
      filestore_tier: BASIC_HDD
      size_gb: 1024

  - source: modules/file-system/filestore
    kind: terraform
    id: homefs
    use: [network1]
    settings:
      local_mount: /home
      filestore_tier: BASIC_SSD
      size_gb: 3072

  ## Spack Install Scripts
  - source: community/modules/scripts/spack-install
    kind: terraform
    id: spack
    settings:
      install_dir: /sw/spack
      spack_url: https://github.com/spack/spack
      spack_ref: v0.18.1
      log_file: /var/log/spack.log
      configs:
      - type: single-config
        scope: defaults
        content: "config:build_stage:/sw/spack/spack-stage"
      - type: file
        scope: defaults
        content: |
          modules:
            default:
              tcl:
                hash_length: 0
                all:
                  conflict:
                    - '{name}'
                projections:
                  all: '{name}/{version}-{compiler.name}-{compiler.version}'
      compilers:
      - [email protected] target=x86_64
      packages:
      - [email protected]%[email protected] target=x86_64_v3
      - [email protected]%[email protected] target=x86_64_v3
      ## https://github.com/spack/spack/blob/v0.18.1/lib/spack/external/archspec/json/cpu/microarchitectures.json

  - source: modules/scripts/startup-script
    kind: terraform
    id: spack-startup
    settings:
      runners:
      - type: shell
        source: modules/startup-script/examples/install_ansible.sh
        destination: install_ansible.sh
      - $(spack.install_spack_deps_runner)
      - $(spack.install_spack_runner)

  - source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    kind: terraform
    id: c2d
    use:
    - network1
    - homefs
    - appsfs
    settings:
      partition_name: c2d
      node_count_dynamic_max: 10
      enable_placement: true
      machine_type: "c2d-highcpu-112"
      disk_type: "pd-standard"
      disable_smt: true
      bandwidth_tier: "gvnic_enabled"
      is_default: true

  - source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    kind: terraform
    id: c2
    use:
    - network1
    - homefs
    - appsfs
    settings:
      partition_name: c2
      node_count_dynamic_max: 20
      enable_placement: true
      machine_type: "c2-standard-60"
      disk_type: "pd-standard"
      disable_smt: true
      bandwidth_tier: "gvnic_enabled"

  - source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
    kind: terraform
    id: slurm_controller
    use:
    - network1
    - homefs
    - appsfs
    - c2d
    - c2
    settings:
      disk_type: "pd-ssd"
      disk_size_gb: 50
      machine_type: "t2d-standard-2"
      disable_controller_public_ips: true

  - source: community/modules/scheduler/schedmd-slurm-gcp-v5-login
    kind: terraform
    id: slurm_login
    use:
    - network1
    - homefs
    - appsfs
    - slurm_controller
    - spack-startup
    settings:
      disk_type: "pd-ssd"
      disk_size_gb: 50
      machine_type: "t2d-standard-32"

  - source: modules/monitoring/dashboard
    kind: terraform
    id: hpc_dashboard

Partition a208 misconfigured in hpc-enterprise-slurm.yaml

Partition a208 in hpc-enterprise-slurm.yaml doesn't appear to be properly configured for its machine type (a2-ultragpu-8g). The reason is that this machine has two sockets, and two-socket 8-GPU systems are usually split into two NUMA zones, 1 socket + 4 GPUs per zone.

When using asynchronous GPU memory transfers, applications can bypass the CPU and copy data directly between pinned host memory and the GPU, as long as both sit in the same zone. This is enabled by the pin_memory=True setting in PyTorch, which is standard practice.

However, that .yaml creates the Slurm allocation without topology awareness. You end up with 1-GPU jobs being assigned CPUs and GPUs from different zones, which gives errors like the one below when using PyTorch on that configuration:

srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x000000000001.

Creating Filestore via Front End is broken due to missing required setting (network_id)

When creating a new filestore via the front end interface, this errors out as the FS module that is generated as part of the blueprint is missing a required setting (see below).

[root@ofe front-end]# cat fs/fs_11/ghpc_create_log.stderr
2023/03/29 21:51:13 failed to apply deployment variables in modules when expanding the config: a required setting is missing from a module: Module ID: small-fs Setting: network_id

This is the blueprint that is automatically generated by the front end (broken):

blueprint_name: small-fs

vars:
  project_id: injae-sandbox-340804
  deployment_name: small-fs
  region: asia-southeast1
  zone: asia-southeast1-c
  labels:
    created_by: fggf-server

deployment_groups:
- group: primary
  modules:
  - source: modules/file-system/filestore
    kind: terraform
    id: small-fs
    settings:
      filestore_share_name: home
      network_name: sweet-poodle-network
      zone: asia-southeast1-c
      size_gb: 1024
      filestore_tier: BASIC_HDD
    outputs:
    - network_storage

By manually editing the blueprint to add a network module and reference it from the filestore module (instead of the network_name param in settings), the Filestore can now be deployed via the front end:

blueprint_name: small-fs

vars:
  project_id: injae-sandbox-340804
  deployment_name: small-fs
  region: asia-southeast1
  zone: asia-southeast1-c
  labels:
    created_by: fggf-server

deployment_groups:
- group: primary
  modules:
  - source: modules/network/pre-existing-vpc
    kind: terraform
    id: network
    settings:
      network_name: sweet-poodle-network
  - source: modules/file-system/filestore
    kind: terraform
    id: small-fs
    use: [network]
    settings:
      filestore_share_name: home
      zone: asia-southeast1-c
      size_gb: 1024
      filestore_tier: BASIC_HDD
    outputs:
    - network_storage

Question: Have there been changes to the ghpc validator that check whether a network module is used and enforce it (in the filestore module and other existing modules)?

OFE deployment fails when deploying from macOS.

Describe the bug

OFE deployment fails when deploying from macOS.

Steps to reproduce

Steps to reproduce the behaviour:

  1. clone develop branch of hpc-toolkit
  2. run community/front-end/ofe/deploy.sh on macOS
  3. deployment will fail.

Expected behaviour

Successfully deployed OFE in GCP

Actual behaviour

The OFE VM startup script fails because the OFE source code is put in the wrong directory. The startup script expects hpc-toolkit to be in /opt/gcluster/; however, when deployed from macOS it never ends up there.
Certain commands used in deploy.sh are not compatible with macOS. For example:

the BSD cp shipped with macOS does not support the --recursive long option, and BSD rm does not accept --force or --recursive (only the short -R/-r and -f forms).

Version (ghpc --version)

v1.13.0

Output and logs

Mar  3 11:06:52 mac-dep-server google_metadata_script_runner[1454]: startup-script-url: tar: Removing leading `../' from member names
Mar  3 11:06:52 mac-dep-server google_metadata_script_runner[1454]: startup-script-url: tar: ../._hpc-toolkit: Member name contains '..'
Mar  3 11:06:52 mac-dep-server google_metadata_script_runner[1454]: startup-script-url: tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.provenance'
Mar  3 11:06:52 mac-dep-server google_metadata_script_runner[1454]: startup-script-url: tar: ../hpc-toolkit/: Member name contains '..'

Execution environment

  • OS: macOS
  • Shell zsh
  • go version 1.19.5

Failed to download DDN exascaler module

Describe the bug

The DDN EXAScaler module cannot be downloaded from GitHub.

terraform output:

│ Error: Failed to download module
│ 
│ Could not download module "ddn_exascaler" (modules/DDN-EXAScaler/main.tf:38) source code from
│ "git::https://github.com/DDNStorage/exascaler-cloud-terraform.git?ref=78deadb": error downloading
│ 'https://github.com/DDNStorage/exascaler-cloud-terraform.git?ref=78deadb': /usr/bin/git exited with 128: Cloning into
│ '.terraform/modules/tmplustrefs.ddn_exascaler'...
│ remote: Repository not found.
│ fatal: repository 'https://github.com/DDNStorage/exascaler-cloud-terraform.git/' not found

Steps to reproduce

Steps to reproduce the behavior:

  1. example blueprint:
blueprint_name: create-image

vars:
  project_id:  ## GCP Project ID ##
  ssh_username: ## Local username ##
  deployment_name: create-image
  region: us-central1
  zone: us-central1-c
  startup_timeouts: 300
  network_name: slurm-gcp-v5-net
  subnetwork_name: slurm-gcp-v5-primary-subnet

deployment_groups:
- group: tmplustre-and-scripts
  modules:
  - id: network0
    source: modules/network/pre-existing-vpc
  - id: tmplustrefs
    source: community/modules/file-system/DDN-EXAScaler
    use: [network0]
    settings:
      local_mount: /mnt/exacloud
      mgs:
        node_type: n2-standard-2
        node_cpu: Intel Cascade Lake
        nic_type: GVNIC
        public_ip: true
        node_count: 1
      mds:
        node_type: n2-standard-2
        node_cpu: Intel Cascade Lake
        nic_type: GVNIC
        public_ip: true
        node_count: 1
      mdt:
        disk_bus: SCSI
        disk_type: pd-ssd
        disk_size: 100
        disk_count: 1
        disk_raid: false
      oss:
        node_type: n2-standard-2
        node_cpu: Intel Cascade Lake
        nic_type: GVNIC
        public_ip: true
        node_count: 1
      ost:
        disk_bus: SCSI
        disk_type: pd-standard
        disk_size: 500
        disk_count: 1
        disk_raid: false
  2. deploy:
ghpc create blueprints/create-dvmdostem-image.yaml --vars project_id=$PROJECTID --vars ssh_username=$USER
terraform -chdir=create-image/tmplustre-and-scripts init
terraform -chdir=create-image/tmplustre-and-scripts validate
terraform -chdir=create-image/tmplustre-and-scripts apply

Expected behavior

DDN Exascaler lustre filesystem will be created

Actual behavior

The module cannot be found

Version (ghpc --version)

ghpc version v1.9.0
Built from 'main' branch.
Commit info: v1.9.0-0-g0f0f70ea

Expanded Blueprint

If applicable, please attach or paste the expanded blueprint. The expanded blueprint can be obtained by running ghpc expand your-blueprint.yaml.

Disregard if the bug occurs when running ghpc expand ... as well.

Output and logs

│ Error: Failed to download module
│ 
│ Could not download module "ddn_exascaler" (modules/DDN-EXAScaler/main.tf:38) source code from
│ "git::https://github.com/DDNStorage/exascaler-cloud-terraform.git?ref=78deadb": error downloading
│ 'https://github.com/DDNStorage/exascaler-cloud-terraform.git?ref=78deadb': /usr/bin/git exited with 128: Cloning into
│ '.terraform/modules/tmplustrefs.ddn_exascaler'...
│ remote: Repository not found.
│ fatal: repository 'https://github.com/DDNStorage/exascaler-cloud-terraform.git/' not found
│ 
╵

co-tkosciuch:GCP-Slurm coadmin$ terraform -chdir=create-image/tmplustre-and-scripts validate
╷
│ Error: Module not installed
│ 
│   on modules/DDN-EXAScaler/main.tf line 38:
│   38: module "ddn_exascaler" {
│ 
│ This module is not yet installed. Run "terraform init" to install all modules required by this configuration.

Execution environment

  • OS: macOS
  • Shell: bash
  • go version: go version go1.19.3 darwin/amd64

[RFE] Parallel Filestore deploy and destroy

Currently, when deploying a blueprint with two Filestore instances, they are deployed one after the other.
This is suboptimal because there is no technical reason to wait for the first Filestore instance to be ready before starting the next deployment.
Likewise, when destroying a cluster, Filestore resource deletion happens sequentially.

I'd like to propose improving this process and considerably shrinking both deployment and deletion timelines.

WDYT?

User management best practices/examples

Hi, I'm trying to set up a basic cluster for our group and I'm very pleased with the documentation and ease of setting things up.
What I'm missing, though, is how to easily manage users across the whole cluster: I can add users manually to the login node, but (obviously) these are not distributed to the other nodes. Is there a good way to do this without getting into a NIS or LDAP setup? Thank you!

remote-desktop module should complete the setup process on behalf of the user

Describe the feature request

The remote-desktop module currently requires the user to use https://remotedesktop.google.com/ in order to copy and paste a chromoting setup command snippet into an SSH session.

This is required because an access code needs to be negotiated and passed to the chrome-remote-desktop/start-host command in order to complete the authorization flow.

It would be nice if this access token could instead be negotiated before deployment, or interactively during the chrome-remote-desktop startup sequence, without requiring the user to SSH into the instance.

How to use image-builder.yaml to install a docker image to template VM

Hi,

I'm using the image-builder blueprint to create an auto-scaling HPC cluster that boots each compute node from a template image. I have it working so that the template image has Docker installed, which I achieved by modifying the contents of the startup-script module. However, it would be very useful if I could also pull a private Docker repo/image into the template image, since then I won't have to run docker pull each time a compute node is booted, which is very time consuming (26 GB image).

I envisage there are two ways of achieving this, however I haven't fully worked out the plan for either.

  1. Update the contents of the startup script so that it pulls the Docker image. However, this would require authenticating the temporary VM, which I understand to be an interactive process (requiring an authentication code or key, and I don't know how I would supply this to the process); see the sketch after this message.

  2. Create a VM template based on the template generated by Packer, and in that VM pull the Docker image. Save this new VM as a template and specify it in the blueprint rather than specifying the Packer-generated VM template. The issue I have encountered here is being unable to actually create a VM based on a Packer template; each time I try, it never appears. I am wondering if this is because the template VM is meant to auto-close / delete itself?

Any advice on either of these options would be greatly appreciated.

Thanks,
Noah
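
Regarding option 1: pulling from a Google-hosted registry does not have to be interactive, because the temporary Packer build VM can authenticate Docker with its attached service account via gcloud. The snippet below is only a sketch under that assumption; the source name and image path are placeholders, and the build VM's service account must already have read access to the registry.

build {
  sources = ["source.googlecompute.toolkit_image"] # hypothetical source name

  provisioner "shell" {
    inline = [
      # Let Docker use the build VM's service account credentials for gcr.io.
      "sudo gcloud auth configure-docker --quiet",
      # Placeholder image path; replace with the private repository to bake in.
      "sudo docker pull gcr.io/my-project/my-image:latest",
    ]
  }
}

Because the pull happens during the Packer build, the resulting image already contains the large layer and compute nodes can boot without re-pulling it.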

Ensure Terraform enables the required GCP API

Currently, the HPC Toolkit asks the cloud operator to manually enable the required GCP resource API(s).
It would be nice if Terraform enabled them automatically. Additionally, it would make the overall solution less error-prone.

https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/google_project_service
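
With the standard google provider this is typically a google_project_service resource per API, roughly as in the sketch below (the Filestore API matches the error further down; whether the Toolkit should manage this per module or centrally is an open design question):

variable "project_id" {
  type = string
}

resource "google_project_service" "filestore" {
  project = var.project_id
  service = "file.googleapis.com"

  # Keep the API enabled when this deployment is destroyed, since other
  # workloads in the project may still depend on it.
  disable_on_destroy = false
}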

See the following sample error, generated from an empty GCP project where the file.googleapis.com API was not enabled.

│ Error: Error creating Instance: googleapi: Error 403: Cloud Filestore API has not been used in project XXXXXXXXXX before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/file.googleapis.com/overview?project=XXXXXXXXXX then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.
│ Details:
│ [
│   {
│     "@type": "type.googleapis.com/google.rpc.Help",
│     "links": [
│       {
│         "description": "Google developers console API activation",
│         "url": "https://console.developers.google.com/apis/api/file.googleapis.com/overview?project=XXXXXXXXXX"
│       }
│     ]
│   },
│   {
│     "@type": "type.googleapis.com/google.rpc.ErrorInfo",
│     "domain": "googleapis.com",
│     "metadata": {
│       "consumer": "projects/XXXXXXXXXX",
│       "service": "file.googleapis.com"
│     },
│     "reason": "SERVICE_DISABLED"
│   }
│ ]
│
│   with module.homefs.google_filestore_instance.filestore_instance,
│   on modules/filestore/main.tf line 35, in resource "google_filestore_instance" "filestore_instance":
│   35: resource "google_filestore_instance" "filestore_instance" {
│

Spack not really usable by non-root users

Describe the bug

In a freshly built environment, when I try to install new software through Spack, I need to run it with sudo (just like any other package manager); otherwise I get some basic filesystem permission issues:

==> Waiting for intel-mpi-2018.4.274-7d33ducjgmxpdqaksszmp3bplefkug6x
==> Error: Failed to acquire a write lock for intel-mpi-2018.4.274-7d33ducjgmxpdqaksszmp3bplefkug6x due to LockROFileError: Can't take write lock on read-only file: /sw/spack/opt/spack/.spack-db/prefix_lock
==> Error: Can't take write lock on read-only file: /sw/spack/opt/spack/.spack-db/prefix_lock

I guess, then the question is: is Spack supposed to be consumed by regular end-users?

  • if the answer is yes, then we need to fix the permissions of the entire /sw/spack directory (consistently, I'd like to add)
  • if the answer is no, then we first need to document how to run Spack (likely with sudo -E) and then try to add it to the PATH while sudoing, without forcing users to use full paths
[admin@internalhp-login-4sses53j-001 ~]$ which spack
/sw/spack/bin/spack
[admin@internalhp-login-4sses53j-001 ~]$ spack --version
0.18.1
[admin@internalhp-login-4sses53j-001 ~]$ sudo spack --version
sudo: spack: command not found
[admin@internalhp-login-4sses53j-001 ~]$ sudo -E spack --version
sudo: spack: command not found
[admin@internalhp-login-4sses53j-001 ~]$ sudo env|grep -i spack
[admin@internalhp-login-4sses53j-001 ~]$ sudo -E env|grep ^PATH
PATH=/sbin:/bin:/usr/sbin:/usr/bin
[admin@internalhp-login-4sses53j-001 ~]$ sudo -E env|grep -i spack
SPACK_ROOT=/sw/spack
__LMOD_REF_COUNT_MODULEPATH=/sw/spack/share/spack/modules/linux-centos7-x86_64:1;/sw/spack/share/spack/modules/linux-centos7-x86_64_v3:1;/etc/modulefiles:1;/usr/share/modulefiles:1;/opt/apps/modulefiles:1
SPACK_LD_LIBRARY_PATH=/usr/local/cuda/lib64:
MODULEPATH=/sw/spack/share/spack/modules/linux-centos7-x86_64:/sw/spack/share/spack/modules/linux-centos7-x86_64_v3:/etc/modulefiles:/usr/share/modulefiles:/opt/apps/modulefiles
SPACK_PYTHON=/usr/bin/python3
[admin@internalhp-login-4sses53j-001 ~]$ sudo -E /sw/spack/bin/spack compilers
==> Available compilers
-- clang centos7-x86_64 -----------------------------------------
[email protected]

-- gcc centos7-x86_64 -------------------------------------------
[email protected]  [email protected]

Please note that even with sudo -E, spack is missing from the PATH because sudo enforces the default secure_path.

WDYT?

NFS server file system bug

If you find a similar existing issue, please comment on that issue instead of creating a new one.

If you are submitting a feature request, please start a discussion instead of creating an issue.

Describe the bug

When using nfs-server as a file system, the boot disk of the nfs-server instance is shared while the attached disk remains unmounted, which leads to a smaller than expected shared volume size; in this configuration the additional disk contains CentOS and a 20 GB file system. In some cases it can be the other way around: the attached disk gets mounted as root and is shared, while the boot disk remains unmounted.

Steps to reproduce

Steps to reproduce the behavior:

  1. Create hpc cluster project with nfs-server as file system

Expected behavior

Boot disk mounted as root; additional disk mounted as /exports/data.

Actual behavior

Boot disk mounted as root and the additional disk is not mounted at all, OR the additional disk is mounted as root and the boot disk is not mounted at all.

Screenshots

NFS server on the left, controller node on the right [screenshot omitted]


cannot run 'make' to build HPC Toolkit

The problem seems to be in the Makefile with this check:

PK_VERSION_CHECK=$(shell expr `packer version | head -n1 | cut -f 2- -d ' ' | cut -c 2-` \>= 1.6)

On some Linux distributions packer means something else. For example:

/usr/sbin/packer
[lining@tulane hpc-toolkit (develop)]$ ls -l /usr/sbin/packer
lrwxrwxrwx. 1 root root 15 May 11  2019 /usr/sbin/packer -> cracklib-packer
[lining@tulane hpc-toolkit (develop)]$

This will cause make to fail miserably.

HashiCorp suggests using the alternative name packer.io.
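A hedged workaround sketch, assuming the HashiCorp binary is installed as packer.io: put a correctly named shim earlier in the PATH so the Makefile version check finds the right tool.

mkdir -p "$HOME/bin"
ln -sf "$(command -v packer.io)" "$HOME/bin/packer"
export PATH="$HOME/bin:$PATH"
make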

Cannot create worker node

Hi

I ran the command "srun -N 1 hostname" on the login node. It tries to spin up the VM "hpcsmall-debug-ghpc-0" but fails to create it. I found an error in the GCP logs with the message: "Error removing user: mkdir /home/packer/.ssh: no such file or directory."

Full error details are provided hereafter.

{
insertId: "1kw5o1beyxbxu"
jsonPayload: {
localTimestamp: "2023-07-13T16:06:35.1660Z"
message: "Error removing user: mkdir /home/packer/.ssh: no such file or directory."
omitempty: null
}
labels: {
instance_name: "hpcsmall-debug-ghpc-0"
}
logName: "projects/prj-n-005-cloudops-618d/logs/GCEGuestAgent"
receiveTimestamp: "2023-07-13T16:06:35.710523191Z"
resource: {
labels: {
instance_id: "3405041953608146457" (instance_name: hpcsmall-debug-ghpc-0)
project_id: "prj-n-005-cloudops-618d"
zone: "northamerica-northeast1-c"
}
type: "gce_instance"
}
severity: "ERROR"
sourceLocation: {
file: "non_windows_accounts.go"
function: "main.(*accountsMgr).set"
line: "161"
}
timestamp: "2023-07-13T16:06:35.166029824Z"
}

We deployed a SLURM cluster from the following link
https://cloud.google.com/hpc-toolkit/docs/quickstarts/slurm-cluster

remote-desktop module should provision remote chromoting tools

Describe the bug

After deploying https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/community/modules/remote-desktop/chrome-remote-desktop, SSHing into the VM instance, and running the chromoting setup command snippet according to the provided instructions, the chromoting tools don't appear to be installed on the system.

Steps to reproduce

https://github.com/GoogleCloudPlatform/hpc-toolkit/blob/main/community/modules/remote-desktop/chrome-remote-desktop/README.md#setting-up-the-remote-desktop

Expected behavior

The setup process continues and asks the user to enter a PIN.

Actual behavior

The chromoting setup command snippet fails with the following error:

-bash: /opt/google/chrome-remote-desktop/start-host: No such file or directory

Version (ghpc --version)

hpc-toolkit 👺 ./ghpc --version
ghpc version - not built from official release
Built from 'develop' branch.
Commit info: v1.14.0-84-g907d0e7d

Blueprint

blueprint_name: remote-desktop

vars:
  project_id: catx-demo-radlab
  deployment_name: radlab-remote-desktop
  region: us-central1
  zone: us-central1-c

deployment_groups:
- group: primary
  modules:
  - id: network1
    source: modules/network/vpc

  - id: remote-desktop
    source: community/modules/remote-desktop/chrome-remote-desktop
    use: [network1]
    settings:
      install_nvidia_driver: true

Expanded Blueprint

If applicable, please attach or paste the expanded blueprint. The expanded blueprint can be obtained by running ghpc expand your-blueprint.yaml.

Disregard if the bug occurs when running ghpc expand ... as well.

blueprint_name: remote-desktop
validators:
  - validator: test_module_not_used
    inputs: {}
  - validator: test_project_exists
    inputs:
      project_id: ((var.project_id))
  - validator: test_apis_enabled
    inputs: {}
  - validator: test_region_exists
    inputs:
      project_id: ((var.project_id))
      region: ((var.region))
  - validator: test_zone_exists
    inputs:
      project_id: ((var.project_id))
      zone: ((var.zone))
  - validator: test_zone_in_region
    inputs:
      project_id: ((var.project_id))
      region: ((var.region))
      zone: ((var.zone))
validation_level: 1
vars:
  deployment_name: radlab-remote-desktop
  labels:
    ghpc_blueprint: remote-desktop
    ghpc_deployment: radlab-remote-desktop
  project_id: catx-demo-radlab
  region: us-central1
  zone: us-central1-c
deployment_groups:
  - group: primary
    terraform_backend:
      type: ""
      configuration: {}
    modules:
      - source: modules/network/vpc
        kind: terraform
        id: network1
        modulename: ""
        use: []
        wrapsettingswith: {}
        settings:
          deployment_name: ((var.deployment_name))
          project_id: ((var.project_id))
          region: ((var.region))
        required_apis:
          ((var.project_id)):
            - compute.googleapis.com
      - source: community/modules/remote-desktop/chrome-remote-desktop
        kind: terraform
        id: remote-desktop
        modulename: ""
        use:
          - network1
        wrapsettingswith: {}
        settings:
          deployment_name: ((var.deployment_name))
          install_nvidia_driver: true
          labels:
            ghpc_role: remote-desktop
          network_self_link: ((module.network1.network_self_link))
          project_id: ((var.project_id))
          region: ((var.region))
          subnetwork_self_link: ((module.network1.subnetwork_self_link))
          zone: ((var.zone))
        required_apis:
          ((var.project_id)): []
    kind: terraform
terraform_backend_defaults:
  type: ""
  configuration: {}

Execution environment

  • OS: Linux proppy0 5.19.11-1rodete1-amd64 #1 SMP PREEMPT_DYNAMIC Debian 5.19.11-1rodete1 (2022-10-31) x86_64 GNU/Linux
  • Shell (To find this, run ps -p $$): 82709 pts/2 00:00:00 bash
  • go version: go version go1.20.1 linux/amd64

Additional context

This can be easily worked around by installing https://dl.google.com/linux/direct/chrome-remote-desktop_current_amd64.deb after SSHing into the VM and before running the chromoting setup command snippet.
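A minimal sketch of that workaround, assuming a Debian/Ubuntu guest with apt available:

curl -LO https://dl.google.com/linux/direct/chrome-remote-desktop_current_amd64.deb
sudo apt-get install -y ./chrome-remote-desktop_current_amd64.deb
# then re-run the chromoting setup command snippet from the module README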

DDN Exascaler not auto mounting to Slurm

Describe the bug

I am deploying a Slurm cluster with the DDN EXAScaler filesystem, but the mount keeps failing and retrying until it eventually stops in a failed state.

Steps to reproduce

Steps to reproduce the behavior:

./ghpc create blueprint.yaml --vars project_id=<PROJECT_ID>
terraform -chdir=slurm-lustre-v5/primary init
terraform -chdir=slurm-lustre-v5/primary validate
terraform -chdir=slurm-lustre-v5/primary apply

Expected behavior

The lustre filesystem will be mounted at the expected mount directory.

Actual behavior

The mount fails on both the login and controller Slurm nodes. When SSHing into either the login or controller node, this message appears: *** Slurm setup failed! Please view log: /slurm/scripts/setup.log ***

Version (ghpc --version)

1.9.0

Blueprint

As you can see from the commented-out sections of the blueprint, I tried mounting Lustre the way I expected it to work with use, and also with the mount-at-startup workaround. Both failed for me. I also tried with 2 OSS nodes and that failed to mount as well.

blueprint_name: slurm-and-lustre-v5

vars:
  project_id:  ## Set GCP Project ID Here ##
  deployment_name: slurm-lustre-v5
  region: us-central1
  zone: us-central1-c
  instance_image:
    family: schedmd-v5-slurm-22-05-4-ubuntu-2004-lts
    project: projects/schedmd-slurm-public/global/images/family

deployment_groups:
- group: primary
  modules:
  # Source is an embedded module, denoted by "modules/*" without ./, ../, /
  # as a prefix. To refer to a local or community module, prefix with ./, ../ or /
  # Example - ./modules/network/pre-existing-vpc
  - id: network1
    source: modules/network/pre-existing-vpc
    settings:
      network_name: slurm-gcp-v5-net
      subnetwork_name: slurm-gcp-v5-primary-subnet

  - id: homefs
    source: modules/file-system/pre-existing-network-storage
    settings:
      server_ip: <NFS_IP>
      remote_mount: /nfsshare
      local_mount: /home
      fs_type: nfs

  - id: lustrefs
    source: community/modules/file-system/DDN-EXAScaler
    use: [network1]
    settings:
      local_mount: /mnt/lustre
      mgs:
        node_type: n2-standard-2
        node_cpu: Intel Cascade Lake
        nic_type: GVNIC
        public_ip: true
        node_count: 1
      mds:
        node_type: n2-standard-2
        node_cpu: Intel Cascade Lake
        nic_type: GVNIC
        public_ip: true
        node_count: 1
      mdt:
        disk_bus: SCSI
        disk_type: pd-ssd
        disk_size: 100
        disk_count: 1
        disk_raid: false
      oss:
        node_type: n2-standard-2
        node_cpu: Intel Cascade Lake
        nic_type: GVNIC
        public_ip: true
        node_count: 1
      ost:
        disk_bus: SCSI
        disk_type: pd-standard
        disk_size: 500
        disk_count: 1
        disk_raid: false
        
  # - id: mount-at-startup
  #   source: modules/scripts/startup-script
  #   settings:
  #     runners:
  #     - $(lustrefs.install_ddn_lustre_client_runner)
  #     - $(lustrefs.mount_runner)

  - id: debug_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      node_count_dynamic_max: 4
      machine_type: n2-standard-2

  - id: debug_partition
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use:
    - network1
    - homefs
    # - mount-at-startup
    - lustrefs
    - debug_node_group
    settings:
      partition_name: debug
      enable_placement: false
      is_default: true

  - id: compute_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      node_count_dynamic_max: 20

  - id: compute_partition
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use:
    - network1
    - homefs
    # - mount-at-startup
    - lustrefs
    - compute_node_group
    settings:
      partition_name: compute

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
    use:
    - network1
    - homefs
    # - mount-at-startup
    - lustrefs
    - debug_partition
    - compute_partition
    settings:
      machine_type: n2-standard-2
      disk_type: pd-standard
      source_image_family: schedmd-v5-slurm-22-05-4-ubuntu-2004-lts
      source_image_project: projects/schedmd-slurm-public/global/images/family

  - id: slurm_login
    source: community/modules/scheduler/schedmd-slurm-gcp-v5-login
    use:
    - network1
    - slurm_controller
    settings:
      machine_type: n2-standard-2
      disk_type: pd-standard
      disable_login_public_ips: false
      source_image_family: schedmd-v5-slurm-22-05-4-ubuntu-2004-lts
      source_image_project: projects/schedmd-slurm-public/global/images/family

  - id: hpc_dashboard
    source: modules/monitoring/dashboard
    outputs: [instructions]

Expanded Blueprint

If applicable, please attach or paste the expanded blueprint. The expanded blueprint can be obtained by running ghpc expand your-blueprint.yaml.

Disregard if the bug occurs when running ghpc expand ... as well.

blueprint_name: slurm-and-lustre-v5
validators:
  - validator: test_project_exists
    inputs:
      project_id: ((var.project_id))
  - validator: test_apis_enabled
    inputs: {}
  - validator: test_region_exists
    inputs:
      project_id: ((var.project_id))
      region: ((var.region))
  - validator: test_zone_exists
    inputs:
      project_id: ((var.project_id))
      zone: ((var.zone))
  - validator: test_zone_in_region
    inputs:
      project_id: ((var.project_id))
      region: ((var.region))
      zone: ((var.zone))
validation_level: 1
vars:
  deployment_name: slurm-lustre-v5
  instance_image:
    family: schedmd-v5-slurm-22-05-4-ubuntu-2004-lts
    project: projects/schedmd-slurm-public/global/images/family
  labels:
    ghpc_blueprint: slurm-and-lustre-v5
    ghpc_deployment: slurm-lustre-v5
  project_id: <PROJECT_ID>
  region: us-central1
  zone: us-central1-c
deployment_groups:
  - group: primary
    terraform_backend:
      type: ""
      configuration: {}
    modules:
      - source: modules/network/pre-existing-vpc
        kind: terraform
        id: network1
        modulename: ""
        use: []
        wrapsettingswith: {}
        settings:
          network_name: slurm-gcp-v5-net
          project_id: ((var.project_id))
          region: ((var.region))
          subnetwork_name: slurm-gcp-v5-primary-subnet
        required_apis:
          ((var.project_id)):
            - compute.googleapis.com
      - source: modules/file-system/pre-existing-network-storage
        kind: terraform
        id: homefs
        modulename: ""
        use: []
        wrapsettingswith: {}
        settings:
          fs_type: nfs
          local_mount: /home
          remote_mount: /nfsshare
          server_ip: <NFS_IP>
        required_apis:
          ((var.project_id)): []
      - source: community/modules/file-system/DDN-EXAScaler
        kind: terraform
        id: lustrefs
        modulename: ""
        use:
          - network1
        wrapsettingswith: {}
        settings:
          labels:
            ghpc_role: file-system
          local_mount: /mnt/lustre
          mds:
            nic_type: GVNIC
            node_count: 1
            node_cpu: Intel Cascade Lake
            node_type: n2-standard-2
            public_ip: true
          mdt:
            disk_bus: SCSI
            disk_count: 1
            disk_raid: false
            disk_size: 100
            disk_type: pd-ssd
          mgs:
            nic_type: GVNIC
            node_count: 1
            node_cpu: Intel Cascade Lake
            node_type: n2-standard-2
            public_ip: true
          network_self_link: ((module.network1.network_self_link))
          oss:
            nic_type: GVNIC
            node_count: 1
            node_cpu: Intel Cascade Lake
            node_type: n2-standard-2
            public_ip: true
          ost:
            disk_bus: SCSI
            disk_count: 1
            disk_raid: false
            disk_size: 500
            disk_type: pd-standard
          project_id: ((var.project_id))
          subnetwork_address: ((module.network1.subnetwork_address))
          subnetwork_self_link: ((module.network1.subnetwork_self_link))
          zone: ((var.zone))
        required_apis:
          ((var.project_id)):
            - compute.googleapis.com
            - deploymentmanager.googleapis.com
            - iam.googleapis.com
            - runtimeconfig.googleapis.com
      - source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
        kind: terraform
        id: debug_node_group
        modulename: ""
        use: []
        wrapsettingswith: {}
        settings:
          instance_image: ((var.instance_image))
          labels:
            ghpc_role: compute
          machine_type: n2-standard-2
          node_count_dynamic_max: 4
          project_id: ((var.project_id))
        required_apis:
          ((var.project_id)): []
      - source: community/modules/compute/schedmd-slurm-gcp-v5-partition
        kind: terraform
        id: debug_partition
        modulename: ""
        use:
          - network1
          - homefs
          - lustrefs
          - debug_node_group
        wrapsettingswith:
          network_storage:
            - flatten(
            - )
          node_groups:
            - flatten(
            - )
        settings:
          deployment_name: ((var.deployment_name))
          enable_placement: false
          is_default: true
          labels:
            ghpc_role: compute
          network_storage:
            - ((module.homefs.network_storage))
            - ((module.lustrefs.network_storage))
          node_groups:
            - ((module.debug_node_group.node_groups))
          partition_name: debug
          project_id: ((var.project_id))
          region: ((var.region))
          subnetwork_self_link: ((module.network1.subnetwork_self_link))
          zone: ((var.zone))
        required_apis:
          ((var.project_id)):
            - compute.googleapis.com
      - source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
        kind: terraform
        id: compute_node_group
        modulename: ""
        use: []
        wrapsettingswith: {}
        settings:
          instance_image: ((var.instance_image))
          labels:
            ghpc_role: compute
          node_count_dynamic_max: 20
          project_id: ((var.project_id))
        required_apis:
          ((var.project_id)): []
      - source: community/modules/compute/schedmd-slurm-gcp-v5-partition
        kind: terraform
        id: compute_partition
        modulename: ""
        use:
          - network1
          - homefs
          - lustrefs
          - compute_node_group
        wrapsettingswith:
          network_storage:
            - flatten(
            - )
          node_groups:
            - flatten(
            - )
        settings:
          deployment_name: ((var.deployment_name))
          labels:
            ghpc_role: compute
          network_storage:
            - ((module.homefs.network_storage))
            - ((module.lustrefs.network_storage))
          node_groups:
            - ((module.compute_node_group.node_groups))
          partition_name: compute
          project_id: ((var.project_id))
          region: ((var.region))
          subnetwork_self_link: ((module.network1.subnetwork_self_link))
          zone: ((var.zone))
        required_apis:
          ((var.project_id)):
            - compute.googleapis.com
      - source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
        kind: terraform
        id: slurm_controller
        modulename: ""
        use:
          - network1
          - homefs
          - lustrefs
          - debug_partition
          - compute_partition
        wrapsettingswith:
          network_storage:
            - flatten(
            - )
          partition:
            - flatten(
            - )
        settings:
          deployment_name: ((var.deployment_name))
          disk_type: pd-standard
          labels:
            ghpc_role: scheduler
          machine_type: n2-standard-2
          network_self_link: ((module.network1.network_self_link))
          network_storage:
            - ((module.homefs.network_storage))
            - ((module.lustrefs.network_storage))
          partition:
            - ((module.debug_partition.partition))
            - ((module.compute_partition.partition))
          project_id: ((var.project_id))
          region: ((var.region))
          source_image_family: schedmd-v5-slurm-22-05-4-ubuntu-2004-lts
          source_image_project: projects/schedmd-slurm-public/global/images/family
          subnetwork_self_link: ((module.network1.subnetwork_self_link))
          zone: ((var.zone))
        required_apis:
          ((var.project_id)):
            - compute.googleapis.com
            - iam.googleapis.com
            - pubsub.googleapis.com
            - secretmanager.googleapis.com
      - source: community/modules/scheduler/schedmd-slurm-gcp-v5-login
        kind: terraform
        id: slurm_login
        modulename: ""
        use:
          - network1
          - slurm_controller
        wrapsettingswith: {}
        settings:
          controller_instance_id: ((module.slurm_controller.controller_instance_id))
          deployment_name: ((var.deployment_name))
          disable_login_public_ips: false
          disk_type: pd-standard
          labels:
            ghpc_role: scheduler
          machine_type: n2-standard-2
          network_self_link: ((module.network1.network_self_link))
          project_id: ((var.project_id))
          region: ((var.region))
          source_image_family: schedmd-v5-slurm-22-05-4-ubuntu-2004-lts
          source_image_project: projects/schedmd-slurm-public/global/images/family
          subnetwork_self_link: ((module.network1.subnetwork_self_link))
          zone: ((var.zone))
        required_apis:
          ((var.project_id)):
            - compute.googleapis.com
      - source: modules/monitoring/dashboard
        kind: terraform
        id: hpc_dashboard
        modulename: ""
        use: []
        wrapsettingswith: {}
        outputs:
          - instructions
        settings:
          deployment_name: ((var.deployment_name))
          project_id: ((var.project_id))
        required_apis:
          ((var.project_id)):
            - stackdriver.googleapis.com
    kind: terraform
terraform_backend_defaults:
  type: ""
  configuration: {}

Output and logs

output of sudo journalctl -o cat -u google-startup-scripts from the controller node

Starting startup scripts (version 20220622.00-0ubuntu2~20.04.0).
No startup scripts to run.
Starting Google Compute Engine Startup Scripts...
google-startup-scripts.service: Succeeded.
Finished Google Compute Engine Startup Scripts.
Starting startup scripts (version 20220622.00-0ubuntu2~20.04.0).
No startup scripts to run.
Starting Google Compute Engine Startup Scripts...
google-startup-scripts.service: Succeeded.
Finished Google Compute Engine Startup Scripts.
Starting startup scripts (version 20220622.00-0ubuntu2~20.04.0).
Found startup-script in metadata.
startup-script: ping -q -w1 -c1 metadata.google.internal
startup-script: Successfully contacted metadata server
startup-script: ping -q -w1 -c1 8.8.8.8
startup-script: Internet access detected
startup-script: curl: (22) The requested URL returned error: 404 Not Found
startup-script: slurmlustr-slurm-devel not found in project metadata, skipping script update
startup-script: running python cluster setup script
Starting Google Compute Engine Startup Scripts...
startup-script: INFO:googleapiclient.discovery_cache:file_cache is only supported with oauth2client<4.0.0
startup-script: ERROR:__main__:config file not found: /slurm/scripts/config.yaml
startup-script: WARNING:__main__:/slurm/scripts/config.yaml not found
startup-script: INFO:googleapiclient.discovery_cache:file_cache is only supported with oauth2client<4.0.0
startup-script: ERROR: Error while getting metadata from http://metadata.google.internal/computeMetadata/v1/project/attributes/slurmlustr-slurm-devel
startup-script: INFO: Setting up controller
startup-script: INFO: installing custom scripts:
startup-script: DEBUG: compute_service: Using version=v1 of Google Compute Engine API
startup-script: INFO: Munge key already exists. Skipping key generation.
startup-script: INFO: Set up network storage
startup-script: INFO: Setting up mount (nfs) 10.126.75.90:/nfsshare to /home
startup-script: INFO: Setting up mount (lustre) 10.0.0.59@tcp:/exacloud to /mnt/lustre
startup-script: DEBUG: <module>: disabling prometheus support
startup-script: Traceback (most recent call last):
startup-script:   File "/usr/local/lib/python3.8/dist-packages/more_executors/_impl/metrics/__init__.py", line 15, in <module>
startup-script:     from .prometheus import PrometheusMetrics
startup-script:   File "/usr/local/lib/python3.8/dist-packages/more_executors/_impl/metrics/prometheus.py", line 3, in <module>
startup-script:     import prometheus_client  # pylint: disable=import-error
startup-script: ModuleNotFoundError: No module named 'prometheus_client'
startup-script: INFO: Waiting for '/home' to be mounted...
startup-script: INFO: Waiting for '/mnt/lustre' to be mounted...
startup-script: ERROR: mount of path '/mnt/lustre' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/mnt/lustre']' returned non-zero exit status 19.
startup-script: INFO: Waiting for '/mnt/lustre' to be mounted...
startup-script: ERROR: mount of path '/mnt/lustre' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/mnt/lustre']' returned non-zero exit status 19.
startup-script: INFO: Mount point '/home' was mounted.
startup-script: INFO: Waiting for '/mnt/lustre' to be mounted...
startup-script: ERROR: mount of path '/mnt/lustre' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/mnt/lustre']' returned non-zero exit status 19.
startup-script: INFO: Waiting for '/mnt/lustre' to be mounted...

# The failed mounting of /mnt/lustre message repeats 

startup-script: ERROR: CalledProcessError:
startup-script:     command=['mount', '/mnt/lustre']
startup-script:     returncode=19
startup-script:     stdout:
startup-script:
startup-script:     stderr:
startup-script: mount.lustre: mount 10.0.0.59@tcp:/exacloud at /mnt/lustre failed: No such device
startup-script: Are the lustre modules loaded?
startup-script: Check /etc/modprobe.conf and /proc/filesystems
startup-script:
startup-script: ERROR: Aborting setup...
startup-script exit status 0
Finished running startup scripts.
google-startup-scripts.service: Succeeded.
Finished Google Compute Engine Startup Scripts.
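The "No such device" / "Are the lustre modules loaded?" messages above suggest the Lustre client kernel modules are missing on the image. A small diagnostic sketch, assuming shell access to the controller or login node:

lsmod | grep lustre || echo "lustre module not loaded"
modinfo lustre 2>/dev/null || echo "lustre client modules not installed for kernel $(uname -r)"
grep lustre /proc/filesystems || echo "lustre not registered as a filesystem"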

Execution environment

  • OS: macOS
  • Shell: bash
  • go version: go1.19.3 darwin/amd64

ghpc deploy ends up in bad state when instance creation fails due to transient problem

TL;DR: "ghpc deploy" fails to notice when instance creation fails due to a temporary issue for some of the nodes. Running "ghpc deploy" again doesn't fix the problem.

  1. How should cluster be redeployed in such setting?
  2. How would I get extra logs or know there's a problem?

For instance, I launched "ghpc deploy alpha3" using the blueprint here, and it only launched some of the nodes:

alpha3-controller                 asia-southeast1-c  c2-standard-4                10.0.0.5      34.16.15.10  RUNNING
alpha3-login-3sa73t1e-001         asia-southeast1-c  n2-standard-8                10.0.0.6                      RUNNING
alpha3-tiny-ghpc-0                asia-southeast1-c  n2-standard-2                10.0.0.2                      RUNNING
alpha3-ultra-ghpc-14              asia-southeast1-c  a2-ultragpu-8g               10.0.0.1                      RUNNING

For the problem machines, the node creation command failed with the message VM_MIN_COUNT_NOT_REACHED,TIMEOUT:

resourceName: "projects/contextual-research-common/zones/asia-southeast1-c/instances/alpha3-ultra-ghpc-16"
serviceName: "compute.googleapis.com"
status: {
code: 13
message: "VM_MIN_COUNT_NOT_REACHED,TIMEOUT"

This is the terraform directory created by ghpc toolkit.
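For question 2, a hedged sketch of where to look on the controller node; the log paths are assumptions for Slurm-GCP v5 images:

sinfo -R                                      # nodes marked down and the recorded reason
sudo tail -n 100 /var/log/slurm/resume.log    # per-node creation attempts and API errors
sudo tail -n 100 /var/log/slurm/slurmctld.log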

Getting "InstanceTaxonomies are not compatible for creating instance."

ghpc deploy fails to launch machines with "InstanceTaxonomies are not compatible for creating instance." How do I troubleshoot this?

  • Step 1 blueprint: simple cluster with a tiny partition

  • Step 2 blueprint: add a c3 partition with the c3-standard-4 node type, which fails to bring up machines with the "InstanceTaxonomies are not compatible for creating instance" error

[2023-07-11T21:15:05.875] update_node: node dcluster1-c3-ghpc-0 state set to DOWN
[2023-07-11T21:15:05.875] update_node: node dcluster1-c3-ghpc-1 reason set to: [pd-standard] features and [instance_type: VIRTUAL_MACHINE
family: COMPUTE_OPTIMIZED
generation: GEN_3
cpu_vendor: INTEL
architecture: X86_64
] InstanceTaxonomies are not compatible for creating instance.

Here's the complete log

{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "status": {
      "code": 3,
      "message": "[pd-standard] features and [instance_type: VIRTUAL_MACHINE\nfamily: COMPUTE_OPTIMIZED\ngeneration: GEN_3\ncpu_vendor: INTEL\narchitecture: X86_64\n] InstanceTaxonomies are not compatible for creating instance."
    },
    "authenticationInfo": {
      "principalEmail": "[email protected]",
      "serviceAccountDelegationInfo": [
        {
          "firstPartyPrincipal": {
            "principalEmail": "[email protected]"
          }
        }
      ],
      "principalSubject": "serviceAccount:[email protected]"
    },
    "requestMetadata": {
      "callerIp": "34.142.203.74",
      "callerSuppliedUserAgent": "Slurm_GCP_Scripts/1.5 (GPN:SchedMD) (gzip),gzip(gfe)",
      "callerNetwork": "//compute.googleapis.com/projects/contextual-research-common/global/networks/__unknown__",
      "requestAttributes": {
        "time": "2023-07-11T21:03:06.725312Z",
        "auth": {}
      },
      "destinationAttributes": {}
    },
    "serviceName": "compute.googleapis.com",
    "methodName": "v1.compute.regionInstances.bulkInsert",
    "authorizationInfo": [
      {
        "permission": "compute.instances.create",
        "granted": true,
        "resourceAttributes": {
          "service": "compute",
          "name": "projects/contextual-research-common/zones/asia-southeast1-c/instances/unusedName",
          "type": "compute.instances"
        }
      },
      {
        "permission": "compute.disks.create",
        "granted": true,
        "resourceAttributes": {
          "service": "compute",
          "name": "projects/contextual-research-common/zones/asia-southeast1-c/disks/unusedName",
          "type": "compute.disks"
        }
      },
      {
        "permission": "compute.disks.setLabels",
        "granted": true,
        "resourceAttributes": {
          "service": "compute",
          "name": "projects/contextual-research-common/zones/asia-southeast1-c/disks/unusedName",
          "type": "compute.disks"
        }
      },
      {
        "permission": "compute.subnetworks.use",
        "granted": true,
        "resourceAttributes": {
          "service": "compute",
          "name": "projects/contextual-research-common/regions/asia-southeast1/subnetworks/contextual-a100-primary-subnet",
          "type": "compute.subnetworks"
        }
      },
      {
        "permission": "compute.instances.setMetadata",
        "granted": true,
        "resourceAttributes": {
          "service": "compute",
          "name": "projects/contextual-research-common/zones/asia-southeast1-c/instances/unusedName",
          "type": "compute.instances"
        }
      },
      {
        "permission": "compute.instances.setTags",
        "granted": true,
        "resourceAttributes": {
          "service": "compute",
          "name": "projects/contextual-research-common/zones/asia-southeast1-c/instances/unusedName",
          "type": "compute.instances"
        }
      },
      {
        "permission": "compute.instances.setLabels",
        "granted": true,
        "resourceAttributes": {
          "service": "compute",
          "name": "projects/contextual-research-common/zones/asia-southeast1-c/instances/unusedName",
          "type": "compute.instances"
        }
      },
      {
        "permission": "compute.instances.setServiceAccount",
        "granted": true,
        "resourceAttributes": {
          "service": "compute",
          "name": "projects/contextual-research-common/zones/asia-southeast1-c/instances/unusedName",
          "type": "compute.instances"
        }
      }
    ],
    "resourceName": "projects/contextual-research-common/regions/asia-southeast1/instances",
    "request": {
      "count": "2",
      "sourceInstanceTemplate": "https://www.googleapis.com/compute/v1/projects/contextual-research-common/global/instanceTemplates/dcluster1-compute-c3-ghpc-20230711210104409100000001",
      "instanceProperties": {
        "networkInterfaces": [
          {
            "subnetwork": "https://www.googleapis.com/compute/v1/projects/contextual-research-common/regions/asia-southeast1/subnetworks/contextual-a100-primary-subnet"
          }
        ],
        "disks": [
          {
            "type": "PERSISTENT",
            "mode": "READ_WRITE",
            "deviceName": "persistent-disk-0",
            "boot": true,
            "initializeParams": {
              "sourceImage": "projects/schedmd-slurm-public/global/images/family/slurm-gcp-5-7-hpc-centos-7",
              "diskSizeGb": "120",
              "diskType": "pd-standard",
              "labels": [
                {
                  "key": "ghpc_deployment",
                  "value": "dcluster1"
                },
                {
                  "key": "slurm_instance_role",
                  "value": "compute"
                },
                {
                  "key": "ghpc_blueprint",
                  "value": "dcluster"
                },
                {
                  "key": "ghpc_module",
                  "value": "schedmd-slurm-gcp-v5-node-group"
                },
                {
                  "key": "ghpc_role",
                  "value": "compute"
                },
                {
                  "key": "slurm_cluster_name",
                  "value": "dcluster1"
                }
              ]
            },
            "autoDelete": true,
            "interface": "SCSI"
          }
        ],
        "labels": [
          {
            "key": "ghpc_role",
            "value": "compute"
          },
          {
            "key": "ghpc_blueprint",
            "value": "dcluster"
          },
          {
            "key": "ghpc_deployment",
            "value": "dcluster1"
          },
          {
            "key": "ghpc_module",
            "value": "schedmd-slurm-gcp-v5-node-group"
          },
          {
            "key": "slurm_instance_role",
            "value": "compute"
          },
          {
            "key": "slurm_cluster_name",
            "value": "dcluster1"
          }
        ]
      },
      "locationPolicy": {
        "locations": [
          {
            "key": "zones/asia-southeast1-a",
            "value": {
              "preference": "DENY"
            }
          },
          {
            "key": "zones/asia-southeast1-b",
            "value": {
              "preference": "DENY"
            }
          }
        ],
        "targetShape": "ANY_SINGLE_ZONE"
      },
      "perInstanceProperties": [
        {
          "key": "dcluster1-c3-ghpc-1"
        },
        {
          "key": "dcluster1-c3-ghpc-0"
        }
      ],
      "@type": "type.googleapis.com/compute.regionInstances.bulkInsert"
    },
    "response": {
      "error": {
        "errors": [
          {
            "domain": "global",
            "reason": "badRequest",
            "message": "[pd-standard] features and [instance_type: VIRTUAL_MACHINE\nfamily: COMPUTE_OPTIMIZED\ngeneration: GEN_3\ncpu_vendor: INTEL\narchitecture: X86_64\n] InstanceTaxonomies are not compatible for creating instance."
          }
        ],
        "code": 400,
        "message": "[pd-standard] features and [instance_type: VIRTUAL_MACHINE\nfamily: COMPUTE_OPTIMIZED\ngeneration: GEN_3\ncpu_vendor: INTEL\narchitecture: X86_64\n] InstanceTaxonomies are not compatible for creating instance."
      },
      "@type": "type.googleapis.com/error"
    },
    "resourceLocation": {
      "currentLocations": [
        "asia-southeast1"
      ]
    }
  },
  "insertId": "bnmp07dhluy",
  "resource": {
    "type": "audited_resource",
    "labels": {
      "project_id": "contextual-research-common",
      "method": "compute.regionInstances.bulkInsert",
      "service": "compute.googleapis.com"
    }
  },
  "timestamp": "2023-07-11T21:03:04.635953Z",
  "severity": "ERROR",
  "logName": "projects/contextual-research-common/logs/cloudaudit.googleapis.com%2Factivity",
  "receiveTimestamp": "2023-07-11T21:03:07.345757898Z"
}
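The error pairs a pd-standard boot disk with the GEN_3 COMPUTE_OPTIMIZED (C3) family, which suggests that disk type is not offered for C3 machines. A hedged troubleshooting sketch; the blueprint change is an assumption:

gcloud compute disk-types list --zones=asia-southeast1-c
# If pd-standard turns out to be unsupported for C3, switching the node group's
# disk_type in the blueprint (for example to pd-balanced or pd-ssd) is the
# likely fix.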

Compact Placement Policy not cleaned up

It looks like Slurm GCP's suspend.py doesn't clean up the placement policies from previous runs.

Once the maximum number of placement policies allowed by the quota is reached, resume.py will report an error scheduling new jobs until all the leftover placement policies are gone.

2022-03-22 09:49:51,783 3921 47306125384000 resume.py ERROR:  placement group operation failed: Quota 'AFFINITY_GROUPS' exceeded.  Limit: 10.0 in region europe-west4.

System details follow.

$ sinfo -l
Tue Mar 22 12:53:04 2022
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE NODELIST
compute*     up   infinite 1-infinite   no EXCLUSIV        all      4       idle~ slurm-hpc-slurm-compute-0-[0-3]
$ squeue -l
Tue Mar 22 12:53:07 2022
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
$ sacct
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
2            hello_wor+    compute                   112  COMPLETED      0:0 
2.batch           batch                               56  COMPLETED      0:0 
2.0          mpi_hello+                              112  COMPLETED      0:0 
3              hostname    compute                   112 CANCELLED+      0:0 
4              hostname    compute                   112  COMPLETED      0:0 
5              test_mpi    compute                   168  COMPLETED      0:0 
5.batch           batch                               56  COMPLETED      0:0 
5.0           hello.mpi                              168  COMPLETED      0:0 

$ gcloud compute resource-policies list
NAME: slurm-hpc-slurm-4-1
DESCRIPTION:
REGION: https://www.googleapis.com/compute/v1/projects/hpc/regions/europe-west4
CREATION_TIMESTAMP: 2022-03-22T02:55:56.952-07:00

NAME: slurm-hpc-slurm-5-1
DESCRIPTION:
REGION: https://www.googleapis.com/compute/v1/projects/hpc/regions/europe-west4
CREATION_TIMESTAMP: 2022-03-22T05:43:40.077-07:00

Enabling the debug logs confirms the behavior: for some reason (apologies, I don't have the time to troubleshoot this issue) suspend.py doesn't fully execute delete_placement_groups.
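A hedged cleanup sketch to free the AFFINITY_GROUPS quota in the meantime; the name filter matches the leftover policies listed above and is an assumption about your naming:

# Only safe for leftover policies that are no longer attached to instances.
gcloud compute resource-policies list \
  --filter="name~'^slurm-hpc-slurm-'" --format="value(name)" \
  | xargs -r -n1 gcloud compute resource-policies delete --region=europe-west4 --quiet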

VPC subnet definitions are overly restrictive

The current requirements for a VPC created through resources/network/vpc require that each subnet's address range be a sub-block of the network_address_range parameter. This requirement comes from specifying subnets via a new_bits parameter, with the Terraform module using the cidrsubnets function.

These requirements remove the ability for a user to precisely define the subnets and address ranges they wish to use, and require all created subnets to have a common prefix. This prevents (admittedly obtuse) situations like subnets in a VPC having completely different address ranges (mixing 10. with 192.168., etc.).

I would like to see the ability to specify each subnet that should be created by explicitly defining its CIDR, rather than setting the network_address_range and new_bits variables.
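For illustration, the current behavior forces every subnet to be carved out of a single parent range, e.g. in terraform console (values here are hypothetical):

$ terraform console
> cidrsubnets("10.0.0.0/16", 4, 4)
tolist([
  "10.0.0.0/20",
  "10.0.16.0/20",
])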

Error: Error creating Address: googleapi: Error 409: The resource already exists

Hi,

I'm not sure if this is the right place to provide feedback for hpc-toolkit, but I observe an issue and would like to get some assistance on that. Please feel free to close this ticket and point me to the right contact if you like.

I was able to create clusters using a slightly tweaked version of the hpc-cluster-small.yaml. I always destroyed clusters after I created them and this process has worked without any issues.

Then I wanted to add a startup script to automate the installation of required packages for my benchmark tests, but my initial attempt ended with a "no such file" error. I placed my script under ~/hpc-toolkit/resources/scripts/startup-script/examples and followed the example here, but it did not work, although I tried a few options, for instance with/without modules, etc. So my first question is: how does the source option work? I've read the explanation in the README file, but it's still not very clear to me.

The second problem is the actual reason why I opened this ticket. To work around the "no such file" problem temporarily, I decided to use the content option. Now I don't get the "no such file" error anymore, as expected; however, I cannot create clusters because of the following issues:

│ Error: Error creating Address: googleapi: Error 409: The resource '…' already exists, alreadyExists
│
│   with module.network1.module.nat_ip_addresses["us-central1"].google_compute_address.ip[0],
│   on .terraform/modules/network1.nat_ip_addresses/main.tf line 68, in resource "google_compute_address" "ip":
│   68: resource "google_compute_address" "ip" {
│
╵
╷
│ Error: Error creating Address: googleapi: Error 409: The resource 'projects/hpc-in-the-cloud-10697259/regions/us-central1/addresses/hpc-slurm-small-intel-net-nat-ips-us-central1-1' already exists, alreadyExists
│
│   with module.network1.module.nat_ip_addresses["us-central1"].google_compute_address.ip[1],
│   on .terraform/modules/network1.nat_ip_addresses/main.tf line 68, in resource "google_compute_address" "ip":
│   68: resource "google_compute_address" "ip" {
│
╵
╷
│ Error: Error creating Network: googleapi: Error 409: The resource 'projects/hpc-in-the-cloud-10697259/global/networks/hpc-slurm-small-intel-net' already exists, alreadyExists
│
│   with module.network1.module.vpc.module.vpc.google_compute_network.network,
│   on .terraform/modules/network1.vpc/modules/vpc/main.tf line 20, in resource "google_compute_network" "network":
│   20: resource "google_compute_network" "network" {
│

I guess terraform destroy should be extended. Is there a quick workaround that I can use in the meantime?
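For reference, a hedged sketch of manually removing the leftover resources named in the errors above; deleting the VPC may also require removing its subnets and firewall rules first:

gcloud compute addresses delete hpc-slurm-small-intel-net-nat-ips-us-central1-1 \
  --region=us-central1 --quiet
gcloud compute networks delete hpc-slurm-small-intel-net --quiet
# Alternatively, `terraform import` can adopt the existing resources into the
# Terraform state so a later destroy can manage them.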

Fatih

cloud-hpc-image-public/hpc-centos-7 fails to install google-cloud-storage via pip

Describe the bug

The spack-install module fails on the cloud-hpc-image-public/hpc-centos-7 image during the "Install google cloud storage" Ansible task, because protobuf requires Python '>=3.7' but the running Python is 3.6.8.

According to google-cloud-storage Python == 3.6: the last released version which supported Python 3.6 was google-cloud-storage 2.0.0, released 2022-01-12.

The community example spack-gromacs.yaml seems to work because community/modules/compute/SchedMD-slurm-on-gcp-partition has [email protected] pre-installed.

After digging into this, it might be a bug in the HPC image itself 🤷

It depends on whether a workaround in this repo is worth the hassle, since it would likely involve installing another version of Python or pinning to an outdated library.
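For reference, a minimal pinning sketch for the Python 3.6 case; the protobuf bound is an assumption based on the error below, and google-cloud-storage 2.0.0 is the version named above:

pip3 install 'protobuf<4' 'google-cloud-storage==2.0.0'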

Steps to reproduce

Steps to reproduce the behavior:

  1. Use spack-install community module on cloud-hpc-image-public/hpc-centos-7 image

Expected behavior

Pip installs google-cloud-storage successfully.

Actual behavior

The install fails because the Python version available on CentOS 7 is too old.

Version (ghpc --version)

v1.0.0

Blueprint

If applicable, attach or paste the blueprint YAML used to produce the bug.

---
blueprint_name: spack-buildcache

vars:
  project_id: 
  deployment_name: spack-buildcache
  region: europe-west1
  zone: europe-west1-b

deployment_groups:
- group: primary
  modules:
  - source: modules/network/pre-existing-vpc
    kind: terraform
    id: network1
    
  - source: community/modules/scripts/spack-install
    kind: terraform
    id: spack
    settings:
      install_dir: /sw/spack
      spack_url: https://github.com/spack/spack
      spack_ref: v0.18.0
      spack_cache_url:
      - mirror_name: 'gcs_cache'
        mirror_url: gs://spack-buildcache/linux-centos7
      gpg_keys:
      - type: new
        name: spack-buildcache
        email: [email protected]
      compilers:
      - [email protected] target=x86_64
      packages:
      - cmake%[email protected] target=x86_64
      caches_to_populate:
      - type: mirror
        path: gs://spack-buildcache/linux-centos7

  - source: modules/scripts/startup-script
    kind: terraform
    id: spack-startup
    settings:
      runners:
      - type: shell
        source: modules/startup-script/examples/install_ansible.sh
        destination: install_ansible.sh
      - $(spack.install_spack_deps_runner)
      - $(spack.install_spack_runner)

  - source: modules/compute/vm-instance
    kind: terraform
    id: compute
    use: [network1]
    settings:
      instance_count: 1
      name_prefix: compute
      machine_type: c2-standard-8
      threads_per_core: 2
      instance_image: {
        "family": "hpc-centos-7",
        "project": "cloud-hpc-image-public"
      }
      spot: true
      startup_script: $(spack-startup.startup_script)
      tags: [iap]

Expanded Blueprint

If applicable, please attach or paste the expanded blueprint. The expanded blueprint can be obtained by running ghpc expand your-blueprint.yaml.

Disregard if the bug occurs when running ghpc expand ... as well.

blueprint_name: spack-buildcache
validators:
- validator: test_project_exists
  inputs:
    project_id: ((var.project_id))
- validator: test_region_exists
  inputs:
    project_id: ((var.project_id))
    region: ((var.region))
- validator: test_zone_exists
  inputs:
    project_id: ((var.project_id))
    zone: ((var.zone))
- validator: test_zone_in_region
  inputs:
    project_id: ((var.project_id))
    region: ((var.region))
    zone: ((var.zone))
validation_level: 1
vars:
  deployment_name: spack-buildcache
  labels:
    ghpc_blueprint: spack-buildcache
    ghpc_deployment: spack-buildcache
  project_id: MY_PROJECT
  region: europe-west1
  zone: europe-west1-b
deployment_groups:
- group: primary
  terraform_backend:
    type: ""
    configuration: {}
  modules:
  - source: modules/network/pre-existing-vpc
    kind: terraform
    id: network1
    modulename: ""
    use: []
    wrapsettingswith: {}
    settings:
      project_id: ((var.project_id))
      region: ((var.region))
  - source: community/modules/scripts/spack-install
    kind: terraform
    id: spack
    modulename: ""
    use: []
    wrapsettingswith: {}
    settings:
      caches_to_populate:
      - path: gs://spack-buildcache/linux-centos7
        type: mirror
      compilers:
      - [email protected] target=x86_64
      gpg_keys:
      - email: [email protected]
        name: spack-buildcache
        type: new
      install_dir: /sw/spack
      packages:
      - cmake%[email protected] target=x86_64
      project_id: ((var.project_id))
      spack_cache_url:
      - mirror_name: gcs_cache
        mirror_url: gs://spack-buildcache/linux-centos7
      spack_ref: v0.18.0
      spack_url: https://github.com/spack/spack
      zone: ((var.zone))
  - source: modules/scripts/startup-script
    kind: terraform
    id: spack-startup
    modulename: ""
    use: []
    wrapsettingswith: {}
    settings:
      deployment_name: ((var.deployment_name))
      labels:
        ghpc_role: scripts
      project_id: ((var.project_id))
      region: ((var.region))
      runners:
      - destination: install_ansible.sh
        source: modules/startup-script/examples/install_ansible.sh
        type: shell
      - ((module.spack.install_spack_deps_runner))
      - ((module.spack.install_spack_runner))
  - source: modules/compute/vm-instance
    kind: terraform
    id: compute
    modulename: ""
    use:
    - network1
    wrapsettingswith: {}
    settings:
      deployment_name: ((var.deployment_name))
      instance_count: 1
      instance_image:
        family: hpc-centos-7
        project: cloud-hpc-image-public
      labels:
        ghpc_role: compute
      machine_type: c2-standard-8
      name_prefix: compute
      network_self_link: ((module.network1.network_self_link))
      project_id: ((var.project_id))
      spot: true
      startup_script: ((module.spack-startup.startup_script))
      subnetwork_self_link: ((module.network1.subnetwork_self_link))
      tags:
      - iap
      threads_per_core: 2
      zone: ((var.zone))
  kind: terraform
terraform_backend_defaults:
  type: ""
  configuration: {}

Output and logs

Jun 15 14:58:18 compute-0 google_metadata_script_runner[1135]: startup-script: TASK [Install google cloud storage] ********************************************
Jun 15 14:58:18 compute-0 ansible-pip[1965]: Invoked with virtualenv=None extra_args=None virtualenv_command=virtualenv chdir=None requirements=None name=['google-cloud-storage'] virtualenv_python=None umask=None editable=False executable=pip3 virtualenv_site_packages=False state=present version=None
Jun 15 14:58:21 compute-0 google_metadata_script_runner[1135]: startup-script: fatal: [localhost]: FAILED! => {"changed": false, "cmd": ["/usr/bin/pip3", "install", "google-cloud-storage"], "msg": "stdout: Collecting google-cloud-storage\n  Downloading https://files.pythonhosted.org/packages/82/b9/c31cfed0024c5929f0d27d13e2879d8ed9c67d37b0a85cb72de8dc3a6fa5/google_cloud_storage-2.0.0-py2.py3-none-any.whl (106kB)\nCollecting google-api-core<3.0dev,>=1.29.0 (from google-cloud-storage)\n  Downloading https://files.pythonhosted.org/packages/98/e8/2e71f021fd86361f0aabcf8644929f9041c886a52d55f18e6fe12b2e3780/google_api_core-2.8.1-py3-none-any.whl (114kB)\nCollecting google-cloud-core<3.0dev,>=1.6.0 (from google-cloud-storage)\n  Downloading https://files.pythonhosted.org/packages/ac/4f/f011ffb5f00d78630e032c27ad0650a3103982d53b17618b2c9a6950686b/google_cloud_core-2.3.1-py2.py3-none-any.whl\nCollecting requests<3.0.0dev,>=2.18.0 (from google-cloud-storage)\n  Downloading https://files.pythonhosted.org/packages/2d/61/08076519c80041bc0ffa1a8af0cbd3bf3e2b62af10435d269a9d0f40564d/requests-2.27.1-py2.py3-none-any.whl (63kB)\nCollecting google-auth<3.0dev,>=1.25.0 (from google-cloud-storage)\n  Downloading https://files.pythonhosted.org/packages/b3/33/13d090c8f70ff50425426802d100f944dce17a00654706e8c0584b3efc8f/google_auth-2.8.0-py2.py3-none-any.whl (164kB)\nCollecting google-resumable-media>=1.3.0 (from google-cloud-storage)\n  Downloading https://files.pythonhosted.org/packages/02/a3/19447ef22fdaccf773c395add9d200a6dacba3d39742d9ede0cc67c51874/google_resumable_media-2.3.3-py2.py3-none-any.whl (76kB)\nCollecting protobuf (from google-cloud-storage)\n  Downloading https://files.pythonhosted.org/packages/6c/be/4e32d02bf08b8f76bf6e59f2a531690c1e4264530404501f3489ca975d9a/protobuf-4.21.0-py2.py3-none-any.whl (164kB)\n\n:stderr: WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.\nprotobuf requires Python '>=3.7' but the running Python is 3.6.8\n"}

Execution environment

  • OS: ubuntu 22.04
  • bash
  • go version: 1.18.1

Additional context

Great job with this toolkit! 🥳

Slurm setup fails in deployed blueprint - possible error getting metadata

Describe the bug

Slurm setup fails after deploying the blueprint. It's an adapted version of the slurm-enterprise example. There are a series of errors related to getting metadata information (e.g. requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://metadata.google.internal/computeMetadata/v1/instance/attributes/slurm_bucket_path)

Steps to reproduce

  1. Deploy the attached blueprint with hpc-toolkit

Version (ghpc --version)

v1.21.0

Blueprint

blueprint.yaml.gz

Expanded Blueprint

expanded.yaml.gz

Output and logs

ghpc_deploy.log.gz
setup.log located in /slurm/scripts/.
messages.gz /var/log/messages from the controller.
router.json.gz the Cloud NAT router config.


Execution environment

  • OS: Debian SID
  • Shell (To find this, run ps -p $$): bash
  • Conda environment: environment.yaml.gz

Missing icons in HPCTKFE

Describe the bug

Icons in the UI are not being presented.

Steps to reproduce

Steps to reproduce the behavior:

  1. Installed the FETK from the Git repo. I have completed this twice with the same results.

Expected behavior

UI Icons to appear.

Actual behavior

No icons in UI

Version (ghpc --version)

1.13.0

Screenshot: [image not included]

General comment

I noticed a few things while running Slurm jobs with the toolkit.

When setting up the compute partitions with n2 instances, I get:

  • "Unsupported placement policy configuration. Please utilize c2 machine type."

Also, each time I need to run make before running ./ghpc create --config, or it picks up the older config version. I'm not sure where in the code that happens; it seems like it uses some temp folder each time the ghpc make runs.

suspend_time does it really work

Disclaimer: I may be wrong with my noob Slurm understanding.

I was under the impression that the suspend_time facility would keep the compute nodes around for some time after the previous job completes. Instead, the nodes are immediately reclaimed.

  # Slurm Controller and Scheduler
  - source: resources/third-party/scheduler/SchedMD-slurm-on-gcp-controller
    kind: terraform
    id: slurm_controller
    use:
    - network1
    - homefs
    - compute_partition
    settings:
      login_node_count: 1
      boot_disk_type: "pd-ssd"
      boot_disk_size: 100
      controller_machine_type: "t2d-standard-2"
      suspend_time: 300

See the following logs

==> resume.log <==
2022-03-22 12:59:55,261 17116 47376419211584 resume.py INFO: done adding instances: slurm-hpc-slurm-compute-0-[0-1] 6
==> suspend.log <==
2022-03-22 13:00:44,414 17230 46979050408256 suspend.py DEBUG: done deleting instances
2022-03-22 13:00:44,414 17230 46979050408256 suspend.py INFO: done deleting nodes:slurm-hpc-slurm-compute-0-[1,0] job_id:6

As you can see from the timestamps, the nodes were deleted as soon as the job was over.

sacct -l (wide output trimmed to the relevant columns)

JobID  JobName   Partition  NTasks  AllocCPUS  Elapsed   State      ExitCode
6      hostname  compute    2       112        00:00:01  COMPLETED  0:0

Is this the expected behavior?
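One small check that may help, assuming shell access to the controller: confirm which SuspendTime slurmctld actually picked up, since resume/suspend behavior follows slurm.conf rather than the blueprint directly.

scontrol show config | grep -i suspend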

ghpc v1.11 cannot read packer/custom-image module

Describe the bug

While trying to create a custom image with Packer, I encountered this error:

2023/01/26 12:04:06 Failed to read packer source directory modules/packer/custom-image

Steps to reproduce

Steps to reproduce the behavior:

  1. ghpc create blueprints/create-dvmdostem-image.yaml --vars project_id=$PROJECTID --vars ssh_username=$USER

Expected behavior

A custom image will be created

Actual behavior

ghpc could not find the source for the packer/custom-image module

Version (ghpc --version)

ghpc version v1.11.0
Built from 'main' branch.
Commit info: v1.11.0-0-gd706498b

Blueprint

If applicable, attach or paste the blueprint YAML used to produce the bug.

blueprint_name: create-image

vars:
  project_id:  ## GCP Project ID ##
  ssh_username: ## Local username ##
  deployment_name: create-image
  region: us-central1
  zone: us-central1-c
  lustre_mgs_ip: 10.0.0.218@tcp
  startup_timeouts: 300
  network_name: slurm-gcp-v5-net
  subnetwork_name: slurm-gcp-v5-primary-subnet
  first_image_family: default-mpicc-dvmdostem-slurm
  second_image_family: default-mpicc-dvmdostem-lustre-slurm

deployment_groups:
- group: copy-image
  modules:
  - id: copy-image
    source: modules/packer/custom-image
    kind: packer
    settings:
      omit_external_ip: false
      disk_size: 32
      source_image_project_id: [schedmd-slurm-public]
      source_image_family: schedmd-v5-slurm-22-05-6-ubuntu-2004-lts
      image_family: schedmd-v5-slurm-22-05-6-ubuntu-2004-lts

Execution environment

  • OS: macOS
  • Shell bash
  • go version: go version go1.19.3 darwin/amd64

SLURM 1.20 deployed and having node creation error

Hi,

I deployed version 1.20 and am getting the following error.

[issharif_c_cameco_com@hpcsmall-login-vicyomx9-001 ~]$ srun -N3 hostname
srun: error: Node failure on hpcsmall-debug-ghpc-0
srun: error: Nodes hpcsmall-debug-ghpc-[0-2] are still not ready
srun: error: Something is wrong with the boot of the nodes.
[issharif_c_cameco_com@hpcsmall-login-vicyomx9-001 ~]$

Regards

Originally posted by @sharif-cameco in #1581 (comment)

Update modules to support pd-balanced as available disk type

As an example, the slurmv5 module currently does not support this as an available disk type. A workaround is to remove the validator in the variables.tf file or append 'pd-balanced' to the list, but having this natively supported would be great.

condition = contains(["pd-ssd", "local-ssd", "pd-standard"], var.disk_type)

This would enable use of pd-balanced, which is a nice cost/performance disk, but it is also a core dependency because it is the only boot disk type supported for our new HPC VMs.
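
A minimal sketch of the workaround, assuming the validator quoted above is the only place the value is rejected; the amended condition in variables.tf would be:

condition = contains(["pd-ssd", "pd-balanced", "local-ssd", "pd-standard"], var.disk_type)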

lustreapi.h not useable by ompi, nested redefinition of 'enum llapi_json_types'

Blueprint described here: #756 (comment)

This issue is a duplicate of DDNStorage/exascaler-cloud-terraform#14.

When using DDN's Exascaler storage (Lustre), OMPI cannot be configured with the --with-lustre flag.

Steps to install ompi

mkdir ${MODINSTALLPATH}/ompi && cd ${MODINSTALLPATH}/ompi
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.4.tar.gz
gunzip -c openmpi-4.1.4.tar.gz | tar xf -
cd openmpi-4.1.4
./configure --prefix=${MODINSTALLPATH}/ompi/4.1.4 --enable-mpi-cxx --with-lustre=/usr/src/lustre-client-modules-2.14.0-ddn54/lustre/
# make all install # not run because ./configure fails

The error message I'm seeing (more detail in the DDN issue):
from console:

checking lustre/lustreapi.h presence... yes
configure: WARNING: lustre/lustreapi.h: present but cannot be compiled

from ompi's config.log:

configure:317138: checking lustre/lustreapi.h usability
configure:317138: gcc -c -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -mcx16 -pthread     -I/usr/local/include -I/usr/local/include -I/usr/src/lustre-client-modules-2.14.0-ddn54/lustre/include conftest.c >&5
In file included from conftest.c:779:
/usr/src/lustre-client-modules-2.14.0-ddn54/lustre/include/lustre/lustreapi.h:592:6: error: nested redefinition of 'enum llapi_json_types'
  592 | enum llapi_json_types {

Chrome-remote-desktop support


Steps to reproduce

Steps to reproduce the behavior:

  1. Try to use the blueprint community/examples/slurm-chromedesktop.yaml

ghpc create slurm-chromedesktop.yaml
ghpc deploy slurm-chromedesktop

Expected behavior


A cluster with CRD installed should be created

Actual behavior

Multiple bugs


Failure #1
The source paths for configure-grid-drivers.yml, configure-chrome-desktop.yml, and disable-sleep.yml have changed.

Failure #2

There is an underlying dependency on modules/compute/vm-instance, whose outputs.tf contains:

gcloud compute ssh --zone ${var.zone} ${google_compute_instance.compute_vm[0].name} --project ${var.project_id}

Because this is a dynamic cluster, no node is created at apply time, so google_compute_instance.compute_vm[0] is not defined and you get the error:

│ 36: gcloud compute ssh --zone ${var.zone} ${google_compute_instance.compute_vm[0].name} --project ${var.project_id}
│ ├────────────────
│ │ google_compute_instance.compute_vm is empty tuple
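
As a hedged sketch (not the module's actual code), the output could tolerate an empty tuple by emitting the SSH hint only when at least one static instance exists; the output name below is hypothetical:

output "ssh_command" {
  # hypothetical output name; guard against google_compute_instance.compute_vm
  # being an empty tuple when nodes are created dynamically
  value = length(google_compute_instance.compute_vm) > 0 ? "gcloud compute ssh --zone ${var.zone} ${google_compute_instance.compute_vm[0].name} --project ${var.project_id}" : null
}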

Version (ghpc --version)

ghpc version v1.18.0


SchedMD slurm image does not exist

The documentation suggests this is the most recent slurm image:

schedmd-v5-slurm-22-05-6-hpc-centos-7

But I cannot find it when building a packer image:

* googleapi: Error 404: The resource 'projects/spherical-berm-323321/global/images/family/schedmd-v5-slurm-22-05-6-hpc-centos-7' was not found, notFound
.
.
* googleapi: Error 404: The resource 'projects/ml-images/global/images/family/schedmd-v5-slurm-22-05-6-hpc-centos-7' was not found, notFound

The Packer default hpc-centos-7 image still exists. Is Slurm on GCP still supported?
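
Not part of the original report, but one way to check which Slurm image families are currently published (assuming access to the public schedmd-slurm-public project referenced in the blueprint above):

gcloud compute images list --project schedmd-slurm-public \
    --filter="family~schedmd-v5-slurm" --format="value(family)" | sort -u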

MPI arguments for MPI jobs


Describe the bug

We are currently using Slurm on GCP to run Ansys Fluent, and we tried to use two parameters to optimize MPI jobs: "I_MPI_FABRICS=shm:ofa" and "I_MPI_FALLBACK_DEVICE=disable".

The command we ran is like below:

fluent -cnf=host.txt 3ddp -t448 -g -nm -hidden -affinity=off -mpiopt='-genv I_MPI_PIN=1 -genv FI_VERBS_IFACE=eth0' -i batch.jou > logfile.out

We set the -mpiopt parameter mainly to prevent the scheduler from affecting CPU allocation, and to use eth0 as the interface for exchanging data between MPI processes. However, we saw the following error and the job then hung:
"mpi startup() ofa fabric is not available and fallback fabric is not enabled"

However, if we remove the two parameters mentioned at the beginning, the command works well. Based on this result, the issue seems related to the absence of an InfiniBand device; our understanding, though, is that it should still work without any issue even when no InfiniBand device is present. Is this understanding correct?

Accordingly, we would like to seek help from GCP experts. Could you please clarify this and, more importantly, advise on the best configuration for running Ansys Fluent with Slurm on GCP?
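
Not an official answer, but for illustration: GCE instances have no InfiniBand, so Intel MPI normally has to be pointed at a TCP-capable fabric instead of OFA. A hedged sketch of the usual environment settings follows (the exact variables depend on the Intel MPI version bundled with Fluent):

# hypothetical settings for an Ethernet-only cluster
export I_MPI_FABRICS=shm:tcp    # older Intel MPI (2018 and earlier)
# or, for Intel MPI 2019+ where the OFA/DAPL fabrics were removed:
export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=tcp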


Filesystems - Invalid options in Performance tier drop down menu when creating new Filestore instance

When creating a new Filestore instance, the STANDARD and PREMIUM options still appear in the Performance tier drop-down menu; they should be removed, as they are no longer valid parameters.

Terraform log output

Error: Invalid value for variable

on main.tf line 20, in module "htk-fe-filestore":
20: filestore_tier = "STANDARD"
├────────────────
│ var.filestore_tier is "STANDARD"

The preferred name for STANDARD tier is now BASIC_HDD
https://cloud.google.com/filestore/docs/reference/rest/v1beta1/Tier.

This was checked by the validation rule at
modules/filestore/variables.tf:74,3-13.

Error: Invalid value for variable

on main.tf line 20, in module "htk-fe-filestore":
20: filestore_tier = "STANDARD"
├────────────────
│ var.filestore_tier is "STANDARD"

Allowed values for filestore_tier are
'BASIC_HDD','BASIC_SSD','HIGH_SCALE_SSD','ENTERPRISE'.
https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/filestore_instance#tier
https://cloud.google.com/filestore/docs/reference/rest/v1beta1/Tier.

This was checked by the validation rule at
modules/filestore/variables.tf:82,3-13.
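For reference, a minimal sketch of the corrected call, assuming the same module block shown in the log above (the source path is illustrative only, not the FrontEnd's actual layout):

module "htk-fe-filestore" {
  source         = "./modules/filestore"   # hypothetical path
  filestore_tier = "BASIC_HDD"             # STANDARD is now BASIC_HDD; PREMIUM is now BASIC_SSD
  # ... remaining arguments unchanged ...
}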

C2D Compact Placement Policy not working

Slurm won't allow deploying a compute partition that uses the Compact Placement Policy together with the C2D instance type.

The Slurm error follows:

2022-03-21 10:47:25,970 2945 47377888545088 resume.py ERROR: Unsupported placement policy configuration. Please utilize c2 machine type.

The HPC Toolkit compute partition definition follows:

  # Slurm Compute partition
  - source: resources/third-party/compute/SchedMD-slurm-on-gcp-partition
    kind: terraform
    id: compute_partition
    use:
    - network1
    - homefs
    settings:
      partition_name: compute
      max_node_count: 4
      enable_placement: true
      machine_type: "c2d-highcpu-112"
      compute_disk_type: "pd-balanced"
      image_hyperthreads: false

This happens due to a hardcoded condition in Slurm that allows the Compact Placement Policy to work only with C2 instances:
https://github.com/SchedMD/slurm-gcp/blob/master/scripts/resume.py#L357-L363
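
For illustration only (this is not the slurm-gcp code linked above), the restriction boils down to a machine-family allow-list, so supporting C2D amounts to widening that list:

# hypothetical sketch of the kind of check resume.py performs
ALLOWED_PLACEMENT_FAMILIES = {"c2", "c2d"}  # upstream currently accepts only c2


def placement_supported(machine_type: str) -> bool:
    """Return True if compact placement should be allowed for this machine type."""
    family = machine_type.split("-")[0]
    return family in ALLOWED_PLACEMENT_FAMILIES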

If invalid region is specified, vpc module cannot be destroyed after partial apply

The following code in resources/network/vpc/main.tf

data "google_compute_subnetwork" "primary_subnetwork" {
depends_on = [
module.vpc
]
name = local.network_name
region = local.region
project = local.project_id
}

results in a scenario where a user can:

  1. specify an invalid region
  2. successfully run terraform plan
  3. run terraform apply with partial success (due to invalid region)
  4. then be unable to run terraform destroy to recover

The underlying issue is that the VPC module (the "remote" VPC module, not the one internal to the Toolkit) is a global resource and does not use var.region as an input. It is therefore provisioned without failure. However, the google_compute_subnetwork data source is constructed such that it depends upon module.vpc. This is a bit of a clumsy construction because data sources are, by default, queried early during the plan stage. They will fail if the result of the query has 0 values. Such a failure causes plan to fail and therefore also causes apply and destroy operations to fail.

Code should essentially never be written such that a data source can have 0 results unless you really intend it to cause absolutely everything to come to a screeching halt.

There is also something of an issue here that we are implicitly relying on the behavior of auto_create_subnetworks to create 1 subnetwork per region because the google_compute_subnetwork data source probably also fails if it finds multiple subnetworks in a given region (a perfectly desirable outcome).

My first thought here is to ensure that data sources never use depends_on and to rely on a local variable that searches the keys of module.vpc.subnets (subnet_region/subnet_name) and takes the first one.
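
A minimal sketch of that approach, assuming module.vpc.subnets is keyed by "subnet_region/subnet_name" as described above:

locals {
  # pick the first subnetwork created by the remote VPC module instead of
  # querying a data source that can legitimately return zero results
  primary_subnet_key = sort(keys(module.vpc.subnets))[0]
  primary_subnetwork = module.vpc.subnets[local.primary_subnet_key]
}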

Give a short summary of changes on ghpc deploy/destroy

Right now ghpc output is too verbose. Destroying a cluster like this yaml prints >40 pages of changes (my tmux ran out of buffer at 40 pages).

Ideally there would be a way to see exactly what is being destroyed that fits on a couple of screens.
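
As a stopgap (an assumption about the workflow, since ghpc delegates to Terraform inside each deployment group directory), the plan output can be filtered down to a change summary:

cd <deployment-dir>/<group>   # placeholder for the deployment group directory
terraform plan -destroy -no-color | grep -E '^( *# |Plan:)'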

SSH-dependent Packer provisioners not working with `schedmd-v5-slurm` images.

Describe the bug

When using the image-builder.yaml, if I supply custom shell_scripts arguments to packer build (or an Ansible conf, or anything that needs SSH), I get SSH connection failures. If I replace the schedmd-v5-slurm-22-05-8-hpc-centos-7 image family with something like ubuntu-2210-amd64, SSH works fine. If I don't use SSH-dependent commands, the image build also works fine.

Steps to reproduce

Steps to reproduce the behavior:

  1. Generate the image-builder.yaml build files with ghpc.
  2. touch echo.sh and modify packer/custom-image/defaults.auto.pkrvars.hcl to contain shell_scripts = ["echo.sh"]
  3. Follow the docs to build the image.

Expected behavior

I expect SSH to work and run the shell Packer provisioners.

Actual behavior

SSH does not work.

Version (ghpc --version)

Blueprint

image-builder.yaml

Output and logs

Build 'image-builder-001.googlecompute.toolkit_image' errored after 2 minutes 40 seconds: Packer experienced an authentication error when trying to connect via SSH. This can happen if your username/password are wrong. You may want to double-check your credentials as part of your debugging process. original error: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain


Execution environment

  • OS: [macOS]
  • Shell: [zsh]
  • go version: go version go1.20.2 darwin/arm64

Additional context

Before using hpc-toolkit, I was just using Packer independently but could not get Ansible provisioning + these slurm images to work. I get the same on a very minimal Packer+shell_scripts+schedmd-v5-slurm repo, too (where it also just goes away with plain ubuntu). Any pointers appreciated. Thanks.

HPC toolkit no longer works with a2 instances

Describe the bug

The controller node fails to initialise Slurm with node groups of a2 instance types (e.g. a2-highgpu-8g, where the suffix of the machine type name refers to the number of GPUs rather than the number of CPUs), failing with the following logs:

2023-08-07 12:04:08,247 INFO: Setting up controller
2023-08-07 12:04:08,249 INFO: installing custom scripts: compute.d/ghpc_startup.sh,controller.d/ghpc_startup.sh,partition.d/a2/ghpc_startup.sh,partition.d/dev/ghpc_startup.sh
2023-08-07 12:04:08,249 DEBUG: install_custom_scripts: compute.d/ghpc_startup.sh
2023-08-07 12:04:08,251 DEBUG: install_custom_scripts: controller.d/ghpc_startup.sh
2023-08-07 12:04:08,252 DEBUG: install_custom_scripts: partition.d/a2/ghpc_startup.sh
2023-08-07 12:04:08,254 DEBUG: install_custom_scripts: partition.d/dev/ghpc_startup.sh
2023-08-07 12:04:08,259 DEBUG: compute_service: Using version=v1 of Google Compute Engine API
2023-08-07 12:04:44,512 WARNING: core count in machine type a2-highgpu-8g is not an integer. Default to 1 socket.
2023-08-07 12:04:44,512 ERROR: invalid literal for int() with base 10: '8g'
--
  File "/slurm/scripts/setup.py", line 1071, in setup_controller
    gen_cloud_conf()
  File "/slurm/scripts/setup.py", line 341, in gen_cloud_conf
    content = make_cloud_conf(lkp, cloud_parameters=cloud_parameters)
  File "/slurm/scripts/setup.py", line 330, in make_cloud_conf
    lines = [
  File "/slurm/scripts/setup.py", line 333, in <genexpr>
    *(partitionlines(p, lkp) for p in lkp.cfg.partitions.values()),
  File "/slurm/scripts/setup.py", line 272, in partitionlines
    group_lines = [
  File "/slurm/scripts/setup.py", line 273, in <listcomp>
    node_group_lines(group, part_name, lkp)
  File "/slurm/scripts/setup.py", line 210, in node_group_lines
    machine_conf = lkp.template_machine_conf(node_group.instance_template)
  File "/slurm/scripts/util.py", line 1540, in template_machine_conf
    _div = 2 if getThreadsPerCore(template) == 1 else 1
  File "/slurm/scripts/util.py", line 1117, in getThreadsPerCore
    if not isSmt(template):
  File "/slurm/scripts/util.py", line 1099, in isSmt
    machineTypeCore: int = int(matches["core"])
ValueError: invalid literal for int() with base 10: '8g'
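
For illustration (this is not the actual util.py code), the failure comes from treating the last token of the machine type name as a CPU count, which only holds for families whose names end in a core count:

# works for machine types whose suffix is the vCPU count...
int("c2-standard-60".split("-")[-1])   # -> 60
# ...but a2 names end in the GPU count, so the cast raises the error shown above
int("a2-highgpu-8g".split("-")[-1])    # ValueError: invalid literal for int() with base 10: '8g'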

Related commits:

Steps to reproduce

Steps to reproduce the behavior:

  1. Use an example blueprint that uses a2 instances, e.g. https://github.com/GoogleCloudPlatform/hpc-toolkit/blob/main/examples/ml-slurm.yaml
  2. ./ghpc create ml-slurm.yaml
  3. ./ghpc deploy ml-example
  4. log into the controller and inspect /slurm/scripts/setup.log

Expected behavior

The cluster to successfully initialise

Actual behavior

An uninitialised cluster with no ability to schedule work

Execution environment

  • OS: macOS
  • Shell: zsh
  • go version: 1.20.6

Additional context

Upstream bug:
