Comments (12)

tpdownes commented on August 28, 2024

I believe the log message you reported is spurious. A user named packer is created while building the Slurm image and is mostly, but not completely, removed before the image itself is created. The message comes from the GCE Guest Agent finishing the removal of that user and finding that some of its directories are already missing. It should not affect Slurm boot.

That said, your machine is not joining the pool, so I suggest a closer look at its logs. This command shows the startup script logs:

gcloud logging --project prj-n-005-cloudops-618d read 'logName="projects/prj-n-005-cloudops-618d/logs/GCEMetadataScripts" AND resource.labels.instance_id="3405041953608146457"' --format="table(timestamp, jsonPayload.message)" --freshness 48h | tac

This will show all logs associated with the VM:

gcloud logging --project prj-n-005-cloudops-618d read 'resource.labels.instance_id="3405041953608146457"' --format="table(timestamp, jsonPayload.message)" --freshness 48h | tac

Consider changing 48h to a value appropriate to when the machine was active.
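
The output of these commands can run to hundreds of entries. As a rough sketch, assuming the output has been captured as text in the MESSAGE:/TIMESTAMP: pair layout that gcloud prints in a narrow terminal, a small Python filter can keep only the entries that look like problems. The function names and the keyword list are illustrative assumptions, not part of gcloud or the toolkit.

```python
import re

# Keywords that usually mark a problem; an illustrative list, adjust as needed.
PROBLEM = re.compile(r"error|warning|fail|not found|traceback", re.IGNORECASE)


def parse_entries(text):
    """Pair each 'MESSAGE:' line with the 'TIMESTAMP:' line that follows it."""
    entries = []
    message = None
    for line in text.splitlines():
        if line.startswith("MESSAGE: "):
            message = line[len("MESSAGE: "):]
        elif line.startswith("TIMESTAMP: ") and message is not None:
            entries.append((line[len("TIMESTAMP: "):], message))
            message = None
    return entries


def problem_entries(text):
    """Return (timestamp, message) pairs whose message matches PROBLEM."""
    return [(ts, msg) for ts, msg in parse_entries(text) if PROBLEM.search(msg)]


# A short sample in the same layout as the gcloud output.
sample = """\
MESSAGE: startup-script: Successfully contacted metadata server
TIMESTAMP: 2023-07-13T16:06:36.583101311Z

MESSAGE: startup-script: failed to ping Google DNS, will retry
TIMESTAMP: 2023-07-13T16:06:37.588137365Z

MESSAGE: startup-script: INFO: Mount point '/home' was mounted.
TIMESTAMP: 2023-07-13T16:06:56.358536552Z
"""

for ts, msg in problem_entries(sample):
    print(ts, msg)
```

To use it on real output, redirect the gcloud command to a file and pass the file's contents to problem_entries in place of sample.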

from hpc-toolkit.

tpdownes commented on August 28, 2024

The packer user error has previously been reported to SchedMD (who publish the Slurm image used by your tutorial) and they anticipate resolving it in a near-term release.

sharif-cameco commented on August 28, 2024

issharif_c@cloudshell:~ (prj-n-005-cloudops-618d)$ gcloud logging --project prj-n-005-cloudops-618d read 'logName="projects/prj-n-005-cloudops-618d/logs/GCEMetadataScripts" AND resource.labels.instance_id="3405041953608146457"' --format="table(timestamp, jsonPayload.message)" --freshness 24h | tac
MESSAGE: Starting startup scripts (version 20220713.00).
TIMESTAMP: 2023-07-13T16:06:36.505847771Z

MESSAGE: Found startup-script in metadata.
TIMESTAMP: 2023-07-13T16:06:36.527522434Z

MESSAGE: startup-script: ping -q -w1 -c1 metadata.google.internal
TIMESTAMP: 2023-07-13T16:06:36.542862802Z

MESSAGE: startup-script: Successfully contacted metadata server
TIMESTAMP: 2023-07-13T16:06:36.583101311Z

MESSAGE: startup-script: ping -q -w1 -c1 8.8.8.8
TIMESTAMP: 2023-07-13T16:06:36.583551646Z

MESSAGE: startup-script: failed to ping Google DNS, will retry
TIMESTAMP: 2023-07-13T16:06:37.588137365Z

MESSAGE: startup-script: failed to ping Google DNS, will retry
TIMESTAMP: 2023-07-13T16:06:40.624263840Z

MESSAGE: startup-script: failed to ping Google DNS, will retry
TIMESTAMP: 2023-07-13T16:06:43.627927353Z

MESSAGE: startup-script: failed to ping Google DNS, will retry
TIMESTAMP: 2023-07-13T16:06:46.631542202Z

MESSAGE: startup-script: failed to ping Google DNS, will retry
TIMESTAMP: 2023-07-13T16:06:49.635413099Z

MESSAGE: startup-script: No internet access detected
TIMESTAMP: 2023-07-13T16:06:49.635466895Z

MESSAGE: startup-script: curl: (22) The requested URL returned error: 404 Not Found
TIMESTAMP: 2023-07-13T16:06:49.707748850Z

MESSAGE: startup-script: hpcsmall-slurm-devel not found in project metadata, skipping script update
TIMESTAMP: 2023-07-13T16:06:49.708746575Z

MESSAGE: startup-script: running python cluster setup script
TIMESTAMP: 2023-07-13T16:06:49.710040510Z

MESSAGE: startup-script: INFO:googleapiclient.discovery_cache:file_cache is only supported with oauth2client<4.0.0
TIMESTAMP: 2023-07-13T16:06:52.116983239Z

MESSAGE: startup-script: ERROR:main:config file not found: /slurm/scripts/config.yaml
TIMESTAMP: 2023-07-13T16:06:52.206827654Z

MESSAGE: startup-script: WARNING:main:/slurm/scripts/config.yaml not found
TIMESTAMP: 2023-07-13T16:06:52.207785960Z

MESSAGE: startup-script: INFO:googleapiclient.discovery_cache:file_cache is only supported with oauth2client<4.0.0
TIMESTAMP: 2023-07-13T16:06:52.628300456Z

MESSAGE: startup-script: ERROR: Error while getting metadata from http://metadata.google.internal/computeMetadata/v1/project/attributes/hpcsmall-slurm-devel
TIMESTAMP: 2023-07-13T16:06:52.853972964Z

MESSAGE: startup-script: INFO: Setting up compute
TIMESTAMP: 2023-07-13T16:06:52.857948832Z

MESSAGE: startup-script: INFO: installing custom scripts:
TIMESTAMP: 2023-07-13T16:06:52.861560091Z

MESSAGE: startup-script: INFO: Set up network storage
TIMESTAMP: 2023-07-13T16:06:52.897747198Z

MESSAGE: startup-script: INFO: Setting up mount (nfs) 10.165.1.2:/nfsshare to /home
TIMESTAMP: 2023-07-13T16:06:52.897797956Z

MESSAGE: startup-script: INFO: Setting up mount (nfs) hpcsmall-controller:/usr/local/etc/slurm to /usr/local/etc/slurm
TIMESTAMP: 2023-07-13T16:06:52.897816387Z

MESSAGE: startup-script: INFO: Setting up mount (nfs) hpcsmall-controller:/etc/munge to /etc/munge
TIMESTAMP: 2023-07-13T16:06:52.897829486Z

MESSAGE: startup-script: INFO: Setting up mount (nfs) hpcsmall-controller:/opt/apps to /opt/apps
TIMESTAMP: 2023-07-13T16:06:52.897856954Z

MESSAGE: startup-script: DEBUG: : disabling prometheus support
TIMESTAMP: 2023-07-13T16:06:52.946370917Z

MESSAGE: startup-script: Traceback (most recent call last):
TIMESTAMP: 2023-07-13T16:06:52.946422935Z

MESSAGE: startup-script: File "/usr/local/lib/python3.6/site-packages/more_executors/_impl/metrics/__init__.py", line 15, in <module>
TIMESTAMP: 2023-07-13T16:06:52.946436046Z

MESSAGE: startup-script: from .prometheus import PrometheusMetrics
TIMESTAMP: 2023-07-13T16:06:52.946450616Z

MESSAGE: startup-script: File "/usr/local/lib/python3.6/site-packages/more_executors/_impl/metrics/prometheus.py", line 3, in <module>
TIMESTAMP: 2023-07-13T16:06:52.946462044Z

MESSAGE: startup-script: import prometheus_client # pylint: disable=import-error
TIMESTAMP: 2023-07-13T16:06:52.946475497Z

MESSAGE: startup-script: ModuleNotFoundError: No module named 'prometheus_client'
TIMESTAMP: 2023-07-13T16:06:52.946488674Z

MESSAGE: startup-script: INFO: Waiting for '/home' to be mounted...
TIMESTAMP: 2023-07-13T16:06:53.053302125Z

MESSAGE: startup-script: INFO: Waiting for '/usr/local/etc/slurm' to be mounted...
TIMESTAMP: 2023-07-13T16:06:53.056804451Z

MESSAGE: startup-script: INFO: Waiting for '/etc/munge' to be mounted...
TIMESTAMP: 2023-07-13T16:06:53.063098357Z

MESSAGE: startup-script: INFO: Waiting for '/opt/apps' to be mounted...
TIMESTAMP: 2023-07-13T16:06:53.067950750Z

MESSAGE: startup-script: INFO: Mount point '/opt/apps' was mounted.
TIMESTAMP: 2023-07-13T16:06:53.351928114Z

MESSAGE: startup-script: INFO: Mount point '/etc/munge' was mounted.
TIMESTAMP: 2023-07-13T16:06:53.356839746Z

MESSAGE: startup-script: INFO: Mount point '/usr/local/etc/slurm' was mounted.
TIMESTAMP: 2023-07-13T16:06:53.362189366Z

MESSAGE: startup-script: INFO: Mount point '/home' was mounted.
TIMESTAMP: 2023-07-13T16:06:56.358536552Z
issharif_c@cloudshell:~ (prj-n-005-cloudops-618d)$

sharif-cameco commented on August 28, 2024

I ran the following command:

gcloud logging --project prj-n-005-cloudops-618d read 'resource.labels.instance_id="3405041953608146457"' --format="table(timestamp, jsonPayload.message)" --freshness 24h | tac

The output is too long to post in full, so I am including the parts that I think may contain a clue.

NB: the VM has internet access, but ICMP is blocked.
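
The startup log shows the script pinging 8.8.8.8 and then reporting "No internet access detected", so blocked ICMP alone could plausibly produce that message. A minimal sketch of a reachability check that does not depend on ICMP is a plain TCP connection attempt. The function name tcp_reachable and the example target are assumptions for illustration; this is not what the Slurm startup script does.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return False


if __name__ == "__main__":
    # Assumed target: Google Public DNS over TCP port 53. Pick any host and
    # port that your egress firewall rules are expected to allow.
    print(tcp_reachable("8.8.8.8", 53))
```

Running this on the compute VM helps separate "ICMP is filtered" from "there is no outbound route at all", which the ping-based check cannot distinguish.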

sharif-cameco commented on August 28, 2024

MESSAGE: Jul 13 16:06:30 hpcsmall-debug-ghpc-0 systemd-hostnamed: Changed static host name to 'hpcsmall-debug-ghpc-0'
TIMESTAMP: 2023-07-13T16:06:33.900240788Z

MESSAGE: Jul 13 16:06:30 hpcsmall-debug-ghpc-0 NetworkManager[497]: [1689264390.6854] hostname: hostname changed from "schedmd-v5-slurm-22-05-4-hpc-centos-7-1665675565" to "hpcsmall-debug-ghpc-0"
TIMESTAMP: 2023-07-13T16:06:33.900240936Z

MESSAGE: Jul 13 16:06:30 hpcsmall-debug-ghpc-0 NetworkManager[497]: [1689264390.6855] policy: set-hostname: set hostname to 'hpcsmall-debug-ghpc-0' (from system configuration)
TIMESTAMP: 2023-07-13T16:06:33.900241099Z

MESSAGE: Jul 13 16:06:30 hpcsmall-debug-ghpc-0 nm-dispatcher: req:5 'hostname': new request (4 scripts)
TIMESTAMP: 2023-07-13T16:06:33.900241269Z

MESSAGE: Jul 13 16:06:30 hpcsmall-debug-ghpc-0 systemd-hostnamed: Changed host name to 'hpcsmall-debug-ghpc-0'
TIMESTAMP: 2023-07-13T16:06:33.900241388Z

MESSAGE: Jul 13 16:06:30 hpcsmall-debug-ghpc-0 nm-dispatcher: req:6 'hostname': new request (4 scripts)
TIMESTAMP: 2023-07-13T16:06:33.900241541Z

MESSAGE: Jul 13 16:06:30 hpcsmall-debug-ghpc-0 nm-dispatcher: req:4 'connectivity-change': start running ordered scripts...
TIMESTAMP: 2023-07-13T16:06:33.900241622Z

MESSAGE: Jul 13 16:06:30 hpcsmall-debug-ghpc-0 nm-dispatcher: req:5 'hostname': start running ordered scripts...
TIMESTAMP: 2023-07-13T16:06:33.900241739Z

MESSAGE: Jul 13 16:06:30 hpcsmall-debug-ghpc-0 nm-dispatcher: req:6 'hostname': start running ordered scripts...
TIMESTAMP: 2023-07-13T16:06:33.900241849Z

MESSAGE: Jul 13 16:06:30 hpcsmall-debug-ghpc-0 network: Bringing up loopback interface: [ OK ]
TIMESTAMP: 2023-07-13T16:06:33.900241959Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 network: Bringing up interface eth0: [ OK ]
TIMESTAMP: 2023-07-13T16:06:33.900242069Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Started LSB: Bring up/down networking.
TIMESTAMP: 2023-07-13T16:06:33.900242176Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Reached target Network.
TIMESTAMP: 2023-07-13T16:06:33.900242301Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Starting Dynamic System Tuning Daemon...
TIMESTAMP: 2023-07-13T16:06:33.900242407Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Reached target Network is Online.
TIMESTAMP: 2023-07-13T16:06:33.900242519Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Starting System Logging Service...
TIMESTAMP: 2023-07-13T16:06:33.900242630Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Starting Google Cloud Ops Agent - Logging Agent...
TIMESTAMP: 2023-07-13T16:06:33.900242748Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Starting Google Cloud Ops Agent - Metrics Agent...
TIMESTAMP: 2023-07-13T16:06:33.900242874Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Starting NFS Mount Daemon...
TIMESTAMP: 2023-07-13T16:06:33.900242981Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Starting NFS status monitor for NFSv2/3 locking....
TIMESTAMP: 2023-07-13T16:06:33.900243099Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Started Google OSConfig Agent.
TIMESTAMP: 2023-07-13T16:06:33.900243272Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Reached target Login Prompts.
TIMESTAMP: 2023-07-13T16:06:33.900243429Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Starting Postfix Mail Transport Agent...
TIMESTAMP: 2023-07-13T16:06:33.900243536Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Starting MUNGE authentication service...
TIMESTAMP: 2023-07-13T16:06:33.900243656Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Starting Google Compute Engine Guest Agent...
TIMESTAMP: 2023-07-13T16:06:33.900243863Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 rpc.statd[887]: Version 1.3.0 starting
TIMESTAMP: 2023-07-13T16:06:33.900243985Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 rpc.statd[887]: Flags: TI-RPC
TIMESTAMP: 2023-07-13T16:06:33.900244156Z

MESSAGE: GCE Agent Started (version 20220713.00)
TIMESTAMP: 2023-07-13T16:06:34.298349809Z

MESSAGE: Instance ID changed, running first-boot actions
TIMESTAMP: 2023-07-13T16:06:34.557871751Z

MESSAGE: OSConfig Agent (version 20220824.00-g1.el7) started.
TIMESTAMP: 2023-07-13T16:06:34.728212732Z

MESSAGE: Enabling OS Login
TIMESTAMP: 2023-07-13T16:06:34.946793779Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 rsyslogd: [origin software="rsyslogd" swVersion="8.24.0-57.el7_9.3" x-pid="872" x-info="http://www.rsyslog.com"] start
TIMESTAMP: 2023-07-13T16:06:35.131890351Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Started System Logging Service.
TIMESTAMP: 2023-07-13T16:06:35.131896822Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Starting Google Compute Engine Shutdown Scripts...
TIMESTAMP: 2023-07-13T16:06:35.131897135Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 munged: munged: Error: Failed to check keyfile "/etc/munge/munge.key": No such file or directory
TIMESTAMP: 2023-07-13T16:06:35.131897340Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: munge.service: control process exited, code=exited status=1
TIMESTAMP: 2023-07-13T16:06:35.131897513Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Failed to start MUNGE authentication service.
TIMESTAMP: 2023-07-13T16:06:35.131897708Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Unit munge.service entered failed state.
TIMESTAMP: 2023-07-13T16:06:35.131897871Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: munge.service failed.
TIMESTAMP: 2023-07-13T16:06:35.131898049Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Started Google Compute Engine Shutdown Scripts.
TIMESTAMP: 2023-07-13T16:06:35.131898183Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: 2023/07/13 16:06:31 Built-in config:
TIMESTAMP: 2023-07-13T16:06:35.131898414Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: logging:
TIMESTAMP: 2023-07-13T16:06:35.131898605Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: receivers:
TIMESTAMP: 2023-07-13T16:06:35.131898779Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: syslog:
TIMESTAMP: 2023-07-13T16:06:35.131898951Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: type: files
TIMESTAMP: 2023-07-13T16:06:35.131899127Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: include_paths:
TIMESTAMP: 2023-07-13T16:06:35.131899259Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: - /var/log/messages
TIMESTAMP: 2023-07-13T16:06:35.131899385Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: - /var/log/syslog
TIMESTAMP: 2023-07-13T16:06:35.131899534Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: service:
TIMESTAMP: 2023-07-13T16:06:35.131899641Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: pipelines:
TIMESTAMP: 2023-07-13T16:06:35.131899738Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: default_pipeline:
TIMESTAMP: 2023-07-13T16:06:35.131899879Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: receivers: [syslog]
TIMESTAMP: 2023-07-13T16:06:35.131899996Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: metrics:
TIMESTAMP: 2023-07-13T16:06:35.131900137Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: receivers:
TIMESTAMP: 2023-07-13T16:06:35.131900281Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: hostmetrics:
TIMESTAMP: 2023-07-13T16:06:35.131900404Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: type: hostmetrics
TIMESTAMP: 2023-07-13T16:06:35.131900530Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: collection_interval: 60s
TIMESTAMP: 2023-07-13T16:06:35.131900706Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: processors:
TIMESTAMP: 2023-07-13T16:06:35.131900819Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: metrics_filter:
TIMESTAMP: 2023-07-13T16:06:35.131900986Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: type: exclude_metrics
TIMESTAMP: 2023-07-13T16:06:35.131901164Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: metrics_pattern: []
TIMESTAMP: 2023-07-13T16:06:35.131901276Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: service:
TIMESTAMP: 2023-07-13T16:06:35.131901424Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: pipelines:
TIMESTAMP: 2023-07-13T16:06:35.131901611Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: default_pipeline:
TIMESTAMP: 2023-07-13T16:06:35.131901724Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: receivers: [hostmetrics]
TIMESTAMP: 2023-07-13T16:06:35.131901839Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: processors: [metrics_filter]
TIMESTAMP: 2023-07-13T16:06:35.131901936Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: 2023/07/13 16:06:31 Built-in config:
TIMESTAMP: 2023-07-13T16:06:35.131902059Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: logging:
TIMESTAMP: 2023-07-13T16:06:35.131902184Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: receivers:
TIMESTAMP: 2023-07-13T16:06:35.131902272Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: syslog:
TIMESTAMP: 2023-07-13T16:06:35.131902379Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: type: files
TIMESTAMP: 2023-07-13T16:06:35.131902481Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: include_paths:
TIMESTAMP: 2023-07-13T16:06:35.131902591Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: - /var/log/messages
TIMESTAMP: 2023-07-13T16:06:35.131902706Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: - /var/log/syslog
TIMESTAMP: 2023-07-13T16:06:35.131902820Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: service:
TIMESTAMP: 2023-07-13T16:06:35.131902940Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: pipelines:
TIMESTAMP: 2023-07-13T16:06:35.131903045Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: default_pipeline:
TIMESTAMP: 2023-07-13T16:06:35.131903160Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: receivers: [syslog]
TIMESTAMP: 2023-07-13T16:06:35.131903267Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: metrics:
TIMESTAMP: 2023-07-13T16:06:35.131903369Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: receivers:
TIMESTAMP: 2023-07-13T16:06:35.131903479Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: hostmetrics:
TIMESTAMP: 2023-07-13T16:06:35.131903574Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: type: hostmetrics
TIMESTAMP: 2023-07-13T16:06:35.131903706Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: collection_interval: 60s
TIMESTAMP: 2023-07-13T16:06:35.131903826Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: processors:
TIMESTAMP: 2023-07-13T16:06:35.131903969Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: metrics_filter:
TIMESTAMP: 2023-07-13T16:06:35.131904082Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: type: exclude_metrics
TIMESTAMP: 2023-07-13T16:06:35.131904193Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: metrics_pattern: []
TIMESTAMP: 2023-07-13T16:06:35.131904289Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: service:
TIMESTAMP: 2023-07-13T16:06:35.131904386Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: pipelines:
TIMESTAMP: 2023-07-13T16:06:35.131904489Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: default_pipeline:
TIMESTAMP: 2023-07-13T16:06:35.131904596Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: receivers: [hostmetrics]
TIMESTAMP: 2023-07-13T16:06:35.131904715Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: processors: [metrics_filter]
TIMESTAMP: 2023-07-13T16:06:35.131904851Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: 2023/07/13 16:06:31 Merged config:
TIMESTAMP: 2023-07-13T16:06:35.131904943Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: logging:
TIMESTAMP: 2023-07-13T16:06:35.131905047Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: receivers:
TIMESTAMP: 2023-07-13T16:06:35.131905155Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: mysql_error:
TIMESTAMP: 2023-07-13T16:06:35.131905308Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: type: mysql_error
TIMESTAMP: 2023-07-13T16:06:35.131905440Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: mysql_general:
TIMESTAMP: 2023-07-13T16:06:35.131905587Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: type: mysql_general
TIMESTAMP: 2023-07-13T16:06:35.131906112Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: mysql_slow:
TIMESTAMP: 2023-07-13T16:06:35.131906212Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: type: mysql_slow
TIMESTAMP: 2023-07-13T16:06:35.131906361Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: slurm_resume:
TIMESTAMP: 2023-07-13T16:06:35.131906524Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: type: files
TIMESTAMP: 2023-07-13T16:06:35.131906624Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: include_paths:
TIMESTAMP: 2023-07-13T16:06:35.131906751Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: - /var/log/slurm/resume.log
TIMESTAMP: 2023-07-13T16:06:35.131906892Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: slurm_suspend:
TIMESTAMP: 2023-07-13T16:06:35.131907046Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: type: files
TIMESTAMP: 2023-07-13T16:06:35.131907193Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: include_paths:
TIMESTAMP: 2023-07-13T16:06:35.131907618Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: - /var/log/slurm/suspend.log
TIMESTAMP: 2023-07-13T16:06:35.131907726Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: slurm_sync:
TIMESTAMP: 2023-07-13T16:06:35.131907876Z

MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: type: files
TIMESTAMP: 2023-07-13T16:06:35.131908001Z

sharif-cameco commented on August 28, 2024

MESSAGE: startup-script: failed to ping Google DNS, will retry
TIMESTAMP: 2023-07-13T16:06:40.624263840Z

MESSAGE: Jul 13 16:06:40 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: failed to ping Google DNS, will retry
TIMESTAMP: 2023-07-13T16:06:40.624930279Z

MESSAGE: startup-script: failed to ping Google DNS, will retry
TIMESTAMP: 2023-07-13T16:06:43.627927353Z

MESSAGE: Jul 13 16:06:43 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: failed to ping Google DNS, will retry
TIMESTAMP: 2023-07-13T16:06:43.628501787Z

MESSAGE: startup-script: failed to ping Google DNS, will retry
TIMESTAMP: 2023-07-13T16:06:46.631542202Z

MESSAGE: Jul 13 16:06:46 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: failed to ping Google DNS, will retry
TIMESTAMP: 2023-07-13T16:06:46.632214565Z

MESSAGE: startup-script: failed to ping Google DNS, will retry
TIMESTAMP: 2023-07-13T16:06:49.635413099Z

MESSAGE: startup-script: No internet access detected
TIMESTAMP: 2023-07-13T16:06:49.635466895Z

MESSAGE: Jul 13 16:06:49 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: failed to ping Google DNS, will retry
TIMESTAMP: 2023-07-13T16:06:49.636191831Z

MESSAGE: Jul 13 16:06:49 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: No internet access detected
TIMESTAMP: 2023-07-13T16:06:49.636194602Z

MESSAGE: startup-script: curl: (22) The requested URL returned error: 404 Not Found
TIMESTAMP: 2023-07-13T16:06:49.707748850Z

MESSAGE: Jul 13 16:06:49 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: curl: (22) The requested URL returned error: 404 Not Found
TIMESTAMP: 2023-07-13T16:06:49.708601135Z

MESSAGE: startup-script: hpcsmall-slurm-devel not found in project metadata, skipping script update
TIMESTAMP: 2023-07-13T16:06:49.708746575Z

MESSAGE: Jul 13 16:06:49 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: hpcsmall-slurm-devel not found in project metadata, skipping script update
TIMESTAMP: 2023-07-13T16:06:49.709074158Z

MESSAGE: startup-script: running python cluster setup script
TIMESTAMP: 2023-07-13T16:06:49.710040510Z

MESSAGE: Jul 13 16:06:49 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: running python cluster setup script
TIMESTAMP: 2023-07-13T16:06:49.710338610Z

MESSAGE: startup-script: INFO:googleapiclient.discovery_cache:file_cache is only supported with oauth2client<4.0.0
TIMESTAMP: 2023-07-13T16:06:52.116983239Z

MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO:googleapiclient.discovery_cache:file_cache is only supported with oauth2client<4.0.0
TIMESTAMP: 2023-07-13T16:06:52.117566982Z

MESSAGE: startup-script: ERROR:main:config file not found: /slurm/scripts/config.yaml
TIMESTAMP: 2023-07-13T16:06:52.206827654Z

MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: ERROR:main:config file not found: /slurm/scripts/config.yaml
TIMESTAMP: 2023-07-13T16:06:52.207516329Z

MESSAGE: startup-script: WARNING:main:/slurm/scripts/config.yaml not found
TIMESTAMP: 2023-07-13T16:06:52.207785960Z

MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: WARNING:main:/slurm/scripts/config.yaml not found
TIMESTAMP: 2023-07-13T16:06:52.208123517Z

MESSAGE: startup-script: INFO:googleapiclient.discovery_cache:file_cache is only supported with oauth2client<4.0.0
TIMESTAMP: 2023-07-13T16:06:52.628300456Z

MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO:googleapiclient.discovery_cache:file_cache is only supported with oauth2client<4.0.0
TIMESTAMP: 2023-07-13T16:06:52.628872672Z

MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 wall[1596]: wall: user root broadcasted 1 lines (64 chars)
TIMESTAMP: 2023-07-13T16:06:52.834119224Z

MESSAGE: startup-script: ERROR: Error while getting metadata from http://metadata.google.internal/computeMetadata/v1/project/attributes/hpcsmall-slurm-devel
TIMESTAMP: 2023-07-13T16:06:52.853972964Z

MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: ERROR: Error while getting metadata from http://metadata.google.internal/computeMetadata/v1/project/attributes/hpcsmall-slurm-devel
TIMESTAMP: 2023-07-13T16:06:52.854476176Z

MESSAGE: startup-script: INFO: Setting up compute
TIMESTAMP: 2023-07-13T16:06:52.857948832Z

MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Setting up compute
TIMESTAMP: 2023-07-13T16:06:52.858322163Z

MESSAGE: startup-script: INFO: installing custom scripts:
TIMESTAMP: 2023-07-13T16:06:52.861560091Z

MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: installing custom scripts:
TIMESTAMP: 2023-07-13T16:06:52.861886268Z

MESSAGE: startup-script: INFO: Set up network storage
TIMESTAMP: 2023-07-13T16:06:52.897747198Z

MESSAGE: startup-script: INFO: Setting up mount (nfs) 10.165.1.2:/nfsshare to /home
TIMESTAMP: 2023-07-13T16:06:52.897797956Z

MESSAGE: startup-script: INFO: Setting up mount (nfs) hpcsmall-controller:/usr/local/etc/slurm to /usr/local/etc/slurm
TIMESTAMP: 2023-07-13T16:06:52.897816387Z

MESSAGE: startup-script: INFO: Setting up mount (nfs) hpcsmall-controller:/etc/munge to /etc/munge
TIMESTAMP: 2023-07-13T16:06:52.897829486Z

MESSAGE: startup-script: INFO: Setting up mount (nfs) hpcsmall-controller:/opt/apps to /opt/apps
TIMESTAMP: 2023-07-13T16:06:52.897856954Z

MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Set up network storage
TIMESTAMP: 2023-07-13T16:06:52.898968986Z

MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Setting up mount (nfs) 10.165.1.2:/nfsshare to /home
TIMESTAMP: 2023-07-13T16:06:52.898971584Z

MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Setting up mount (nfs) hpcsmall-controller:/usr/local/etc/slurm to /usr/local/etc/slurm
TIMESTAMP: 2023-07-13T16:06:52.898971871Z

MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Setting up mount (nfs) hpcsmall-controller:/etc/munge to /etc/munge
TIMESTAMP: 2023-07-13T16:06:52.898972108Z

MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Setting up mount (nfs) hpcsmall-controller:/opt/apps to /opt/apps
TIMESTAMP: 2023-07-13T16:06:52.898972441Z

MESSAGE: startup-script: DEBUG: : disabling prometheus support
TIMESTAMP: 2023-07-13T16:06:52.946370917Z

MESSAGE: startup-script: Traceback (most recent call last):
TIMESTAMP: 2023-07-13T16:06:52.946422935Z

MESSAGE: startup-script: File "/usr/local/lib/python3.6/site-packages/more_executors/_impl/metrics/__init__.py", line 15, in <module>
TIMESTAMP: 2023-07-13T16:06:52.946436046Z

MESSAGE: startup-script: from .prometheus import PrometheusMetrics
TIMESTAMP: 2023-07-13T16:06:52.946450616Z

MESSAGE: startup-script: File "/usr/local/lib/python3.6/site-packages/more_executors/_impl/metrics/prometheus.py", line 3, in <module>
TIMESTAMP: 2023-07-13T16:06:52.946462044Z

MESSAGE: startup-script: import prometheus_client # pylint: disable=import-error
TIMESTAMP: 2023-07-13T16:06:52.946475497Z

MESSAGE: startup-script: ModuleNotFoundError: No module named 'prometheus_client'
TIMESTAMP: 2023-07-13T16:06:52.946488674Z

MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: DEBUG: : disabling prometheus support
TIMESTAMP: 2023-07-13T16:06:52.947799781Z

MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: Traceback (most recent call last):
TIMESTAMP: 2023-07-13T16:06:52.947804992Z

MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: File "/usr/local/lib/python3.6/site-packages/more_executors/_impl/metrics/__init__.py", line 15, in <module>
TIMESTAMP: 2023-07-13T16:06:52.947805277Z

MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: from .prometheus import PrometheusMetrics
TIMESTAMP: 2023-07-13T16:06:52.947805470Z

MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: File "/usr/local/lib/python3.6/site-packages/more_executors/_impl/metrics/prometheus.py", line 3, in <module>
TIMESTAMP: 2023-07-13T16:06:52.947805690Z

MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: import prometheus_client # pylint: disable=import-error
TIMESTAMP: 2023-07-13T16:06:52.947805920Z

MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: ModuleNotFoundError: No module named 'prometheus_client'
TIMESTAMP: 2023-07-13T16:06:52.947806110Z

MESSAGE: startup-script: INFO: Waiting for '/home' to be mounted...
TIMESTAMP: 2023-07-13T16:06:53.053302125Z

MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Waiting for '/home' to be mounted...
TIMESTAMP: 2023-07-13T16:06:53.053821404Z

MESSAGE: startup-script: INFO: Waiting for '/usr/local/etc/slurm' to be mounted...
TIMESTAMP: 2023-07-13T16:06:53.056804451Z

MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Waiting for '/usr/local/etc/slurm' to be mounted...
TIMESTAMP: 2023-07-13T16:06:53.057254745Z

MESSAGE: startup-script: INFO: Waiting for '/etc/munge' to be mounted...
TIMESTAMP: 2023-07-13T16:06:53.063098357Z

MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Waiting for '/etc/munge' to be mounted...
TIMESTAMP: 2023-07-13T16:06:53.063471884Z

MESSAGE: startup-script: INFO: Waiting for '/opt/apps' to be mounted...
TIMESTAMP: 2023-07-13T16:06:53.067950750Z

MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Waiting for '/opt/apps' to be mounted...
TIMESTAMP: 2023-07-13T16:06:53.068295350Z

MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 kernel: FS-Cache: Loaded
TIMESTAMP: 2023-07-13T16:06:53.141143386Z

MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 kernel: FS-Cache: Netfs 'nfs' registered for caching
TIMESTAMP: 2023-07-13T16:06:53.220549264Z

MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 kernel: Key type dns_resolver registered
TIMESTAMP: 2023-07-13T16:06:53.259145310Z

MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 kernel: NFS: Registering the id_resolver key type
TIMESTAMP: 2023-07-13T16:06:53.292165653Z

MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 kernel: Key type id_resolver registered
TIMESTAMP: 2023-07-13T16:06:53.292169528Z

MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 kernel: Key type id_legacy registered
TIMESTAMP: 2023-07-13T16:06:53.297774099Z

MESSAGE: startup-script: INFO: Mount point '/opt/apps' was mounted.
TIMESTAMP: 2023-07-13T16:06:53.351928114Z

MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Mount point '/opt/apps' was mounted.
TIMESTAMP: 2023-07-13T16:06:53.352571723Z

MESSAGE: startup-script: INFO: Mount point '/etc/munge' was mounted.
TIMESTAMP: 2023-07-13T16:06:53.356839746Z

MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Mount point '/etc/munge' was mounted.
TIMESTAMP: 2023-07-13T16:06:53.357466454Z

MESSAGE: startup-script: INFO: Mount point '/usr/local/etc/slurm' was mounted.
TIMESTAMP: 2023-07-13T16:06:53.362189366Z

MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Mount point '/usr/local/etc/slurm' was mounted.
TIMESTAMP: 2023-07-13T16:06:53.362728110Z

MESSAGE: startup-script: INFO: Mount point '/home' was mounted.
TIMESTAMP: 2023-07-13T16:06:56.358536552Z

MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Mount point '/home' was mounted.
TIMESTAMP: 2023-07-13T16:06:56.359568384Z

MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: DEBUG: run_custom_scripts: custom scripts to run: /slurm/custom_scripts/()
TIMESTAMP: 2023-07-13T16:06:56.599696464Z

MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: Starting MUNGE authentication service...
TIMESTAMP: 2023-07-13T16:06:56.633712278Z

MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: Started MUNGE authentication service.
TIMESTAMP: 2023-07-13T16:06:56.681955497Z

MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: Reloading.
TIMESTAMP: 2023-07-13T16:06:56.699265986Z

MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: [/usr/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service:23] Unknown lvalue 'StateDirectory' in section 'Service'
TIMESTAMP: 2023-07-13T16:06:56.736840989Z

MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: [/usr/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service:24] Unknown lvalue 'LogsDirectory' in section 'Service'
TIMESTAMP: 2023-07-13T16:06:56.737233032Z

MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: [/usr/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service:30] Unknown lvalue 'RuntimeDirectoryPreserve' in section 'Service'
TIMESTAMP: 2023-07-13T16:06:56.737609049Z

MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: [/usr/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service:23] Unknown lvalue 'StateDirectory' in section 'Service'
TIMESTAMP: 2023-07-13T16:06:56.738059961Z

MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: [/usr/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service:24] Unknown lvalue 'LogsDirectory' in section 'Service'
TIMESTAMP: 2023-07-13T16:06:56.738410537Z

MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: [/usr/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service:30] Unknown lvalue 'RuntimeDirectoryPreserve' in section 'Service'
TIMESTAMP: 2023-07-13T16:06:56.738700889Z

MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: Started Slurm node daemon.
TIMESTAMP: 2023-07-13T16:06:56.769598605Z

MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: Reloading.
TIMESTAMP: 2023-07-13T16:06:56.819583430Z

MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: [/usr/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service:23] Unknown lvalue 'StateDirectory' in section 'Service'
TIMESTAMP: 2023-07-13T16:06:56.851831316Z

MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: [/usr/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service:24] Unknown lvalue 'LogsDirectory' in section 'Service'
TIMESTAMP: 2023-07-13T16:06:56.852293452Z

MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: [/usr/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service:30] Unknown lvalue 'RuntimeDirectoryPreserve' in section 'Service'
TIMESTAMP: 2023-07-13T16:06:56.852662558Z

MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: [/usr/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service:23] Unknown lvalue 'StateDirectory' in section 'Service'
TIMESTAMP: 2023-07-13T16:06:56.853097143Z

MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: [/usr/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service:24] Unknown lvalue 'LogsDirectory' in section 'Service'
TIMESTAMP: 2023-07-13T16:06:56.853522462Z

MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: [/usr/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service:30] Unknown lvalue 'RuntimeDirectoryPreserve' in section 'Service'
TIMESTAMP: 2023-07-13T16:06:56.853828043Z

MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: Started Slurm Cluster Event Daemon.
TIMESTAMP: 2023-07-13T16:06:56.884289205Z

MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Check status of cluster services
TIMESTAMP: 2023-07-13T16:06:56.916557223Z

MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Done setting up compute
TIMESTAMP: 2023-07-13T16:06:57.082362712Z

MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 wall[1712]: wall: user root broadcasted 1 lines (38 chars)
TIMESTAMP: 2023-07-13T16:06:57.087527857Z

MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 wall[1714]: wall: user root broadcasted 4 lines (118 chars)
TIMESTAMP: 2023-07-13T16:06:57.097393110Z

MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 slurmd: slurmd: slurmd version 22.05.4 started
TIMESTAMP: 2023-07-13T16:06:57.179327190Z

MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script exit status 0
TIMESTAMP: 2023-07-13T16:06:57.264507543Z

MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 google_metadata_script_runner: Finished running startup scripts.
TIMESTAMP: 2023-07-13T16:06:57.264983531Z

MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 systemd: Started Google Compute Engine Startup Scripts.
TIMESTAMP: 2023-07-13T16:06:57.269071781Z

MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 systemd: Reached target Multi-User System.
TIMESTAMP: 2023-07-13T16:06:57.269073565Z

MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 systemd: Starting Update UTMP about System Runlevel Changes...
TIMESTAMP: 2023-07-13T16:06:57.269073886Z

MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 systemd: Started Update UTMP about System Runlevel Changes.
TIMESTAMP: 2023-07-13T16:06:57.284062159Z

MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 systemd: Startup finished in 642ms (kernel) + 2.765s (initrd) + 33.008s (userspace) = 36.416s.
TIMESTAMP: 2023-07-13T16:06:57.284064657Z

MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 slurmeventd.py: INFO:googleapiclient.discovery_cache:file_cache is only supported with oauth2client<4.0.0
TIMESTAMP: 2023-07-13T16:06:57.367946810Z

MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 slurmd: slurmd: CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=7818 TmpDisk=50988 Uptime=37 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
TIMESTAMP: 2023-07-13T16:06:57.378436736Z

MESSAGE: Jul 13 16:06:58 hpcsmall-debug-ghpc-0 slurmd: slurmd: launch task StepId=25.0 request from UID:2099065396 GID:2099065396 HOST:10.161.0.60 PORT:50428
TIMESTAMP: 2023-07-13T16:06:58.491820935Z

MESSAGE: Jul 13 16:06:58 hpcsmall-debug-ghpc-0 slurmd: slurmd: task/affinity: lllp_distribution: JobId=25 implicit auto binding: sockets,one_thread, dist 8192
TIMESTAMP: 2023-07-13T16:06:58.491826766Z

MESSAGE: Jul 13 16:06:58 hpcsmall-debug-ghpc-0 slurmd: slurmd: task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
TIMESTAMP: 2023-07-13T16:06:58.491827112Z

MESSAGE: Jul 13 16:06:58 hpcsmall-debug-ghpc-0 slurmd: slurmd: task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [25]: mask_cpu,one_thread, 0x1
TIMESTAMP: 2023-07-13T16:06:58.491827397Z

MESSAGE:
TIMESTAMP: 2023-07-13T16:06:59.540723Z

MESSAGE: Jul 13 16:07:00 hpcsmall-debug-ghpc-0 systemd-logind: Power key pressed.
TIMESTAMP: 2023-07-13T16:07:00.302885280Z

MESSAGE: Jul 13 16:07:00 hpcsmall-debug-ghpc-0 systemd-logind: Powering Off...
TIMESTAMP: 2023-07-13T16:07:00.302888277Z

MESSAGE: Jul 13 16:07:00 hpcsmall-debug-ghpc-0 systemd-logind: System is powering down.
TIMESTAMP: 2023-07-13T16:07:00.302888589Z

MESSAGE:
TIMESTAMP: 2023-07-13T16:07:00.465454Z

MESSAGE:
TIMESTAMP: 2023-07-13T16:07:04.248951070Z

MESSAGE:
TIMESTAMP: 2023-07-13T16:07:14.815270Z

MESSAGE:
TIMESTAMP: 2023-07-13T16:07:14.815774Z

from hpc-toolkit.

sharif-cameco avatar sharif-cameco commented on August 28, 2024

MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: ERROR:main:config file not found: /slurm/scripts/config.yaml

Is the location wrong, or should the config file be there but it isn't?

from hpc-toolkit.

tpdownes avatar tpdownes commented on August 28, 2024

I realize there are a lot of warnings in there but many of them are retries.

MESSAGE: Jul 13 16:06:58 hpcsmall-debug-ghpc-0 slurmd: slurmd: launch task StepId=25.0 request from UID:2099065396 GID:2099065396 HOST:10.161.0.60 PORT:50428
TIMESTAMP: 2023-07-13T16:06:58.491820935Z

MESSAGE: Jul 13 16:06:58 hpcsmall-debug-ghpc-0 slurmd: slurmd: task/affinity: lllp_distribution: JobId=25 implicit auto binding: sockets,one_thread, dist 8192
TIMESTAMP: 2023-07-13T16:06:58.491826766Z

MESSAGE: Jul 13 16:06:58 hpcsmall-debug-ghpc-0 slurmd: slurmd: task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
TIMESTAMP: 2023-07-13T16:06:58.491827112Z

MESSAGE: Jul 13 16:06:58 hpcsmall-debug-ghpc-0 slurmd: slurmd: task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [25]: mask_cpu,one_thread, 0x1
TIMESTAMP: 2023-07-13T16:06:58.491827397Z

This reads as though job 25 may have matched, executed, and finished. I will confirm by looking at other logs on a Slurm cluster I provisioned.

What is odd is that you are using the debug partition (configured with "exclusive: false"), which should cause the node to remain powered on for several minutes after a job completes. Did you alter any settings in hpc-slurm.yaml?

from hpc-toolkit.

tpdownes avatar tpdownes commented on August 28, 2024

This error:

MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: ERROR:main:config file not found: /slurm/scripts/config.yaml

and the Slurm version "22.05.4" are both leaping out at me. If you are running the tutorial from the most recent commit on main, you should have version v1.20.0 which would provision a node with 22.05.9. git log would begin with this:

commit 252694acbe160611948341ba24f6e010539cfa52 (HEAD -> main, tag: v1.20.0, upstream/main, origin/main, origin/HEAD)

Did you run this tutorial a while back and are coming back to it? You might start with a git pull while on the main branch.
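To confirm which Slurm version a node actually booted, the version can be pulled out of the slurmd start-up line in the logs. This is only a minimal sketch: it assumes the log text arrives on stdin, and the function name slurmd_version is made up for illustration.

```shell
# Hypothetical helper: print the version found in lines such as
#   "... slurmd: slurmd: slurmd version 22.05.4 started"
# Lines without a slurmd start-up message produce no output.
slurmd_version() {
  sed -n 's/.*slurmd version \([0-9][0-9.]*\) started.*/\1/p'
}
```

For example, piping the output of the "all logs associated with the VM" gcloud logging command from earlier in this thread into slurmd_version would show whether the node came up on 22.05.4 or 22.05.9.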

from hpc-toolkit.

sharif-cameco avatar sharif-cameco commented on August 28, 2024

Hi Tom

Yes, I deployed it 3 months ago, and now the team wants to use it. OK, I am getting the latest version.
Thank you very much.

Regards
Sharif

from hpc-toolkit.

tpdownes avatar tpdownes commented on August 28, 2024

The crux of the matter is this error, which would be fatal (the Slurm machine boots up but can't configure itself):

google_metadata_script_runner: startup-script: ERROR:main:config file not found: /slurm/scripts/config.yaml
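Lines like this are easy to miss among the retries and warnings. One way to surface them is to filter the startup-script log for ERROR-level entries. This is a rough sketch, assuming the log text is piped on stdin; startup_errors is a hypothetical name, not part of the toolkit.

```shell
# Hypothetical helper: keep only startup-script lines logged at ERROR level.
# "|| true" keeps the exit status at 0 when nothing matches.
startup_errors() {
  grep -E 'startup-script: ERROR:' || true
}
```

It can be appended to the GCEMetadataScripts gcloud logging command shown earlier in this thread, for example `... --freshness 24h | tac | startup_errors`.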

I think there may have been a bug a couple of months ago, quickly fixed, that would produce this error. If you run srun -N3 hostname on the latest release, you should observe:

  • 3 VMs power on
  • hostname job runs quickly
  • the machines power down after 5 minutes of being idle

Please open a new issue if you do not see that. Thanks!
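A quick way to check the first two points is to count how many distinct nodes answered the hostname job. This is a small sketch, assuming the job output is piped on stdin; distinct_hosts is a hypothetical helper name.

```shell
# Hypothetical helper: count the distinct non-empty lines on stdin,
# i.e. the number of different hostnames that reported back.
distinct_hosts() {
  sort -u | grep -c .
}
```

Running `srun -N3 hostname | distinct_hosts` on a healthy cluster would be expected to report 3, one per node.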

from hpc-toolkit.

sharif-cameco avatar sharif-cameco commented on August 28, 2024

Hi

I deployed version 1.20 and am getting the following error:

[issharif_c_cameco_com@hpcsmall-login-vicyomx9-001 ~]$ srun -N3 hostname
srun: error: Node failure on hpcsmall-debug-ghpc-0
srun: error: Nodes hpcsmall-debug-ghpc-[0-2] are still not ready
srun: error: Something is wrong with the boot of the nodes.
[issharif_c_cameco_com@hpcsmall-login-vicyomx9-001 ~]$

Regards

from hpc-toolkit.
