Comments (7)
I have added retries that should prevent this failure in the future. I have also added an integration test for chrome-remote-desktop, which will help keep this installation robust to changes. I am going to consider this bug fixed. Please re-open if you feel the fix does not address the bug.
from hpc-toolkit.
This seems like a bug in the setup process since installing the chromoting tool seems to be part of the startup scripts of the module:
There seems to be a conflict over apt/dpkg locking in the startup script:
Mar 13 14:03:27 radlab-remote-desktop-0 google_metadata_script_runner[1570]: startup-script: Mon Mar 13 14:03:27 +0000 2023 Info [1778]: === start executing runner: configure-grid-drivers.yml ===
Mar 13 14:03:27 radlab-remote-desktop-0 systemd[1]: Started Daemon for generating UUIDs.
Mar 13 14:03:28 radlab-remote-desktop-0 google_metadata_script_runner[1570]: startup-script:
Mar 13 14:03:28 radlab-remote-desktop-0 google_metadata_script_runner[1570]: startup-script: PLAY [Ensure nvidia grid drivers and other binaries are installed] *************
Mar 13 14:03:28 radlab-remote-desktop-0 google_metadata_script_runner[1570]: startup-script:
Mar 13 14:03:28 radlab-remote-desktop-0 google_metadata_script_runner[1570]: startup-script: TASK [Gathering Facts] *********************************************************
Mar 13 14:03:28 radlab-remote-desktop-0 dbus-daemon[635]: [system] Reloaded configuration
Mar 13 14:03:28 radlab-remote-desktop-0 dbus-daemon[635]: message repeated 4 times: [ [system] Reloaded configuration]
Mar 13 14:03:29 radlab-remote-desktop-0 google_metadata_script_runner[1570]: startup-script: ok: [localhost]
Mar 13 14:03:29 radlab-remote-desktop-0 google_metadata_script_runner[1570]: startup-script:
Mar 13 14:03:29 radlab-remote-desktop-0 google_metadata_script_runner[1570]: startup-script: TASK [Get kernel release] ******************************************************
Mar 13 14:03:29 radlab-remote-desktop-0 google_metadata_script_runner[1570]: startup-script: ok: [localhost]
Mar 13 14:03:29 radlab-remote-desktop-0 google_metadata_script_runner[1570]: startup-script:
Mar 13 14:03:29 radlab-remote-desktop-0 google_metadata_script_runner[1570]: startup-script: TASK [Install binaries for GRID drivers] ***************************************
Mar 13 14:03:31 radlab-remote-desktop-0 systemd[1]: Starting Update APT News...
Mar 13 14:03:31 radlab-remote-desktop-0 systemd[1]: Starting Update the local ESM caches...
Mar 13 14:03:31 radlab-remote-desktop-0 systemd[1]: apt-news.service: Deactivated successfully.
Mar 13 14:03:31 radlab-remote-desktop-0 systemd[1]: Finished Update APT News.
Mar 13 14:03:31 radlab-remote-desktop-0 systemd[1]: esm-cache.service: Deactivated successfully.
Mar 13 14:03:31 radlab-remote-desktop-0 systemd[1]: Finished Update the local ESM caches.
Mar 13 14:03:32 radlab-remote-desktop-0 dbus-daemon[635]: [system] Reloaded configuration
Mar 13 14:03:32 radlab-remote-desktop-0 dbus-daemon[635]: message repeated 2 times: [ [system] Reloaded configuration]
Mar 13 14:03:32 radlab-remote-desktop-0 dbus-daemon[635]: Unknown username "rtkit" in message bus configuration file
Mar 13 14:03:32 radlab-remote-desktop-0 dbus-daemon[635]: [system] Reloaded configuration
Mar 13 14:03:32 radlab-remote-desktop-0 dbus-daemon[635]: Unknown username "rtkit" in message bus configuration file
Mar 13 14:03:32 radlab-remote-desktop-0 dbus-daemon[635]: [system] Reloaded configuration
Mar 13 14:03:33 radlab-remote-desktop-0 google_metadata_script_runner[1570]: startup-script: fatal: [localhost]: FAILED! => {"cache_update_time": 1678716211, "cache_updated": true, "changed": false, "msg": "'/usr/bin/apt-get -y -o \"Dpkg::Options::=--force-confdef\" -o \"Dpkg::Options::=--force-confold\" install 'gdebi-core' 'mesa-utils' 'gdm3'' failed: E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 3339 (apt-get)\nE: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?\n", "rc": 100, "stderr": "E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 3339 (apt-get)\nE: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?\n", "stderr_lines": ["E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 3339 (apt-get)", "E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?"], "stdout": "", "stdout_lines": []}
Mar 13 14:03:33 radlab-remote-desktop-0 google_metadata_script_runner[1570]: startup-script:
Mar 13 14:03:33 radlab-remote-desktop-0 google_metadata_script_runner[1570]: startup-script: PLAY RECAP *********************************************************************
Mar 13 14:03:33 radlab-remote-desktop-0 google_metadata_script_runner[1570]: startup-script: localhost : ok=2 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
Mar 13 14:03:33 radlab-remote-desktop-0 google_metadata_script_runner[1570]: startup-script:
Mar 13 14:03:33 radlab-remote-desktop-0 google_metadata_script_runner[1570]: startup-script: Mon Mar 13 14:03:33 +0000 2023 Info [1778]: === configure-grid-drivers.yml finished with exit_code=2 ===
Mar 13 14:03:33 radlab-remote-desktop-0 google_metadata_script_runner[1570]: startup-script: Mon Mar 13 14:03:33 +0000 2023 Error [1778]: === execution of configure-grid-drivers.yml failed, exiting ===
Mar 13 14:03:33 radlab-remote-desktop-0 google_metadata_script_runner[1570]: startup-script: Mon Mar 13 14:03:33 +0000 2023 Info [1576]: === passed_startup_script.sh finished with exit_code=2 ===
Mar 13 14:03:33 radlab-remote-desktop-0 google_metadata_script_runner[1570]: startup-script: Mon Mar 13 14:03:33 +0000 2023 Error [1576]: === execution of passed_startup_script.sh failed, exiting ===
Mar 13 14:03:33 radlab-remote-desktop-0 google_metadata_script_runner[1570]: startup-script exit status 2
Mar 13 14:03:33 radlab-remote-desktop-0 google_metadata_script_runner[1570]: Finished running startup scripts.
Maybe there is an option to have Ansible wait for the dpkg lock before running the recipe?
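One option along these lines, offered as a sketch and not as the fix that was adopted here: the `ansible.builtin.apt` module accepts a `lock_timeout` parameter (in ansible-core 2.12 and later, to my knowledge), which makes the task wait for the apt/dpkg lock instead of failing immediately. The package list is taken from the failing task in the log above; the 300-second value is an assumption.

```yaml
- name: Install binaries for GRID drivers
  ansible.builtin.apt:
    name:
      - gdebi-core
      - mesa-utils
      - gdm3
    state: present
    update_cache: true
    # Assumed value: wait up to 300 seconds for another apt/dpkg
    # process (for example unattended-upgrades) to release the lock.
    lock_timeout: 300
```

If the lock is still held when the timeout expires, the task fails as before, so a timeout shorter than a typical background apt run may not help.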
/cc @nick-stroud
I suspect this comes from a conflict with unattended-upgrades holding the lock. We have seen similar failures before with startup scripts on Debian-based images. Historically, our approach has been to add retries.
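A retry of this kind can be expressed with Ansible's task-level `register`, `until`, `retries`, and `delay` keywords. The following is a minimal sketch under stated assumptions, not the actual change made in the toolkit: the variable name `apt_result`, the retry count, and the delay are all illustrative, and the package list is copied from the failing task in the log.

```yaml
- name: Install binaries for GRID drivers
  ansible.builtin.apt:
    name:
      - gdebi-core
      - mesa-utils
      - gdm3
    state: present
    update_cache: true
  # Hypothetical variable name; holds the result of each attempt.
  register: apt_result
  # Re-run the task until it succeeds, e.g. once the process
  # holding /var/lib/dpkg/lock-frontend has finished.
  until: apt_result is succeeded
  # Assumed values: up to 10 retries, 30 seconds apart.
  retries: 10
  delay: 30
```

With these values the task keeps trying for roughly five minutes before the play fails, which trades a slower startup in the worst case for tolerance of a background apt run.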
Released in v1.16.0.