guest-agent's Introduction

Guest Agent for Google Compute Engine

This repository contains the source code and packaging artifacts for the Google guest agent and metadata script runner binaries. These components are installed on Windows and Linux GCE VMs in order to enable GCE platform features.

Overview

The repository contains these components:

  • google-guest-agent daemon, which handles all of the areas outlined under "Features" below.
  • google-metadata-script-runner binary, which runs user-provided scripts at VM startup and shutdown.

Features

The guest agent functionality can be separated into various areas of responsibility. Historically, on Linux these were managed by separate independent processes, but today they are all managed by the guest agent.

The Daemons section of the instance configs file on Linux refers to these areas of responsibility. This allows a user to easily modify or disable functionality. Behaviors for each area of responsibility are detailed below.

Account management

On Windows, the agent handles creating user accounts and setting/resetting passwords.

On Linux: If OS Login is not used, the guest agent is responsible for provisioning and deprovisioning user accounts. The agent creates local user accounts and maintains the authorized SSH keys file for each. User accounts are created and removed as SSH keys are added to and removed from metadata.

The guest agent has the following behaviors:

  • Administrator permissions are managed with a google-sudoers Linux group. Members of this group are granted sudo permissions on the VM.
  • All users provisioned by the account daemon are added to the google-sudoers group.
  • The daemon stores a file in the guest to record which user accounts are managed by Google.
  • User accounts not managed by the agent are not touched by the accounts daemon.
  • The authorized keys file for a Google managed user is deleted when all SSH keys for the user are removed from metadata.
  • User accounts managed by the agent will be added to the groups listed in the groups config line in the Accounts section. If these groups do not exist, the agent will not create them.

OS Login

(Linux only)

If the user has configured OS Login via metadata, the guest agent will be responsible for configuring the OS to use OS Login, otherwise called 'enabling' OS Login. This consists of:

  • Adding a Google config block to the SSHD configuration file and restarting SSHD.
  • Adding OS Login entries to the nsswitch.conf file.
  • Adding OS Login entries to the PAM configuration file for SSHD.

If the user disables OS login via metadata, the configuration changes will be removed.

Note that options under the Accounts section of the configuration do not apply to oslogin users.

Clock Skew

(Linux only)

The guest agent is responsible for syncing the software clock with the hypervisor clock after a stop/start event or after a migration. Preventing clock skew may result in "system time has changed" messages in VM logs.

Network

The guest agent uses network interface metadata to manage the network interfaces in the guest by performing the following tasks:

  • Enable all associated network interfaces on boot.
  • Set up or remove IP routes in the guest for IP forwarding and IP aliases:
    • Only IPv4 IP addresses are currently supported.
    • Routes are set on the primary ethernet interface.
    • Google routes are configured, by default, with the routing protocol ID 66. This ID is a namespace for daemon configured IP addresses. It can be changed with the config file, see below.
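As a rough illustration of the route tagging described above, the sketch below builds the argument list for an `ip route` command that adds a local host route carrying protocol ID 66. The function name `localRouteArgs`, the sample address, and the exact argument layout are assumptions made for illustration, not the agent's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// localRouteArgs returns hypothetical arguments for the `ip` tool that add a
// local host route for addr on iface, tagged with the given protocol ID.
func localRouteArgs(addr, iface string, protoID int) []string {
	return []string{
		"route", "add", "to", "local", addr,
		"scope", "host", "dev", iface,
		"proto", fmt.Sprint(protoID),
	}
}

func main() {
	// 66 is the default protocol ID mentioned above; the address is made up.
	args := localRouteArgs("10.128.0.50/32", "eth0", 66)
	fmt.Println("ip " + strings.Join(args, " "))
}
```

Because the routes share one protocol ID, they can be listed separately from other routes, for example with `ip route list table local proto 66`.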

Windows Failover Cluster Support

(Windows only)

The agent can monitor the active node in the Windows Failover Cluster and coordinate with GCP Internal Load Balancer to forward all cluster traffic to the expected node.

The following fields on instance metadata or instance_configs.cfg can control the behavior:

  • enable-wsfc: If set to true, all IP forwarding info will be ignored and the agent will start responding to the health check port. Default: false.
  • wsfc-agent-port: The port on which the agent responds to health checks. Default: 59998.
  • wsfc-addrs: A comma-separated list of IP addresses. This is an advanced setting that lets a user have both normal forwarding IPs and cluster IPs on the same instance. If set, the agent will skip auto-configuring only the IPs in the list. Default: empty.

Instance Setup

(Linux only)

The guest agent will perform some actions on every startup:

  • Optimize for local SSD.
  • Enable multi-queue on all the virtionet devices.

The guest agent will perform some actions one time only, on the first VM boot:

  • Generate SSH host keys.
  • Create the boto config for using Google Cloud Storage.

Telemetry

The guest agent will record some basic system telemetry information at start and then once every 24 hours.

  • Guest agent version and architecture
  • Operating system name and version
  • Operating system kernel release and version

Telemetry can be disabled by setting the metadata key disable-guest-telemetry to true.

MTLS MDS

GCE Shielded VMs now support an HTTPS endpoint for the metadata server, https://metadata.google.internal/computeMetadata/v1. To enable communication with this endpoint, the guest agent retrieves credentials and stores them on the VM's disk in a standard location, making them accessible to any client application running on the VM. These credentials are unique to an instance and will not work elsewhere.

Both the root certificate and the client credentials are updated each time the guest-agent process starts. For enhanced security, client credentials are also refreshed automatically every 48 hours: the agent generates and saves new credentials while the old ones remain valid. This overlap period gives clients time to transition to the new credentials before the old ones expire, and it lets the agent retry after a failure and obtain valid credentials before the existing ones become invalid. The client credentials file is the EC private key and the client certificate concatenated.

Refer to the Compute Engine documentation for more information on the HTTPS metadata server endpoint and credential details, including their lifespan.

Credentials are stored at these locations:

  • Linux:

    • Client credentials: /run/google-mds-mtls/client.key
    • Root certificate: /run/google-mds-mtls/root.crt and the local trust store, whose location depends on the target OS.
  • Windows:

    • Client credentials: C:\ProgramData\Google\ComputeEngine\mds-mtls-client.key and Cert:\LocalMachine\My
    • Root certificate: C:\ProgramData\Google\ComputeEngine\mds-mtls-root.crt and Cert:\LocalMachine\Root
    • PFX: C:\ProgramData\Google\Compute Engine\mds-mtls-client.key.pfx

    On Windows, credentials are stored on disk as well as in the Certificate Store.

Note that this is enabled automatically if the HTTPS endpoint is supported on a VM. It can be disabled by setting mtls_bootstrapping_enabled = false under the [MDS] section of the instance_configs.cfg file.

The local root trust store is updated by running the update-ca-certificates or update-ca-trust tool, depending on the OS, or by adding the certificate to Cert:\LocalMachine\Root on Windows. This can be disabled separately by setting cacertificates_update_enabled = false under the same [MDS] section.
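A client on a Linux VM could consume these files roughly as follows. This is a minimal sketch, assuming the client credentials file holds the EC private key and the certificate as concatenated PEM blocks, as described above; the function name `newMDSTLSConfig` is hypothetical and is not part of the guest agent.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

// newMDSTLSConfig builds a TLS config from the PEM contents of the client
// credentials file (private key and certificate concatenated) and of the
// root certificate file.
func newMDSTLSConfig(clientPEM, rootPEM []byte) (*tls.Config, error) {
	// The same bytes are passed twice: X509KeyPair takes the CERTIFICATE
	// blocks from the first argument and the private key block from the second.
	cert, err := tls.X509KeyPair(clientPEM, clientPEM)
	if err != nil {
		return nil, fmt.Errorf("parsing client credentials: %w", err)
	}
	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM(rootPEM) {
		return nil, errors.New("no root certificate found in PEM data")
	}
	return &tls.Config{Certificates: []tls.Certificate{cert}, RootCAs: roots}, nil
}

func main() {
	// On a VM, the inputs would be read from the locations listed above,
	// e.g. /run/google-mds-mtls/client.key and /run/google-mds-mtls/root.crt.
	// Placeholder bytes are used here, so an error is the expected result.
	_, err := newMDSTLSConfig([]byte("not pem"), []byte("not pem"))
	fmt.Println("error:", err)
}
```

Since the agent refreshes the client credentials periodically, a long-running client would likely need to re-read the files rather than load them once at startup.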

Metadata Scripts

Metadata scripts implement support for running user-provided startup scripts and shutdown scripts. The guest support for metadata scripts is implemented in Go, in the google-metadata-script-runner binary, with the following design details.

  • Metadata scripts are executed in a shell.
  • If multiple metadata keys are specified (e.g. startup-script and startup-script-url), both are executed, and the URL script is executed first.
  • The exit status of a metadata script is logged after execution completes.

Configuration

Users of Google provided images may configure the guest environment behaviors using a configuration file.

To make configuration changes on Windows, follow these instructions

To make configuration changes on Linux, add settings to /etc/default/instance_configs.cfg. If you are attempting to change the behavior of a running instance, restart the guest agent after modifying.

Linux distributions looking to include their own defaults can specify settings in /etc/default/instance_configs.cfg.distro. These settings will not override /etc/default/instance_configs.cfg. This enables distribution settings that do not override user configuration during package update.
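The precedence between the two files can be pictured as a layered merge. The sketch below is only an illustration of that rule, with a hypothetical `settings` type keyed by "Section.option"; it does not reflect how the agent actually parses or loads its configuration.

```go
package main

import "fmt"

// settings maps "Section.option" to a value (hypothetical representation).
type settings map[string]string

// merge layers user settings over distro defaults, so that a value set in
// the user file always wins over the same key in the distro file.
func merge(distro, user settings) settings {
	out := settings{}
	for k, v := range distro {
		out[k] = v
	}
	for k, v := range user {
		out[k] = v
	}
	return out
}

func main() {
	distro := settings{
		"Daemons.network_daemon":        "false",
		"MetadataScripts.default_shell": "/bin/sh",
	}
	user := settings{"Daemons.network_daemon": "true"}
	fmt.Println(merge(distro, user))
}
```

Keys present only in the distro file survive the merge, which is what lets a distribution ship defaults without overriding user choices.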

The following are valid user configuration options.

Section            Option               Value
Accounts           deprovision_remove   true makes deprovisioning a user destructive.
Accounts           groups               Comma-separated list of groups for newly provisioned users created from metadata ssh keys.
Accounts           useradd_cmd          Command string to create a new user.
Accounts           userdel_cmd          Command string to delete a user.
Accounts           usermod_cmd          Command string to modify a user's groups.
Accounts           gpasswd_add_cmd      Command string to add a user to a group.
Accounts           gpasswd_remove_cmd   Command string to remove a user from a group.
Accounts           groupadd_cmd         Command string to create a new group.
Daemons            accounts_daemon      false disables the accounts daemon.
Daemons            clock_skew_daemon    false disables the clock skew daemon.
Daemons            network_daemon       false disables the network daemon.
InstanceSetup      host_key_types       Comma-separated list of host key types to generate.
InstanceSetup      optimize_local_ssd   false prevents optimizing for local SSD.
InstanceSetup      network_enabled      false skips instance setup functions that require metadata.
InstanceSetup      set_boto_config      false skips setting up a boto config.
InstanceSetup      set_host_keys        false skips generating host keys on first boot.
InstanceSetup      set_multiqueue       false skips multiqueue driver support.
IpForwarding       ethernet_proto_id    Protocol ID string for daemon added routes.
IpForwarding       ip_aliases           false disables setting up alias IP routes.
IpForwarding       target_instance_ips  false disables internal IP address load balancing.
MetadataScripts    default_shell        String with the default shell to execute scripts.
MetadataScripts    run_dir              String base directory where metadata scripts are executed.
MetadataScripts    startup              false disables startup script execution.
MetadataScripts    shutdown             false disables shutdown script execution.
NetworkInterfaces  setup                false skips network interface setup.
NetworkInterfaces  ip_forwarding        false skips IP forwarding.
NetworkInterfaces  manage_primary_nic   true will start managing the primary NIC in addition to the secondary NICs.
NetworkInterfaces  dhcp_command         String path for alternate dhcp executable used to enable network interfaces.
OSLogin            cert_authentication  false prevents guest-agent from setting up sshd's TrustedUserCAKeys, AuthorizedPrincipalsCommand and AuthorizedPrincipalsCommandUser configuration keys. Default value: true.

Setting network_enabled to false will disable generating host keys and the boto config in the guest.
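As an illustration, a /etc/default/instance_configs.cfg that disables the clock skew daemon, skips the boto config, and changes the default shell might look like the fragment below. The option names come from the table above; the chosen values and the `key = value` layout (mirroring the mtls_bootstrapping_enabled example earlier) are assumptions, and the fragment has not been tested against a running agent.

```
[Daemons]
clock_skew_daemon = false

[InstanceSetup]
set_boto_config = false

[MetadataScripts]
default_shell = /bin/bash
```

On a running Linux instance, the agent would then need a restart to pick up the change, for example by restarting the google-guest-agent service.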

Packaging

The guest agent and metadata script runner are packaged in DEB, RPM or Googet format packages which are published to Google Cloud repositories and preinstalled on Google managed GCE Images. Packaging scripts for each platform are stored in the packaging/ directory.

We build the following packages for the Windows guest environment:

  • google-compute-engine-windows - contains the guest agent executable.
  • google-compute-engine-metadata-scripts - contains files to run startup and shutdown scripts.

We build the following packages for the Linux guest environment:

  • google-guest-agent - contains the guest agent and metadata script runner executables, as well as service files for both.

guest-agent's People

Contributors

a-crate, adjackura, arekkusu, bkatyl, bpl4vv, chaitanyakulkarni28, dorileo, drewhli, ericdand, ericedens, gaohannk, hopkiw, illfelder, jjerger, koln67, lawrencehwang, linskeyd, matir, maxnelso, mike-kochera, oleksiyivanenko, patelne, rofuentes, sejalsharma-google, tavishvaidya, therealfalcon, tpdownes, wrigri, yanglu1031, zmarano

guest-agent's Issues

Error messages with immutable `/etc`

On NixOS we have most of /etc generated with our declarative configuration system and mounted read-only (sans /etc/{users,groups,shadow,gshadow} when we allow mutable users). For example, we have our GCE support described in one of our configuration modules.

Because of this approach, the parts of Guest Agent that set up the system environment by modifying configuration files fail, producing error messages like these:

2022-01-10T14:38:39.5728Z GCEGuestAgent Error non_windows_accounts.go:107: Error creating google-sudoers file: open /etc/sudoers.d/google_sudoers: no such file or directory.
2022-01-10T14:38:39.6548Z GCEGuestAgent Error oslogin.go:91: Error updating SSH config: open /etc/ssh/sshd_config: read-only file system.
2022-01-10T14:38:39.6557Z GCEGuestAgent Error oslogin.go:99: Error updating PAM config: open /etc/pam.d/sshd: read-only file system.
2022-01-10T14:38:39.6558Z GCEGuestAgent Error oslogin.go:103: Error updating group.conf: open /etc/security/group.conf: no such file or directory.
2022-01-10T14:38:39.6564Z GCEGuestAgent Error oslogin.go:118: Error creating OS Login sudoers file: open /etc/sudoers.d/google-oslogin: no such file or directory.

Our current approach is adding instance configuration options for that, as described in #152. However, as @hopkiw mentioned, extending these is to be avoided. What other solutions are possible?

Unable to inspect configuration of running google-guest-agent

Filed in Ubuntu Bug #1901042 as well.

We have a cloud-images qualification test for the google-guest-agent to ensure the following daemons are enabled for GCE images:

  • accounts_daemon
  • clock_skew_daemon
  • network_daemon

I could not find a way to inspect a currently running agent to determine the configuration in use. The agent only takes the following actions:

  • run
  • install
  • remove
  • start
  • stop
  • help

An additional action to list the configuration in use would be very helpful.

I also looked at the output of journalctl -u google-guest-agent.service, but the agent does not log its configuration settings there either.

Confusing error message when SSH service is called 'sshd'

The change in #106 added the capability to start the SSH service:

	// SSH should be explicitly started if not running.
	for _, svc := range []string{"ssh", "sshd"} {
		if err := startService(svc, true); err != nil {
			logger.Errorf("Error restarting service: %v.", err)
		} else {
			// Stop on first matching, to avoid double restarting.
			break
		}
	}

On openSUSE/SLE systems, the SSH service is called sshd. As a result, the above snippet writes a confusing error message to the daemon's log: the agent first tries to start ssh, logs an error, and then tries to start sshd, which succeeds.

May 11 11:07:33 sle-15-jpag-test GCEGuestAgent[1147]: 2021-05-11T11:07:33.2492Z GCEGuestAgent Error oslogin.go:115: Error restarting service: Failed to start ssh.service: Unit ssh.service not found.
                                                      .
May 11 11:07:33 sle-15-jpag-test google_guest_agent[1147]: 2021/05/11 11:07:33 logging client: rpc error: code = PermissionDenied desc = Cloud Logging API has not been used in project 284177885636 before or it is disabled. Enable it by visiting https://console.develope>
May 11 11:10:33 sle-15-jpag-test GCEGuestAgent[1147]: 2021-05-11T11:10:33.5435Z GCEGuestAgent Info: Updating keys for user suse_gce.
May 11 11:10:34 sle-15-jpag-test google_guest_agent[1147]: 2021/05/11 11:10:34 logging client: rpc error: code = PermissionDenied desc = Cloud Logging API has not been used in project 284177885636 before or it is disabled. Enable it by visiting https://console.develope>
May 11 11:13:27 sle-15-jpag-test systemd[1]: Stopping Google Compute Engine Guest Agent...
May 11 11:13:27 sle-15-jpag-test GCEGuestAgent[1147]: 2021-05-11T11:13:27.9713Z GCEGuestAgent Info: GCE Agent Stopped
May 11 11:13:27 sle-15-jpag-test systemd[1]: Stopped Google Compute Engine Guest Agent.
May 11 11:13:27 sle-15-jpag-test systemd[1]: Started Google Compute Engine Guest Agent

The code should probably check what the SSH service is called before trying to start it. That way, a non-nil err would reflect a real failure to start the service rather than an attempt to start ssh on a system where the service is called sshd.
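One way to picture that suggestion is to resolve the service name first and start only the match. The sketch below is a hypothetical illustration: `firstExistingService` and the injected `exists` probe are made-up names, and the probe stands in for whatever query to the service manager the agent would use.

```go
package main

import (
	"errors"
	"fmt"
)

// firstExistingService returns the first candidate for which exists reports
// true. exists is a hypothetical probe of the service manager; it is
// injected so that this sketch runs on any machine.
func firstExistingService(candidates []string, exists func(string) bool) (string, error) {
	for _, name := range candidates {
		if exists(name) {
			return name, nil
		}
	}
	return "", errors.New("no matching service found")
}

func main() {
	// Simulate an openSUSE/SLE system where only sshd is installed.
	installed := map[string]bool{"sshd": true}
	svc, err := firstExistingService(
		[]string{"ssh", "sshd"},
		func(name string) bool { return installed[name] },
	)
	fmt.Println(svc, err)
}
```

With the name resolved up front, an error from the subsequent start call can be logged without the misleading "Unit ssh.service not found" noise.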

Restart Agent when SystemD Network unit is restarted

Environment

OS: Ubuntu 20.04 LTS
Kernel: 5.4.0-1037-gcp #40-Ubuntu SMP Fri Feb 5 11:57:53 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
SystemD version: 245.4-4ubuntu3.5
Google Guest Agent version: 20201217.02-0ubuntu1~20.04.0

Problem

When Ubuntu releases a security upgrade for the systemd package, the systemd-networkd service is restarted. This can leave a GCP instance unable to serve traffic.

When the systemd-networkd.service unit is restarted, the operating system's local routing table is wiped. This causes the local host routes for Google Cloud regional TCP Load Balancers to disappear and produces the following behavior:

  • The health checks, which originate from the TCP LB service IP, start failing because the node no longer has a host route for it.
  • With all instances in a failed state, the TCP LB enters an always-open state. Traffic directed to the TCP LB service IP is dropped by the instances (they never answer the TCP SYN packet) because the host route is missing.

The workaround for this issue is to restart the google-guest-agent.service SystemD unit, so that host routes are added back and both health checks and traffic start working again.

Reproduction steps

  1. Create a TCP regional LB in a given region (does not matter if the public IP is static or ephemeral)
  2. Configure a GCP instance in the same region as a backend instance. Configure a basic TCP health check on a TCP port that is wide open
  3. Configure a frontend listener on port 80 using an ephemeral IP
  4. Wait for it to be created
  5. SSH to the instance and verify that TCP LB ephemeral IP is listed as host route in the output of ip ro list table local
  6. Restart systemd-networkd using systemctl restart systemd-networkd
  7. Check the local route table again and verify the route is no longer there.

At this point, the route won't be re-added. You need to restart the google-guest-agent.service SystemD unit for the routes to be re-added.

Solution

The systemd-networkd.service unit is not listed as part of the PartOf directive in the Google Guest Agent service unit configuration. See https://github.com/GoogleCloudPlatform/guest-agent/blob/master/google-guest-agent.service#L7

There is an entry in PartOf for networking.service, but that unit is managed by the ifupdown package. In this use case, the network is managed by systemd-networkd instead, so the google-guest-agent.service configuration needs to account for it as well.
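Until the shipped unit file changes, a local drop-in could express the same dependency. The fragment below is an untested sketch; the drop-in path and file name are assumptions. PartOf is a standard systemd [Unit] directive that propagates stop and restart actions from the listed unit to this one.

```
# /etc/systemd/system/google-guest-agent.service.d/override.conf
[Unit]
PartOf=systemd-networkd.service
```

After adding the file, systemctl daemon-reload is needed for systemd to pick it up.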

Google Guest Agent Throws SSH Key Error Even Though User Has Access to Google Cloud VM

Steps to Reproduce (Tested on Ubuntu 20.04.02)

  1. Create a Google Cloud Engine VM with base Ubuntu 20.04
  2. The vm boots successfully
  3. Cloud-Init image completes (applied via the metadata section of Google Engine using terraform)
  4. Project-wide SSH keys added to VM (applied via the metadata section of Google Engine using terraform)
  5. The package google-guest-agent.service throws an error (see log below) saying "Invalid ssh key entry - unrecognized format" even though the said user can access the VM via SSH.

The Problem
These VM instances are then used to produce a standard image template, so this error is reproduced across our entire VM estate, as each VM is created from this image.

It doesn't seem to prevent our project-wide SSH keys from establishing an SSH session to any VM in the Google Cloud project, but it does mean that this error is seen everywhere, as it is propagated by the VM image used to generate all of our VMs.

The package version this error has been seen on is:
google-guest-agent_20220622.00-0ubuntu2~20.04.0_amd64.deb

Dec  8 09:51:14 instance-1 systemd[1]: Condition check resulted in Bluetooth service being skipped.
Dec  8 09:51:14 instance-1 rtkit-daemon[1944]: Supervising 0 threads of 0 processes of 1 users.
Dec  8 09:51:14 instance-1 rtkit-daemon[1944]: message repeated 4 times: [ Supervising 0 threads of 0 processes of 1 users.]
Dec  8 09:51:14 instance-1 systemd[2036]: Started D-Bus User Message Bus.
Dec  8 09:51:14 instance-1 dbus-daemon[2060]: [session uid=1001 pid=2060] AppArmor D-Bus mediation is enabled
Dec  8 09:51:14 instance-1 systemd[2036]: Started Sound Service.
Dec  8 09:51:14 instance-1 systemd[2036]: Reached target Main User Target.
Dec  8 09:51:14 instance-1 systemd[2036]: Startup finished in 197ms.
Dec  8 09:52:01 instance-1 google_guest_agent[887]: ERROR non_windows_accounts.go:199 Invalid ssh key entry - unrecognized format: ssh-rsa <hidden_for_security_reasons>= [email protected]

panic: syscall: string with NUL passed to StringToUTF16

Trying to run a PowerShell startup script on my machine and getting this error from the agent as it's running:

panic: syscall: string with NUL passed to StringToUTF16

goroutine 1 [running]:
syscall.StringToUTF16(...)
        /tmp/go/src/syscall/syscall_windows.go:30
syscall.StringToUTF16Ptr(0xc0003657c0, 0x9a, 0x50)
        /tmp/go/src/syscall/syscall_windows.go:65 +0x8b
golang.org/x/sys/windows/svc/eventlog.(*Log).report(0xc00001e0f0, 0x37200000004, 0xc0003657c0, 0x9a, 0x1, 0xc0003fefa0)
        /usr/share/gocode/pkg/mod/golang.org/x/[email protected]/windows/svc/eventlog/log.go:50 +0x40
golang.org/x/sys/windows/svc/eventlog.(*Log).Info(...)
        /usr/share/gocode/pkg/mod/golang.org/x/[email protected]/windows/svc/eventlog/log.go:57
github.com/GoogleCloudPlatform/guest-logging-go/logger.local(0xc000411c00, 0x72, 0x0, 0x3, 0x1, 0xc0003fefa0, 0xc00041b640, 0x19)
        /usr/share/gocode/pkg/mod/github.com/!google!cloud!platform/[email protected]/logger/logger_windows.go:53 +0x11e
github.com/GoogleCloudPlatform/guest-logging-go/logger.Log(0xc000411c00, 0x72, 0x0, 0x3, 0x1, 0xc0003fefa0, 0xc00041b640, 0x19)
        /usr/share/gocode/pkg/mod/github.com/!google!cloud!platform/[email protected]/logger/logger.go:112 +0xdc
main.runCmd(0xc0002c91e0, 0xc00005d260, 0x1a, 0x0, 0x0)
        /guest-agent/google_metadata_script_runner/main.go:321 +0x2a6
main.runScript(0xb85640, 0xc00005e078, 0xc00005d260, 0x1a, 0xc0003382c0, 0x14e, 0x0, 0x0)
        /guest-agent/google_metadata_script_runner/main.go:301 +0x6ff
main.main()
        /guest-agent/google_metadata_script_runner/main.go:473 +0x516
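The trace shows the panic coming from syscall.StringToUTF16, which panics when its input contains a NUL byte, while script output was being written to the Windows event log. One possible mitigation, shown here only as a sketch and not as the project's actual fix, is to strip NUL bytes before logging; `stripNUL` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// stripNUL removes NUL bytes so that the message can be converted to UTF-16
// by Windows syscall helpers that reject embedded NULs.
func stripNUL(msg string) string {
	return strings.ReplaceAll(msg, "\x00", "")
}

func main() {
	// Script output containing a NUL byte, as might be produced by a
	// PowerShell script that emits UTF-16 text.
	out := "script output\x00 with NUL"
	fmt.Printf("%q\n", stripNUL(out))
}
```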

CentOS 8 - google-guest-agent.service fails to start

I'm working with CentOS-8 Images on GCP (projects/centos-cloud/global/images/centos-8-v20200316). When booting with this image, the google-guest-agent.service fails to start and returns the following error :

Mar 23 19:02:26 centos-8-base-image-node systemd[1]: google-guest-agent.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Mar 23 19:02:26 centos-8-base-image-node systemd[1]: google-guest-agent.service: Failed with result 'exit-code'.
Mar 23 19:02:26 centos-8-base-image-node systemd[1]: google-guest-agent.service: Service RestartSec=100ms expired, scheduling restart.
Mar 23 19:02:26 centos-8-base-image-node systemd[1]: google-guest-agent.service: Scheduled restart job, restart counter is at 5.
Mar 23 19:02:26 centos-8-base-image-node systemd[1]: Stopped Google Compute Engine Guest Agent.
Mar 23 19:02:26 centos-8-base-image-node systemd[1]: google-guest-agent.service: Start request repeated too quickly.
Mar 23 19:02:26 centos-8-base-image-node systemd[1]: google-guest-agent.service: Failed with result 'exit-code'.

For reference, the instance that generated this error is created with Hashicorp's Packer (v1.5.4) and executes a startup script that runs yum install on a few packages.

Building google-guest-agent on openSUSE/SLE fails with "undefined: errors.Is"

I'm currently working on packaging guest-agent for openSUSE/SLE [1]. While google_metadata_script_runner builds fine, google-guest-agent fails with the following error:

[    5s] + pushd google_guest_agent
[    5s] ~/rpmbuild/BUILD/guest-agent-20200630.00+git20200630.5b11764/google_guest_agent ~/rpmbuild/BUILD/guest-agent-20200630.00+git20200630.5b11764
[    5s] + CGO_ENABLED=0
[    5s] + go build '-ldflags=-s -w -X main.version=20200630.00+git20200630.5b11764' -mod=vendor
[    5s] # github.com/GoogleCloudPlatform/guest-agent/google_guest_agent
[    5s] ./main.go:256:25: undefined: errors.Is

FWIW, I'm building with -mod=vendor as we don't allow accessing the internet during build time (as is standard for most Linux distributions).

From the error, it looks like the errors package available in our build environment does not support Is() yet.

CC @rjschwei

[1] https://build.opensuse.org/package/show/home:glaubitz:branches:Cloud:Tools/google-guest-agent

CentOS - google-startup-scripts.service start Before=apt-daily.service

I've noticed that on CentOS 7.9 the systemd unit file for google-startup-scripts specifies Before=apt-daily.service, a unit which does not exist there.

[Unit]
Description=Google Compute Engine Startup Scripts
Wants=network-online.target rsyslog.service
After=network-online.target rsyslog.service google-guest-agent.service
Before=apt-daily.service

[Service]
Type=oneshot
ExecStart=/usr/bin/google_metadata_script_runner startup
#TimeoutStartSec is ignored for Type=oneshot service units.
KillMode=process

[Install]
WantedBy=multi-user.target

Don't create /etc/sudoers.d/google_sudoers unless OS Login is enabled

In oslogin.go#L79 we correctly guard the call to accountsMgr.set() on OS Login being enabled; however, on main.go#L118 we don't do this check first, resulting in /etc/sudoers.d/google_sudoers being created even when the user does not use OS Login.

This creates challenges for customers who use Puppet to manage the /etc/sudoers.d directory. Puppet deletes this file and then the agent recreates it causing churn.

Issue with 2 guest agents

Hello,

I noticed that when my VM boots, it will randomly choose between two guest environments. In the logs, I see "GCE Agent Started" with either (version 20210908.1) or (version 20200610.00). The host key changes along with the agent version. The issue is that I am only able to access my conda environments and other information on one of these versions. Is there any way I can control or choose which agent version is used on startup?

Thank you in advance!

Best,
Dan

oslogin /etc/nsswitch.conf changes conflict with RHEL 8 authselect.

Adding cache_oslogin and oslogin to /etc/nsswitch.conf causes issues with authselect on RHEL 8. authselect gives this message when attempting to apply changes:

$ sudo authselect apply-changes
[error] [/etc/authselect/nsswitch.conf] has unexpected content!
[error] Unexpected changes to the configuration were detected.
[error] Refusing to activate profile unless those changes are removed or overwrite is requested.
Some unexpected changes to the configuration were detected. Use 'select' command instead.

This can be worked around by creating a custom authselect profile and configuring its nsswitch to already have " cache_oslogin oslogin" on the passwd and group lines.

How to prevent the guest-agent from deleting a user created by Packer?

Hi,
I'm trying to build an image using Packer. I noticed that guest-agent removed my user:

Mar 6 10:59:20 workstation-test-6-03-2023-3 google_guest_agent[544]: Removing user packer.

How can I prevent that?
According to the README, 'User accounts not managed by Google are not touched by the accounts daemon.' I tried both 'packer' and 'ubuntu' on a Debian image, with the same behavior.

TIA,
Vitaly

Improper parsing of /etc/passwd when username is suffix of another user

Steps to reproduce:

  1. Create a user testagent with SSH keys in project/instance metadata.
  2. Wait for agent to create user & provision.
  3. Create a user agent with SSH keys in project/instance metadata.
  4. Observe keys for user agent written into /home/testagent/.ssh/authorized_keys

This occurs because the code for getPasswd only checks that the entry in /etc/passwd contains the username followed by :. Of course, it only occurs if the longer username is first in /etc/passwd and the shorter username is 2nd in the project/instance metadata.

I'll send a PR with a fix shortly.

google-guest-agent breaks down without ipv6 enabled

Summary:

This looks very similar to #54, which is supposedly resolved, but it is still happening on google-guest-agent version 20221109.00. The only difference is that IPv6 is disabled in GRUB by adding the ipv6.disable=1 option to GRUB_CMDLINE_LINUX in /etc/default/grub.

Issue details:

google-guest-agent service starts, but does not perform any functions, such as setting up ssh keys from metadata and others. The following error is seen in journalctl:

google_guest_agent[6544]: GCE Agent Started (version 20221109.00)
google_guest_agent[6544]: Enabling OS Login
google_guest_agent[6544]: ERROR addresses.go:301 Error configuring IPv6: Internet Systems Consortium DHCP Client 4.2.5
google_guest_agent[6544]: Copyright 2004-2013 Internet Systems Consortium.
google_guest_agent[6544]: All rights reserved.
google_guest_agent[6544]: For info, please visit https://www.isc.org/software/dhcp/
google_guest_agent[6544]: no link-local IPv6 address for eth0
google_guest_agent[6544]: This version of ISC DHCP is based on the release available
google_guest_agent[6544]: on ftp.isc.org.  Features have been added and other changes
google_guest_agent[6544]: have been made to the base software release in order to make
google_guest_agent[6544]: it work better with this distribution.
google_guest_agent[6544]: Please report for this software via the CentOS Bugs Database:
google_guest_agent[6544]: http://bugs.centos.org/
google_guest_agent[6544]: exiting.

Instance details:

  • OS: CentOS 7
  • Custom image created by using projects/centos-cloud/global/images/centos-7-v20230306 public image and applying CIS hardening. No google-guest-agent related configuration changed.

Disabling ipv6 (if it's not used) to reduce attack surface is a common security practice, part of CIS Benchmarks, etc. Therefore I think google-guest-agent shouldn't strictly depend on ipv6 until it's more widely adopted.

debian packaging not updated

The Debian changelog is still stuck at version 20191204.00, but upstream is at 20210524.00.

The other, more pressing problem is that it no longer builds with Debian testing:

make[1]: Entering directory '/home/dev/google-guest-agent'
dh_auto_build -O--buildsystem=golang -- -ldflags="-s -w -X main.version="
go: cloud.google.com/[email protected]: module lookup disabled by GOPROXY=off
	cd obj-x86_64-linux-gnu && go install -trimpath -v -p 16 "-ldflags=-s -w -X main.version="
go: cloud.google.com/[email protected]: module lookup disabled by GOPROXY=off
dh_auto_build: error: cd obj-x86_64-linux-gnu && go install -trimpath -v -p 16 "-ldflags=-s -w -X main.version=" returned exit code 1
make[1]: *** [debian/rules:25: override_dh_auto_build] Error 25
make[1]: Leaving directory '/home/dev/google-guest-agent'
make: *** [debian/rules:13: build] Error 2
dpkg-buildpackage: error: debian/rules build subprocess returned exit status 2
debuild: fatal error at line 1182:
dpkg-buildpackage -us -uc -ui failed

Local route not set on MIG instances with ipv6 disabled behind TCP ILB

While working on a case with GCP Support, we found that the following setup causes guest-agent not to set the local route to the ILB IP address:

  • 1 http healthcheck on port 80
  • 1 instance template
  • 1 MIG with 3 instances
  • 1 TCP ILB
  • 1 BackendService

Instances are debian-9 based, with a custom image built on top of debian-cloud images.
We use Packer to build images and to set up instance middleware and parameters.

By default, we disable IPv6 with sysctl flags:

  • net.ipv6.conf.all.disable_ipv6=1
  • net.ipv6.conf.default.disable_ipv6=1

When the instance boots and launches guest-agent, we get this error:

GCEGuestAgent Error main.go:119: error running &main.addressMgr{} manager: Internet Systems Consortium
DHCP Client 4.3.5#012Copyright 2004-2016 Internet Systems Consortium.#012All rights reserved.#012For info, please visit https://www.isc.org/software/dhcp/#012#012no link-local IPv6 address for eth0#012#012If you think you have received this message due to a bug rather#012than a configuration issue please read the section on submitting#012bugs on either our web page at www.isc.org or in the README file#012before submitting a bug.  These pages explain the proper#012process and the information we find helpful for debugging..#012#012exiting.

That happens right before setting up the local route to the ILB IP address.
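A possible mitigation, sketched below in Go under the assumption that the setup work can be split into independent steps, is to record a failing step and continue so that later steps such as the local route still run. The step type, runAll and the step names are hypothetical and do not reflect how addressMgr is actually structured.

```go
package main

import (
	"errors"
	"fmt"
)

// step is one independent piece of network setup work. (Hypothetical type.)
type step struct {
	name string
	run  func() error
}

// runAll runs every step in order. A failing step is recorded and skipped
// rather than aborting the steps that follow it.
func runAll(steps []step) (done []string, failed []string) {
	for _, s := range steps {
		if err := s.run(); err != nil {
			failed = append(failed, fmt.Sprintf("%s: %v", s.name, err))
			continue
		}
		done = append(done, s.name)
	}
	return done, failed
}

// demoSteps simulates the situation from the report: the IPv6 step fails,
// the local route step would succeed.
func demoSteps() []step {
	return []step{
		{name: "ipv6-dhcp", run: func() error {
			return errors.New("no link-local IPv6 address for eth0")
		}},
		{name: "ilb-local-route", run: func() error { return nil }},
	}
}

func main() {
	done, failed := runAll(demoSteps())
	fmt.Println("completed:", done)
	fmt.Println("failed:   ", failed)
}
```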

Thanks in advance.

Guest agent creates the same user accounts using different UIDs across multi-VM deployment

Steps to reproduce

  • Create more than one project member
  • Deploy a Linux-based multi-VM environment using the same VM image (for example CentOS 7.9)
  • Check list of OS users

Expected results
The same users have the same UIDs across the multi-VM deployment

Actual results
All project users were created with different UIDs:

$ for i in host1 host2 host3 host4; do ssh $i 'hostname; getent passwd | tail -3'; done

host1
user8:x:1001:1002::/home/user8:/bin/bash
user1:x:1003:1004::/home/user1:/bin/bash
user9:x:1008:1009::/home/user9:/bin/bash

host2
user8:x:1002:1003::/home/user8:/bin/bash
user1:x:1004:1005::/home/user1:/bin/bash
user9:x:1008:1009::/home/user9:/bin/bash

host3
user9:x:1002:1003::/home/user9:/bin/bash
user1:x:1006:1007::/home/user1:/bin/bash
user8:x:1007:1008::/home/user8:/bin/bash

host4
user9:x:1004:1005::/home/user9:/bin/bash
user8:x:1005:1006::/home/user8:/bin/bash
user1:x:1006:1007::/home/user1:/bin/bash

Note: The previous Python-based implementation worked as expected and all users had the same UIDs.
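For illustration, one way to get identical UIDs on every VM without any coordination is to derive the UID from the username. This Go sketch is an assumption about a possible approach, not a description of the current agent or of the previous Python implementation; stableUID and the UID range are made up for the example, and collisions between usernames would still need handling.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const (
	uidBase  = 100000 // assumed start of the managed UID range
	uidRange = 900000 // assumed size of the managed UID range
)

// stableUID maps a username to a UID that depends only on the name, so
// every VM computes the same value regardless of account creation order.
func stableUID(username string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(username))
	return uidBase + h.Sum32()%uidRange
}

func main() {
	for _, u := range []string{"user1", "user8", "user9"} {
		fmt.Printf("%s -> %d\n", u, stableUID(u))
	}
}
```

The trade-off is that sequential allocation never collides but depends on creation order, while a hash is order-independent but can map two names to the same UID.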

google-guest-agent.service interferes with policy routing

Overview

A centos-cloud/centos8 multi-NIC VM instance acts as an IP router between two VPC networks, with policy routing configured as below. When systemctl restart google-guest-agent.service runs, policy routing is reconfigured in a way that breaks IP routing.

Expected behavior

I expected DHCP and Google Guest Agent to limit route table modifications to the main table and to leave the user-defined tables 10 and 11 below untouched:

[jeff@multinic-a-x9w3 ~]$ sudo ip route list table main
default via 10.33.0.1 dev eth0 proto dhcp metric 100
10.33.0.0/20 via 10.33.0.1 dev eth0 proto dhcp metric 100
10.33.0.1 dev eth0 proto dhcp scope link metric 100
10.37.0.0/20 via 10.37.0.1 dev eth1 proto static
10.37.0.0/20 via 10.37.0.1 dev eth1 proto dhcp metric 101
10.37.0.1 dev eth1 scope link
10.37.0.1 dev eth1 proto dhcp scope link metric 101
[jeff@multinic-a-x9w3 ~]$ sudo ip route list table 10
default via 10.33.0.1 dev eth0
10.33.0.1 dev eth0 scope link
10.37.0.1 dev eth1 scope link
[jeff@multinic-a-x9w3 ~]$ sudo ip route list table 11
default via 10.37.0.1 dev eth1
10.33.0.1 dev eth0 scope link
10.37.0.1 dev eth1 scope link

Actual behavior:

systemctl restart google-guest-agent.service (or DHCP renew after 24 hours) breaks policy routing:

# sudo ip route show table 11 | tee before
default via 10.37.0.1 dev eth1
10.33.0.1 dev eth0 scope link
10.37.0.1 dev eth1 scope link

# sudo systemctl restart google-guest-agent.service

# sudo ip route show table 11 | tee after
10.33.0.1 dev eth0 scope link
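The lost routes can be listed by diffing the two captures. A small Go sketch, using the table 11 output above as input (missingRoutes and normalize are hypothetical helper names):

```go
package main

import (
	"fmt"
	"strings"
)

// normalize collapses runs of whitespace so that two route lines compare
// equal even if their spacing differs.
func normalize(line string) string {
	return strings.Join(strings.Fields(line), " ")
}

// missingRoutes returns the routes present in before but absent from
// after, in the order they appeared in before.
func missingRoutes(before, after string) []string {
	kept := make(map[string]bool)
	for _, line := range strings.Split(after, "\n") {
		if l := normalize(line); l != "" {
			kept[l] = true
		}
	}
	var missing []string
	for _, line := range strings.Split(before, "\n") {
		if l := normalize(line); l != "" && !kept[l] {
			missing = append(missing, l)
		}
	}
	return missing
}

// Contents of the "before" and "after" captures of table 11.
const beforeTable11 = `default via 10.37.0.1 dev eth1
10.33.0.1 dev eth0 scope link
10.37.0.1 dev eth1 scope link`

const afterTable11 = `10.33.0.1 dev eth0 scope link`

func main() {
	for _, r := range missingRoutes(beforeTable11, afterTable11) {
		fmt.Println("lost:", r)
	}
}
```

For these captures the sketch reports the default route and the 10.37.0.1 link route as lost, which is why traffic that should leave through eth1 no longer has a path.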

Steps to reproduce:

Set up policy routing using a script similar to the following. Note that I'm happy to change this script to be compatible with google-guest-agent.service if there's a way to do so. The intent of the script is to send traffic that ingresses on nic0 out of nic1 and vice versa, while also supporting ILB health checks coming into both interfaces.

#! /bin/bash
# These tables manage default routes based on policy.
if ! grep -qx '10 nic0' /etc/iproute2/rt_tables; then
  echo "10 nic0" >> /etc/iproute2/rt_tables
fi
if ! grep -qx '11 nic1' /etc/iproute2/rt_tables; then
  echo "11 nic1" >> /etc/iproute2/rt_tables
fi

## These are essentially the same tables, just different default gateways.
# Traffic addresses attached to the nic0 primary interface
ip route add default via "10.33.0.1" dev eth0 table nic0
ip route add "10.33.0.1" dev eth0 scope link table nic0
ip route add "10.37.0.1" dev eth1 scope link table nic0
# Traffic addresses attached to the nic1 primary interface
ip route add default via "10.37.0.1" dev eth1 table nic1
ip route add "10.33.0.1" dev eth0 scope link table nic1
ip route add "10.37.0.1" dev eth1 scope link table nic1

# NOTE: These route rules are not cleared by dhclient, they persist.
ip rule add from "10.33.0.54" table nic0
ip rule add from "10.37.0.54" table nic1
# ILB IP addresses are expected to be in the nic's subnet.
ip rule add from "10.33.0.54/20" table nic0
ip rule add from "10.37.0.54/20" table nic1
# Firewall marking
iptables -A PREROUTING -i eth0 -t mangle -j MARK --set-mark 1
iptables -A PREROUTING -i eth1 -t mangle -j MARK --set-mark 2
# Packets ingress nic0 egress nic1
ip rule add fwmark 1 table nic1
# Packets ingress nic1 egress nic0
ip rule add fwmark 2 table nic0
# Netblocks via VPC default gateways
ip route flush cache
ip rule

Related: GoogleCloudPlatform/compute-image-packages#475

My own efforts to troubleshoot this are at:

Process output longer than 64k swallows all remaining logging

This is basically the same feedback as #149, which I don't have permission to reopen.

Our customer just hit this problem. A process logging a line longer than 64k (the default buffer size for bufio.NewScanner) will swallow any output up to that point and after. Our process is a database that under some logging configurations logs query strings, which can be arbitrarily long. This makes it difficult to debug crashes in particular since any panic output or similar will not be logged.

Asking users to change the logging behavior of binaries they don't own is not a feasible workaround. Making the buffer size configurable via parameters / environment variables would be. Better yet: have fallback behavior when the scanner buffer overflows, rather than refusing to log output.
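To make the fallback idea concrete, here is a minimal Go sketch that reads through a fixed-size bufio.Reader and passes an overlong line on in fragments instead of giving up. This is an illustration of one possible approach, not the script runner's code; readFragments and demoFragments are hypothetical names.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// readFragments reads r line by line through a fixed-size buffer. A line
// longer than bufSize is returned as several consecutive fragments instead
// of aborting the read, so later lines are still seen.
func readFragments(r io.Reader, bufSize int) ([]string, error) {
	br := bufio.NewReaderSize(r, bufSize)
	var out []string
	for {
		frag, _, err := br.ReadLine()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		// Copy the bytes: frag is only valid until the next read.
		out = append(out, string(frag))
	}
}

// demoFragments feeds a 100000-byte line, followed by a short final line,
// through a 64 KiB buffer.
func demoFragments() ([]string, error) {
	input := "started\n" + strings.Repeat("q", 100000) + "\npanic: boom\n"
	return readFragments(strings.NewReader(input), 64*1024)
}

func main() {
	frags, err := demoFragments()
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	for _, f := range frags {
		fmt.Println("fragment length:", len(f))
	}
	fmt.Println("last fragment:", frags[len(frags)-1])
}
```

The long line is split across log entries, but the output that follows it, such as a panic message, still reaches the log.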

google_metadata_script_runner runs temporary script with bash -c, fails on noexec /tmp

By default, the scripts appear to be copied to a temporary directory under /tmp and then executed via bash -c (see line 299 of google_metadata_script_runner/main.go):

                        c = exec.Command(config.Section("MetadataScripts").Key("default_shell").MustString("/bin/bash"), "-c", tmpFile)

However, if /tmp is on a noexec file system (which is not uncommon in locked-down environments), this fails with an error like

google_metadata_script_runner[523]: startup-script: /bin/bash: line 1: /tmp/metadata-scripts433978404/startup-script: Permission denied

Shouldn't it be possible to just let bash run the script directly, i.e. remove the "-c"?

It is true that one can change the run_dir directory in the config to point to another directory as a workaround, but maybe the above suggestion could be applied to make it work in general?
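The difference between the two invocations can be sketched in Go as follows. scriptCommand and its direct flag are hypothetical; the only library call used is exec.Command, and nothing is executed here, the sketch only shows the argument vectors.

```go
package main

import (
	"fmt"
	"os/exec"
)

// scriptCommand builds the command used to run a metadata script.
// With direct=false the path is passed as a command string to "shell -c",
// so the shell executes the file itself, which needs an exec-capable mount.
// With direct=true the shell reads the file as a script, which only needs
// read access to it.
func scriptCommand(shell, scriptPath string, direct bool) *exec.Cmd {
	if direct {
		return exec.Command(shell, scriptPath)
	}
	return exec.Command(shell, "-c", scriptPath)
}

func main() {
	script := "/tmp/metadata-scripts433978404/startup-script"
	fmt.Println(scriptCommand("/bin/bash", script, false).Args)
	fmt.Println(scriptCommand("/bin/bash", script, true).Args)
}
```

One caveat with the direct form: bash then interprets the file itself, so a script whose shebang names another interpreter would no longer run under that interpreter.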

google-guest-agent.service goes to dead (inactive) when the VM is built with Packer (image) and created with MIGs.

Environment

OS: Ubuntu 20.04 LTS
Kernel: 5.11.0-1020-gcp #22~20.04.1-Ubuntu SMP Tue Sep 21 10:54:26 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
SystemD version: systemd 245 (245.4-4ubuntu3.13)
Google Guest Agent version: 20210629.00-0ubuntu1~20.04.0

Problem

We use Packer to create images and launch them with MIGs using templates. After Oct 04, 2021 we noticed that images built by Packer and released with MIGs do not start the google-guest-agent service, which prevents the use of OS Login to connect to the virtual machines. With an image created on Sep 17, 2021, this behavior does not occur.

The behavior only occurs on the first startup by MIGs. If the affected MIG virtual machine is manually restarted (shutdown -r now), the google-guest-agent service is activated and it is possible to connect to the virtual machine using OS Login after the next boot.

Details from debugging to find the root cause

The image provisioning process by Packer uses Ansible and follows this order:

  • Packer creates a VM from Ubuntu 20.04 LTS (ubuntu-os-cloud/ubuntu-minimal-2004-lts);
  • Ansible applies the OS update and restarts the VM, if necessary;
  • Ansible waits for the VM to become available, if necessary;
  • Ansible provisions other services (like nginx);
  • Ansible shuts down the VM and waits for it to be shut down;
  • Packer creates the image that will be used by the MIG templates.

The unit google-guest-agent.service goes to the dead (inactive) state after the first reboot in the Packer/Ansible build process and on the first boot by the MIG.

Logs before the virtual machine created by MIG is restarted

systemctl status google-guest-agent.service

root@xxx:~# systemctl status google-guest-agent
● google-guest-agent.service - Google Compute Engine Guest Agent
     Loaded: loaded (/lib/systemd/system/google-guest-agent.service; enabled; vendor preset: enabled)
     Active: inactive (dead)

systemd-analyze verify google-guest-agent.service

root@xxx:~# systemd-analyze verify google-guest-agent.service
snap-snapd-13170.mount: Unit is bound to inactive unit dev-loop1.device. Stopping, too.

systemd-analyze critical-chain google-guest-agent.service

root@xxx:~# systemd-analyze critical-chain google-guest-agent.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

└─rsyslog.service @7.388s +111ms
  └─basic.target @7.169s
    └─sockets.target @7.157s
      └─snapd.socket @6.746s +268ms
        └─sysinit.target @6.366s
          └─cloud-init.service @4.478s +1.848s
            └─systemd-networkd-wait-online.service @3.374s +1.093s
              └─systemd-networkd.service @5.479s +65ms
                └─network-pre.target @3.293s
                  └─cloud-init-local.service @2.022s +1.258s
                    └─systemd-udev-trigger.service @858ms +202ms
                      └─systemd-udevd-kernel.socket @761ms
                        └─system.slice @505ms
                          └─-.slice @505ms

google-guest-agent.service logs while the Packer/Ansible build process is running

Note: after the VM is created by the MIG, there are no more logs from google-guest-agent.service until the service or VM is manually restarted.

root@xxx:~# journalctl -xe --no-pager -u google-guest-agent.service
-- Logs begin at Wed 2021-10-06 20:51:59 UTC, end at Wed 2021-10-06 21:08:56 UTC. --
Oct 06 20:52:08 xxx systemd[1]: Started Google Compute Engine Guest Agent.
-- Subject: A start job for unit google-guest-agent.service has finished successfully
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A start job for unit google-guest-agent.service has finished successfully.
--
-- The job identifier is 111.
Oct 06 20:52:08 xxx GCEGuestAgent[572]: 2021-10-06T20:52:08.8406Z GCEGuestAgent Info: GCE Agent Started (version 20210414.00-0ubuntu1~20.04.0)
Oct 06 20:52:09 xxx GCEGuestAgent[572]: 2021-10-06T20:52:09.1992Z GCEGuestAgent Info: Instance ID changed, running first-boot actions
Oct 06 20:52:09 xxx dhclient[662]: Internet Systems Consortium DHCP Client 4.4.1
Oct 06 20:52:09 xxx dhclient[662]: Copyright 2004-2018 Internet Systems Consortium.
Oct 06 20:52:09 xxx dhclient[662]: All rights reserved.
Oct 06 20:52:09 xxx dhclient[662]: For info, please visit https://www.isc.org/software/dhcp/
Oct 06 20:52:09 xxx dhclient[662]:
Oct 06 20:52:09 xxx dhclient[662]: Listening on Socket/ens4
Oct 06 20:52:09 xxx dhclient[662]: Sending on   Socket/ens4
Oct 06 20:52:09 xxx dhclient[662]: Created duid "\000\001\000\001(\360\310\371B\001\012\335\012\016".
Oct 06 20:52:09 xxx google_guest_agent[572]: 2021/10/06 20:52:09 logging client: rpc error: code = PermissionDenied desc = Request had insufficient authentication scopes.
Oct 06 20:52:10 xxx groupadd[658]: group added to /etc/group: name=google-sudoers, GID=1001
Oct 06 20:52:10 xxx groupadd[658]: group added to /etc/gshadow: name=google-sudoers
Oct 06 20:52:10 xxx groupadd[658]: new group: name=google-sudoers, GID=1001
Oct 06 20:52:10 xxx GCEGuestAgent[572]: 2021-10-06T20:52:10.7259Z GCEGuestAgent Info: Created google sudoers file
Oct 06 20:52:10 xxx GCEGuestAgent[572]: 2021-10-06T20:52:10.7262Z GCEGuestAgent Info: Adding existing user root to google-sudoers group.
Oct 06 20:52:10 xxx gpasswd[680]: user root added by root to group google-sudoers
Oct 06 20:52:10 xxx GCEGuestAgent[572]: 2021-10-06T20:52:10.7489Z GCEGuestAgent Info: Updating keys for user root.
Oct 06 20:52:11 xxx google_guest_agent[572]: 2021/10/06 20:52:11 logging client: rpc error: code = PermissionDenied desc = Request had insufficient authentication scopes.
Oct 06 20:53:17 xxx GCEGuestAgent[572]: 2021-10-06T20:53:17.5595Z GCEGuestAgent Info: GCE Agent Stopped
Oct 06 20:53:17 xxx systemd[1]: Stopping Google Compute Engine Guest Agent...
-- Subject: A stop job for unit google-guest-agent.service has begun execution
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A stop job for unit google-guest-agent.service has begun execution.
--
-- The job identifier is 1133.
Oct 06 20:53:17 xxx systemd[1]: google-guest-agent.service: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- The unit google-guest-agent.service has successfully entered the 'dead' state.
Oct 06 20:53:17 xxx systemd[1]: Stopped Google Compute Engine Guest Agent.
-- Subject: A stop job for unit google-guest-agent.service has finished
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A stop job for unit google-guest-agent.service has finished.
--
-- The job identifier is 1133 and the job result is done.
Oct 06 20:53:17 xxx systemd[1]: Starting Google Compute Engine Guest Agent...
-- Subject: A start job for unit google-guest-agent.service has begun execution
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A start job for unit google-guest-agent.service has begun execution.
--
-- The job identifier is 1133.
Oct 06 20:53:17 xxx GCEGuestAgent[2166]: 2021-10-06T20:53:17.5816Z GCEGuestAgent Info: GCE Agent Started (version 20210629.00-0ubuntu1~20.04.0)
Oct 06 20:53:17 xxx systemd[1]: Started Google Compute Engine Guest Agent.
-- Subject: A start job for unit google-guest-agent.service has finished successfully
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A start job for unit google-guest-agent.service has finished successfully.
--
-- The job identifier is 1133.
Oct 06 20:53:17 xxx GCEGuestAgent[2166]: 2021-10-06T20:53:17.7284Z GCEGuestAgent Info: Updating keys for user root.
Oct 06 20:53:17 xxx GCEGuestAgent[2166]: 2021-10-06T20:53:17.7472Z GCEGuestAgent Error oslogin.go:109: Error restarting service: Failed to try-restart nscd.service: Unit nscd.service not found.
Oct 06 20:53:17 xxx GCEGuestAgent[2166]: 2021-10-06T20:53:17.7716Z GCEGuestAgent Error oslogin.go:109: Error restarting service: Failed to try-restart unscd.service: Unit unscd.service not found.                                                                                      
Oct 06 20:53:17 xxx dhclient[2231]: Internet Systems Consortium DHCP Client 4.4.1
Oct 06 20:53:17 xxx dhclient[2231]: Copyright 2004-2018 Internet Systems Consortium.
Oct 06 20:53:17 xxx dhclient[2231]: All rights reserved.
Oct 06 20:53:17 xxx dhclient[2231]: For info, please visit https://www.isc.org/software/dhcp/
Oct 06 20:53:17 xxx dhclient[2231]:
Oct 06 20:53:17 xxx dhclient[2231]: Listening on Socket/ens4
Oct 06 20:53:17 xxx dhclient[2231]: Sending on   Socket/ens4
Oct 06 20:53:17 xxx GCEGuestAgent[2166]: 2021-10-06T20:53:17.9344Z GCEGuestAgent Error oslogin.go:109: Error restarting service: Failed to try-restart cron.service: Unit cron.service not found.                                                                                                 
Oct 06 20:53:17 xxx GCEGuestAgent[2166]: 2021-10-06T20:53:17.9421Z GCEGuestAgent Error oslogin.go:109: Error restarting service: Failed to try-restart crond.service: Unit crond.service not found.                                                                                                  
Oct 06 20:53:18 xxx google_guest_agent[2166]: 2021/10/06 20:53:18 logging client: rpc error: code = PermissionDenied desc = Request had insufficient authentication scopes.
Oct 06 20:58:06 xxx systemd[1]: Stopping Google Compute Engine Guest Agent...
-- Subject: A stop job for unit google-guest-agent.service has begun execution
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A stop job for unit google-guest-agent.service has begun execution.
--
-- The job identifier is 2265.
Oct 06 20:58:06 xxx systemd[1]: google-guest-agent.service: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- The unit google-guest-agent.service has successfully entered the 'dead' state.
Oct 06 20:58:06 xxx systemd[1]: Stopped Google Compute Engine Guest Agent.
-- Subject: A stop job for unit google-guest-agent.service has finished
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A stop job for unit google-guest-agent.service has finished.
--
-- The job identifier is 2265 and the job result is done.

systemd-analyze plot with the inactive (dead) state: systemd-analyze-plot-boot-problem.svg.gz

Logs after the virtual machine created by MIG is manually restarted (shutdown -r now).

systemctl status google-guest-agent.service

root@xxx:~# systemctl status google-guest-agent.service
● google-guest-agent.service - Google Compute Engine Guest Agent
     Loaded: loaded (/lib/systemd/system/google-guest-agent.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2021-10-06 21:22:59 UTC; 6min ago
   Main PID: 440 (google_guest_ag)
      Tasks: 9 (limit: 4403)
     Memory: 20.3M
     CGroup: /system.slice/google-guest-agent.service
             └─440 /usr/bin/google_guest_agent

Oct 06 21:22:59 xxx dhclient[580]: Listening on Socket/ens4
Oct 06 21:22:59 xxx dhclient[580]: Sending on   Socket/ens4
Oct 06 21:22:59 xxx GCEGuestAgent[440]: 2021-10-06T21:22:59.6917Z GCEGuestAgent Error oslogin.go:109: Error restarting service: Failed to try-restart nscd.service: Unit nscd.service not found.
                                                               .
Oct 06 21:22:59 xxx GCEGuestAgent[440]: 2021-10-06T21:22:59.7090Z GCEGuestAgent Error oslogin.go:109: Error restarting service: Failed to try-restart unscd.service: Unit unscd.service not found.
                                                               .
Oct 06 21:22:59 xxx GCEGuestAgent[440]: 2021-10-06T21:22:59.9446Z GCEGuestAgent Error oslogin.go:109: Error restarting service: Failed to try-restart crond.service: Unit crond.service not found.
                                                               .
Oct 06 21:23:00 xxx oslogin_cache_refresh[690]: Refreshing passwd entry cache
Oct 06 21:23:00 xxx oslogin_cache_refresh[690]: Refreshing group entry cache
Oct 06 21:23:00 xxx oslogin_cache_refresh[690]: Failure getting groups, quitting
Oct 06 21:23:00 xxx oslogin_cache_refresh[690]: Failed to get groups, not updating group cache file, removing /etc/oslogin_group.cache.bak.
Oct 06 21:23:00 xxx google_guest_agent[440]: 2021/10/06 21:23:00 logging client: rpc error: code = PermissionDenied desc = Cloud Logging API has not been used in project 407489596486 before or it is disabled. Enable it by visiting >

systemd-analyze verify google-guest-agent.service

root@xxx:~# systemd-analyze verify google-guest-agent.service
snap-snapd-13170.mount: Unit is bound to inactive unit dev-loop1.device. Stopping, too.
snap-core18-2128.mount: Unit is bound to inactive unit dev-loop2.device. Stopping, too.

systemd-analyze critical-chain google-guest-agent.service

root@xxx:~# systemd-analyze critical-chain google-guest-agent.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

google-guest-agent.service +2.203s
└─rsyslog.service @6.174s +104ms
  └─basic.target @6.079s
    └─sockets.target @6.067s
      └─snapd.socket @5.979s +63ms
        └─sysinit.target @5.823s
          └─cloud-init.service @5.028s +758ms
            └─systemd-networkd-wait-online.service @3.220s +1.797s
              └─systemd-networkd.service @3.156s +53ms
                └─network-pre.target @3.143s
                  └─cloud-init-local.service @1.893s +1.239s
                    └─systemd-udev-trigger.service @777ms +182ms
                      └─systemd-udevd-kernel.socket @659ms
                        └─system.slice @392ms
                          └─-.slice @392ms

systemd-analyze plot with the active (running) state: systemd-analyze-plot-boot-ok.svg.gz

google-guest-agent.service dependency graph

dependency-graph

Reproduction steps

  • Create an image with Packer using ubuntu-os-cloud/ubuntu-minimal-2004-lts
  • Add the image to a MIG template
  • Launch the image in the MIG
  • Try to connect with OS Login

Workaround

  • Create a /etc/rc.local script with the command /usr/bin/systemctl restart google-guest-agent.

I know this isn't the most elegant way to fix the problem.

I had the same behavior using version 20210414.00-0ubuntu1~20.04.0 of google-guest-agent.

I believe it is not an agent-related issue, but I don't know enough about this project to continue debugging the problem by myself.

Please let me know if there is any additional information I can provide that will be helpful.

Thanks,

google-startup-scripts service no longer waits for cloud-init/snapd.seeded

Filed in Ubuntu bug #1901033 as well.

We have an Ubuntu cloud-images qualification test for google startup scripts to ensure that cloud-init customizations are available before the user startup script is run. That test is failing and investigation shows that we have a functional regression from gce-compute-image-packages to the new google-guest-agent.

Without waiting for cloud-final and snapd.seeded we won't present a consistent system for users to run scripts that have archive mirrors set up, GCE's google-cloud-sdk snap installed, users in the proper groups, or other customizations owned by those services. We are not holding groovy release on this bug, but we would like this treated with priority and it is a blocker for use of the guest-agent package in older Ubuntu releases.

/usr/lib/systemd/system/google-startup-scripts.service in focal from gce-compute-image-packages:

[Unit]
Description=Google Compute Engine Startup Scripts
After=network-online.target network.target rsyslog.service
After=google-instance-setup.service google-network-daemon.service
After=cloud-final.service multi-user.target
Wants=cloud-final.service
After=snapd.seeded.service
Wants=snapd.seeded.service

/usr/lib/systemd/system/google-startup-scripts.service in groovy from google-guest-agent:

[Unit]
Description=Google Compute Engine Startup Scripts
Wants=network-online.target rsyslog.service
After=network-online.target rsyslog.service google-guest-agent.service
Before=apt-daily.service
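The ordering regression can be made explicit by comparing the After= entries of the two unit files. A minimal Go sketch, using the two [Unit] sections above as input (orderingDeps and droppedDeps are hypothetical helper names):

```go
package main

import (
	"fmt"
	"strings"
)

// orderingDeps collects every unit named on an After= line.
func orderingDeps(unit string) []string {
	var deps []string
	for _, line := range strings.Split(unit, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "After=") {
			deps = append(deps, strings.Fields(strings.TrimPrefix(line, "After="))...)
		}
	}
	return deps
}

// droppedDeps lists the After= dependencies of oldUnit that newUnit lacks.
func droppedDeps(oldUnit, newUnit string) []string {
	kept := make(map[string]bool)
	for _, d := range orderingDeps(newUnit) {
		kept[d] = true
	}
	var dropped []string
	for _, d := range orderingDeps(oldUnit) {
		if !kept[d] {
			dropped = append(dropped, d)
		}
	}
	return dropped
}

const focalUnit = `[Unit]
Description=Google Compute Engine Startup Scripts
After=network-online.target network.target rsyslog.service
After=google-instance-setup.service google-network-daemon.service
After=cloud-final.service multi-user.target
Wants=cloud-final.service
After=snapd.seeded.service
Wants=snapd.seeded.service`

const groovyUnit = `[Unit]
Description=Google Compute Engine Startup Scripts
Wants=network-online.target rsyslog.service
After=network-online.target rsyslog.service google-guest-agent.service
Before=apt-daily.service`

func main() {
	for _, d := range droppedDeps(focalUnit, groovyUnit) {
		fmt.Println("no longer ordered after:", d)
	}
}
```

Some of the dropped entries, such as google-instance-setup.service and google-network-daemon.service, are presumably superseded by google-guest-agent.service; cloud-final.service and snapd.seeded.service are the ones this report is about.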

Publish recent release packages

The last release published to https://launchpad.net/ubuntu/+source/google-guest-agent (for Focal) is 20220622. Six releases have been made since then, and we're affected by a bug [1] that looks like it was fixed in 20220927 specifically. Is it possible to have a new deb published? Should I open an issue for that on Launchpad?

Alternatively, should we be adding a different apt repo to our GCE instances to get these updates?

[1] #157 (comment), which I think is affecting quite a few people based on the dates of Google search results.

Package google-guest-agent install/re-install/upgrade triggers the shutdown-scripts to be executed on a running instance.

Steps to reproduce (tested on Ubuntu 18.04):

  1. Create a GCE VM with base Ubuntu 18.04.
  2. Add metadata startup-script and shutdown-script as documented.
  3. Once the VM is booted and the startup scripts have executed, run sudo apt-get install --reinstall google-guest-agent.
  4. Watch the shutdown scripts being triggered and shutting down services.

Problem

In our case the VM is running our master MySQL database. We do not want to stop the MySQL service unless we are shutting down for maintenance or failover. I would not expect updating a Google-managed package to trigger the shutdown sequence. (This seems to be a side effect of packaging the google-shutdown-scripts service as part of the same google-guest-agent package.)

The reason we ended up here is that google-guest-agent v20201217.02-0ubuntu1~18.04.0 has an issue where it does not add the required route local 172.31.6.35 dev ens4 proto 66 scope host, which was causing the ILB health checks to fail. When we contacted GCP support, they suggested updating that package, and doing so triggered the shutdown of MySQL, causing an outage. This behavior is not documented, and I believe it should be documented in the above docs on writing shutdown scripts so that people are aware of it, since most shutdown scripts are intended for graceful shutdowns and closing connections (and not for a service restart).

The issue could also be fixed by splitting the packages, so that it is clear that installing or updating the script runner package causes the shutdown script to run. Alternatively, the current package install steps could skip restarting the script runner services after install/upgrade.

Jul 28 16:46:10 instance-1 sudo[3025]: pramji_company_com : TTY=pts/0 ; PWD=/home/pramji_company_com ; USER=root ; COMMAND=/usr/bin/apt install --reinstall google-guest-agent
Jul 28 16:46:10 instance-1 sudo[3025]: pam_unix(sudo:session): session opened for user root by pramji_company_com(uid=0)
Jul 28 16:46:12 instance-1 sshd[3066]: Did not receive identification string from 35.191.1.32 port 34808
Jul 28 16:46:13 instance-1 systemd[1]: Reloading.
Jul 28 16:46:13 instance-1 systemd[1]: Starting Message of the Day...
Jul 28 16:46:13 instance-1 systemd[1]: Reloading.
Jul 28 16:46:13 instance-1 GCEGuestAgent[2561]: 2021-07-28T16:46:13.9101+10:00 GCEGuestAgent Info: GCE Agent Stopped
Jul 28 16:46:13 instance-1 systemd[1]: Stopping Google Compute Engine Guest Agent...
Jul 28 16:46:13 instance-1 systemd[1]: Stopping Google Compute Engine Shutdown Scripts...
Jul 28 16:46:13 instance-1 GCEMetadataScripts[3209]: 2021/07/28 16:46:13 GCEMetadataScripts: Starting shutdown scripts (version 20210414.00-0ubuntu1~18.04.0).
Jul 28 16:46:13 instance-1 systemd[1]: Stopped Google Compute Engine Guest Agent.
Jul 28 16:46:13 instance-1 systemd[1]: Started Google Compute Engine Guest Agent.
Jul 28 16:46:13 instance-1 systemd[1]: Starting Google Compute Engine Startup Scripts...
Jul 28 16:46:13 instance-1 GCEMetadataScripts[3209]: 2021/07/28 16:46:13 GCEMetadataScripts: Found shutdown-script in metadata.
Jul 28 16:46:13 instance-1 GCEMetadataScripts[3234]: 2021/07/28 16:46:13 GCEMetadataScripts: Starting startup scripts (version 20210414.00-0ubuntu1~18.04.0).
Jul 28 16:46:13 instance-1 test_shutdown_script[3266]: starting shutdown sequence
Jul 28 16:46:13 instance-1 GCEGuestAgent[3232]: 2021-07-28T16:46:13.9657+10:00 GCEGuestAgent Info: GCE Agent Started (version 20210414.00-0ubuntu1~18.04.0)
Jul 28 16:46:13 instance-1 GCEMetadataScripts[3234]: 2021/07/28 16:46:13 GCEMetadataScripts: Found startup-script in metadata.

More documentation about "Compute Engine cannot guarantee that the shutdown script will complete"

The documentation at https://cloud.google.com/compute/docs/shutdownscript#limitations states:

- Compute Engine executes shutdown scripts only on a best-effort basis. In rare cases, Compute Engine cannot guarantee that the shutdown script will complete.

I spent many hours making sure the shutdown script completes, because the correctly defined shutdown script was either not invoked or, if it was invoked, was killed after a few moments.

The used configuration is:

  • VM provisioning model: Spot
  • E2 medium
  • Ubuntu 20.04 LTS and Ubuntu 20.04 LTS Minimal (provided by GCE VM wizard)
  • Scripts in Metadata: startup-script, shutdown-script

More details are at the end of https://faun.pub/minecraft-server-on-digitalocean-with-vpn-140730681e3a , in the section Google Cloud. The output of the CREATE SIMILAR wizard (except SSH info) is:

gcloud compute instances create minecraft-1 --project=basic-computing-360703 --zone=europe-central2-a --machine-type=e2-medium --network-interface=network-tier=PREMIUM,subnet=minecraft-i --metadata=^,@^shutdown-script=/home/kodcsakany/docker-minecraft-server/spot_mc.sh\ stop,@startup-script=/home/kodcsakany/docker-minecraft-server/spot_mc.sh\ start --no-restart-on-failure --maintenance-policy=TERMINATE --preemptible --provisioning-model=SPOT --instance-termination-action=STOP --service-account=843003590676-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append --create-disk=auto-delete=yes,boot=yes,device-name=minecraft,image=projects/ubuntu-os-cloud/global/images/ubuntu-minimal-2004-focal-v20220817,mode=rw,size=50,type=projects/basic-computing-360703/zones/europe-central2-a/diskTypes/pd-standard --no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring --reservation-affinity=any

So, after forking the source code of https://github.com/GoogleCloudPlatform/guest-agent/tree/main/google_metadata_script_runner and adding debug messages, I realized that the script runner works well. The root cause lies in the systemd services.

I suggest adding a new paragraph with troubleshooting tips for shutdown script invocation:


Introduction:

  • The startup and shutdown scripts are invoked by systemd services provided by Google.
  • Google cannot guarantee that systemd will call stop on the shutdown script service within the 30 s window, or that the script will finish in time. This depends on other systemd services, not on Google.

Design rules:

  • If the startup or shutdown script was not invoked, check again that both are set in the metadata, because of a bug it's not possible to save the startup and shutdown scripts together.
  • Use only the systemd services that are really needed, because some of them can block the stop of other services, for example unattended-upgrades.service. If the blocking stop operation takes a long time, TimeoutStopSec can be decreased with systemctl edit unattended-upgrades.service, or the shutdown-time upgrade can be disabled in /etc/apt/apt.conf.d/50unattended-upgrades.
  • The Minimal images have fewer systemd services. Each systemd service stop consumes part of the short shutdown window (30 s).
  • Do not write extra shutdown script logs to the /tmp directory, because it will be cleared on the next boot.
  • If a service must keep running until the shutdown script has finished its stop operation, add a dependency on it to the shutdown script service. For example, for docker.service, run sudo systemctl edit google-shutdown-scripts.service and add:
[Unit]
Requires=docker.service
After=docker.service

Troubleshooting tips:

  • If the shutdown script was not invoked, get the last few minutes of the logs with sudo journalctl --since '10 min ago' and find the shutdown section between System is powering down. and -- Reboot --. Identify the messages of Google Compute Engine Shutdown Scripts and google_metadata_script_runner, then check how long after the powering down message the script was invoked, and how long before the -- Reboot -- message the Finished message appears.
  • The logs of the shutdown script can be printed with the sudo journalctl -u google-shutdown-scripts command.
  • The status of the shutdown script runner service can be printed with the systemctl status google-shutdown-scripts.service command.

Improve handling of large process output when running metadata-script

Things don't seem to be going well in the handling of large process output, either because of or in spite of pull #140. In an instance log I see:

google_metadata_script_runner[980]: error while communicating with "startup-script-url" script: bufio.Scanner: token too long
google_metadata_script_runner[980]: 2021/12/22 18:51:17 logging client: rpc error: code = PermissionDenied desc = Request had insufficient authentication scopes.
google_metadata_script_runner[980]: startup-script-url signal: broken pipe

With pull #140 the pipe is now closed, but closing the pipe can mean that the write() will fail (in my experimentation it actually throws a signal and kills the process, rather than just returning an error). I'm not sure if it's intentional to be doing something to lead to the likely killing/crash of the process. Perhaps what should be done instead is leave the pipe open so the process can still run, though to be more robust guest-agent would still need to read from the pipe so the child process doesn't block if the pipe eventually fills up.

A simple fix that might help reduce the "token too long" issue is to increase the amount of buffering, e.g.:

in.Buffer(make([]byte, 0, 4 * 1024), 2 * 1024 * 1024)
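
As an alternative to raising the Scanner limit, the output could be read with bufio.Reader.ReadLine, which returns an over-long line in several pieces instead of failing, so the pipe keeps being drained and the child process is not blocked or killed. The following is only a sketch of that idea, not the agent's actual code; readChunks and chunksOf are hypothetical names.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// readChunks drains r to EOF and hands every line to emit. A line longer
// than the reader's buffer arrives as several consecutive chunks instead of
// stopping the read with an error.
func readChunks(r io.Reader, bufSize int, emit func(string)) error {
	br := bufio.NewReaderSize(r, bufSize) // bufio enforces a 16-byte minimum
	for {
		chunk, _, err := br.ReadLine()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		emit(string(chunk))
	}
}

// chunksOf is a small helper for the demo: it runs readChunks over a string.
func chunksOf(s string, bufSize int) ([]string, error) {
	var out []string
	err := readChunks(strings.NewReader(s), bufSize, func(c string) {
		out = append(out, c)
	})
	return out, err
}

func main() {
	input := "short\n" + strings.Repeat("x", 40) + "\nend\n"
	chunks, err := chunksOf(input, 16)
	// The 40-byte line is split into several chunks and no error is returned.
	fmt.Println(len(chunks), err)
}
```

The trade-off is that a very long line shows up as multiple log entries, which seems preferable to losing the rest of the output.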

The context for this is that I'm running cloud builder, which seems to use this functionality to run the build in an instance and seems to get hung up on this issue. The particular command being run is "gsutil -m cp -r $COMPONENTS ." to copy ~6 files, and I presume something went wrong with gsutil that caused it to generate a large amount of text output that overwhelmed what the text scanner can handle.

nscd, unscd, cron and crond fail to restart

When google-guest-agent tries to start, it seems to try to start nscd, unscd, cron and crond, but those units are not present on our servers.

$ uname -a
Linux server-2 5.11.0-1020-gcp #22~20.04.1-Ubuntu SMP Tue Sep 21 10:54:26 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
# Ubuntu 20.04 LTS Minimal

$ systemctl status google-guest-agent
● google-guest-agent.service - Google Compute Engine Guest Agent
     Loaded: loaded (/lib/systemd/system/google-guest-agent.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2021-10-16 18:02:28 UTC; 19min ago
   Main PID: 555 (google_guest_ag)
      Tasks: 12 (limit: 9536)
     Memory: 20.8M
     CGroup: /system.slice/google-guest-agent.service
             └─555 /usr/bin/google_guest_agent

Oct 16 18:02:27 server-2 dhclient[620]: All rights reserved.
Oct 16 18:02:27 server-2 dhclient[620]: For info, please visit https://www.isc.org/software/dhcp/
Oct 16 18:02:27 server-2 dhclient[620]: 
Oct 16 18:02:27 server-2 dhclient[620]: Listening on Socket/ens4
Oct 16 18:02:27 server-2 dhclient[620]: Sending on   Socket/ens4
Oct 16 18:02:28 server-2 systemd[1]: Started Google Compute Engine Guest Agent.
Oct 16 18:02:28 server-2 GCEGuestAgent[555]: 2021-10-16T18:02:28.9221Z GCEGuestAgent Error oslogin.go:109: Error restarting service: Failed to try-restart nscd.service: Unit nscd.service not found.
                                                     .
Oct 16 18:02:29 server-2 GCEGuestAgent[555]: 2021-10-16T18:02:29.5818Z GCEGuestAgent Error oslogin.go:109: Error restarting service: Failed to try-restart unscd.service: Unit unscd.service not found.
                                                     .
Oct 16 18:02:29 server-2 GCEGuestAgent[555]: 2021-10-16T18:02:29.8194Z GCEGuestAgent Error oslogin.go:109: Error restarting service: Failed to try-restart cron.service: Unit cron.service not found.
                                                     .
Oct 16 18:02:29 server-2 GCEGuestAgent[555]: 2021-10-16T18:02:29.8254Z GCEGuestAgent Error oslogin.go:109: Error restarting service: Failed to try-restart crond.service: Unit crond.service not found.
                                                     .

Are these benign? If so, can they be downgraded from Errors?

These same error lines appear in #134, but in my case, the service is active/running, not dead.

Cloud logging fails due to metadata value propagation

The agent's cloud logging export fails since #82.

google_guest_agent[411]: 2020/12/14 18:11:02 logging client: rpc error: code = InvalidArgument desc = Name "projects//logs/GCEGuestAgent" is missing the parent component. Expected the form projects/[PROJECT_ID]/logs/[ID]

The project string is empty (not getting populated from metadata). This isn't distro specific and is consistent with the updated agent.

MetadataRunner should (possibly) ignore whitespace at top of string

If you create a metadata script with an extra line or whitespace at the beginning, the runner will attempt to execute it in the default shell, because the shebang is no longer at the very start of the file. So if you have a script that uses python or another shell and has an extra leading line, it will fail. We should consider stripping extra whitespace from the beginning of the value. This is an edge case regression from the previous agent.
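
A minimal sketch of the proposed stripping, assuming the script is available as a string before it is written to disk. normalizeScript and hasShebang are hypothetical names and not functions from the runner.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// normalizeScript drops whitespace before the first non-space character so
// that a shebang line, if present, becomes the first bytes of the script.
func normalizeScript(script string) string {
	return strings.TrimLeftFunc(script, unicode.IsSpace)
}

// hasShebang reports whether the script starts with an interpreter line.
func hasShebang(script string) bool {
	return strings.HasPrefix(script, "#!")
}

func main() {
	raw := "\n  #!/usr/bin/env python3\nprint('hello')\n"
	fmt.Println(hasShebang(raw))                  // false
	fmt.Println(hasShebang(normalizeScript(raw))) // true
}
```

Only leading whitespace is removed; the rest of the script, including trailing newlines, is left untouched.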

Setting MetadataScripts startup to false may result in google-startup-scripts service to enter failed state

Problem:
After disabling startup scripts in the instance config, the google-startup-scripts service may enter a failed state after rebooting the associated VM.

Expectation:
The google-startup-scripts service does not enter a failed state, after booting, because startup scripts are disabled in the instance config.


Snippet of detected failure:

$ systemctl --failed --all
  UNIT                           LOAD   ACTIVE SUB    DESCRIPTION
● google-startup-scripts.service loaded failed failed Google Compute Engine Startup Scripts

Take a look at the service's status:

$ sudo systemctl status google-startup-scripts.service
● google-startup-scripts.service - Google Compute Engine Startup Scripts
   Loaded: loaded (/lib/systemd/system/google-startup-scripts.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2021-04-19 18:32:44 UTC; 3min 6s ago
  Process: 3016 ExecStart=/usr/bin/google_metadata_script_runner startup (code=exited, status=2)
 Main PID: 3016 (code=exited, status=2)

Apr 19 18:32:44 aa-qa-6080-gcp0 systemd[1]: Starting Google Compute Engine Startup Scripts...
Apr 19 18:32:44 aa-qa-6080-gcp0 google_metadata_script_runner[3016]: startup scripts disabled in instance config
Apr 19 18:32:44 aa-qa-6080-gcp0 systemd[1]: google-startup-scripts.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 19 18:32:44 aa-qa-6080-gcp0 systemd[1]: google-startup-scripts.service: Failed with result 'exit-code'.
Apr 19 18:32:44 aa-qa-6080-gcp0 systemd[1]: Failed to start Google Compute Engine Startup Scripts.

Inspect contents of /etc/default/instance_configs.cfg:

$ cat /etc/default/instance_configs.cfg | tail -8
#
# Disable user supplied startup/shutdown scripts from running on
# the engine.
#
[MetadataScripts]
shutdown = false
startup = false
# END ANSIBLE MANAGED BLOCK

The service logged that it failed with the result 'exit-code', so let's take a closer look:

$ sudo google_metadata_script_runner startup
startup scripts disabled in instance config

$ echo $?
2

Details of the google-guest-agent package:

$ dpkg-query --status google-guest-agent
Package: google-guest-agent
Status: install ok installed
Priority: optional
Section: devel
Installed-Size: 23901
Maintainer: Ubuntu Developers <[email protected]>
Architecture: amd64
Version: 20201217.02-0ubuntu1~18.04.0
Replaces: gce-compute-image-packages (<< 20191115)
Depends: libc6 (>= 2.4)
Breaks: gce-compute-image-packages (<< 20191115), python3-google-compute-engine
Description: Google Compute Engine Guest Agent
 Contains the guest agent and metadata script runner binaries.
Built-Using: golang-1.13 (= 1.13.8-1ubuntu1~18.04.2)
Homepage: https://github.com/GoogleCloudPlatform/guest-agent

Please let me know if there is any additional information I can provide that will be helpful to reproduce, diagnose, or address the issue.

Our expectation was that by following the instructions (found here: https://github.com/GoogleCloudPlatform/guest-agent#configuration) to disable startup scripts, the associated services would continue to execute gracefully. It was an unexpected result to find the google-startup-scripts service in a failed state.
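
One way the expected behavior could be expressed is to treat "disabled in config" as a successful no-op when choosing the process exit code. This is a sketch under that assumption and not the runner's actual code; errDisabled, errScriptFailed, runScripts and exitCode are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// Hypothetical sentinel errors used only for this illustration.
var (
	errDisabled     = errors.New("startup scripts disabled in instance config")
	errScriptFailed = errors.New("script failed")
)

// runScripts stands in for the real runner logic: it reports errDisabled
// when scripts are switched off, otherwise the result of run.
func runScripts(enabled bool, run func() error) error {
	if !enabled {
		return errDisabled
	}
	return run()
}

// exitCode maps a runner result to a process exit status. Being disabled
// is deliberate configuration, so it is reported to systemd as success.
func exitCode(err error) int {
	if err == nil || errors.Is(err, errDisabled) {
		return 0
	}
	return 1
}

func main() {
	err := runScripts(false, func() error { return nil })
	if err != nil {
		fmt.Fprintln(os.Stderr, err) // still log why nothing was run
	}
	os.Exit(exitCode(err))
}
```

With this mapping the unit would finish cleanly while the log line explaining that scripts are disabled is kept.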

Bug determining expired keys in accountsMgr.diff()

This block uses getUserKeys to determine if there are any keys that have expired since the last iteration. However, the keys in sshKeys are not of the format that getUserKeys expects (no "user:" prefix), so they always hit this block, which means that getUserKeys returns an empty list, and diff always returns true.

Error getting config status, workload certificates may not be configured since 20230202.00

Since the release, I have been seeing these errors. I am using Terraform to create a docker+machine executor for GitLab. It was working flawlessly for months prior to the most recent release.

Feb  3 11:08:08 gitlab-ci-runner google_metadata_script_runner: startup-script: Installing Docker...
Feb  3 11:10:43 gitlab-ci-runner systemd: Starting GCE Workload Certificate refresh...
Feb  3 11:10:43 gitlab-ci-runner gce_workload_cert_refresh: 2023/02/03 11:10:43: Error getting config status, workload certificates may not be configured: HTTP 404
Feb  3 11:10:43 gitlab-ci-runner gce_workload_cert_refresh: 2023/02/03 11:10:43: Done
Feb  3 11:10:43 gitlab-ci-runner systemd: Started GCE Workload Certificate refresh.

Delete routing table gracefully

We observed connection timeouts on the client side when scaling in or performing a rolling update of MIGs.

The workflow is:
Client -> ILB -> MIG instance.

Before the ACPI G2 soft off signal was received[1], the forwarded IP (the IP of the ILB) had already been deleted from the routing table. This causes long-lived connections to time out.

[1]
image

Related source code,
https://github.com/GoogleCloudPlatform/guest-agent/blob/master/google_guest_agent/addresses.go#L407

Is there any way to perform this more gracefully? Any advice would be very helpful.

Vendored unix module fails to build due to Go version too old in go.mod

I was just trying to update the guest-agent package in openSUSE and the build fails with:

[   18s] + go build '-ldflags=-s -w -X main.version=20230426.00' -mod=vendor
[   18s] # golang.org/x/sys/unix
[   18s] ../vendor/golang.org/x/sys/unix/syscall.go:83:16: unsafe.Slice requires go1.17 or later (-lang was set to go1.16; check go.mod)
[   18s] ../vendor/golang.org/x/sys/unix/syscall_linux.go:2271:9: unsafe.Slice requires go1.17 or later (-lang was set to go1.16; check go.mod)
[   18s] ../vendor/golang.org/x/sys/unix/syscall_unix.go:118:7: unsafe.Slice requires go1.17 or later (-lang was set to go1.16; check go.mod)
[   18s] ../vendor/golang.org/x/sys/unix/sysvshm_unix.go:33:7: unsafe.Slice requires go1.17 or later (-lang was set to go1.16; check go.mod)
[   23s] error: Bad exit status from /var/tmp/rpm-tmp.Pw9TCI (%build)

due to the Go version in go.mod being too old.

Please bump the Go version in go.mod to 1.17 or later.

Cleanup references to distro logic that should be generic

In this PR, we are fixing when to disable NetworkManager, but only for the distros where we assume this applies. Instead, we should check whether NetworkManager is actually active (installed or otherwise), rather than relying only on what we know the baseline to be for EL7/8.

#50

Consider creating OWNERS files

cc @hopkiw

The approve plugin expects there to be OWNERS files in the repo. Consider creating them and/or filing a PR to disable the approve plugin in these repos.

{
 insertId: "3206wog42gdrsb"  
 jsonPayload: {
  author: "adjackura"   
  component: "hook"   
  event-GUID: "a5d1a300-ebb8-11e9-9868-406a0c1c7f7f"   
  event-type: "issue_comment"   
  file: "prow/plugins/approve/approvers/owners.go:164"   
  func: "k8s.io/test-infra/prow/plugins/approve/approvers.Owners.GetSuggestedApprovers"   
  level: "warning"   
  msg: "Couldn't find/suggest approvers for each files. Unapproved: [""]"   
  org: "GoogleCloudPlatform"   
  plugin: "approve"   
  pr: 2   
  repo: "guest-agent"   
  url: "https://github.com/GoogleCloudPlatform/guest-agent/pull/2#issuecomment-540843760"   
 }
 labels: {…}  
 logName: "projects/oss-prow/logs/hook"  
 receiveTimestamp: "2019-10-10T23:49:58.731720740Z"  
 resource: {…}  
 severity: "ERROR"  
 timestamp: "2019-10-10T23:49:53Z"  
}

Clock skew Debian 10

Hi,

I've noticed some clock skew on compute engine instances running debian 10 (debian-10-buster-v20200805).
Running /sbin/hwclock --hctosys -u --noadjfile fixed the issue.
But I can't seem to find a conclusive answer as to how the image is syncing the clock.
As far as I understand now the guest agent runs the clock sync on start and migration?
But there is nothing that runs the sync periodically to keep the clock in sync? I don't see anything that suggest ntp is configured.
(https://cloud.google.com/compute/docs/instances/managing-instances#configure_ntp_for_your_instances talks about setting it up but I get the feeling that is more for custom images?)
Setting up my own cronjob to run the sync seems weird as I'd expect the image to do this out of the box.
Am I missing something? Is there something that can mess with running the sync (like startupscript, apt upgrades, shielded vm or...)?
This did not seem to be an issue with debian 9 images.

Removing IPv6 forwarded IPs fails

The current implementation of getLocalRoutes on Unix-like systems does not obtain IPv6 routes, so it can never correctly reconcile the desired and actual routes. It fails to remove existing routes that should be removed.
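
The reconciliation step only works if both address families end up in the "actual" list. Here is a sketch of a family-agnostic comparison, assuming routes are plain IP address strings (CIDR ranges are not handled). diffRoutes and canonical are hypothetical names, not functions from the agent.

```go
package main

import (
	"fmt"
	"net"
	"sort"
)

// canonical returns the normalized text form of an IPv4 or IPv6 address so
// that equivalent spellings compare equal. Unparseable input is kept as-is.
func canonical(addr string) string {
	if ip := net.ParseIP(addr); ip != nil {
		return ip.String()
	}
	return addr
}

// diffRoutes compares desired and actual addresses and returns which ones
// must be added and which must be removed, sorted for stable output.
func diffRoutes(desired, actual []string) (toAdd, toRemove []string) {
	want := map[string]bool{}
	for _, a := range desired {
		want[canonical(a)] = true
	}
	have := map[string]bool{}
	for _, a := range actual {
		have[canonical(a)] = true
	}
	for a := range want {
		if !have[a] {
			toAdd = append(toAdd, a)
		}
	}
	for a := range have {
		if !want[a] {
			toRemove = append(toRemove, a)
		}
	}
	sort.Strings(toAdd)
	sort.Strings(toRemove)
	return toAdd, toRemove
}

func main() {
	desired := []string{"10.0.0.1", "2001:db8::1"}
	actual := []string{"10.0.0.1", "2001:DB8:0::2"}
	add, remove := diffRoutes(desired, actual)
	fmt.Println(add, remove) // [2001:db8::1] [2001:db8::2]
}
```

If the actual list never contains IPv6 entries, a stale IPv6 route can never appear in the removal list, which matches the failure described above.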

Google Guest Agent hangs when starting with no IP on the main interface

Description

When the google-guest-agent service starts but the primary interface has no IP, it remains stuck.
The status of the service is constantly changing between activating and deactivating.

This causes other dependent networking commands like systemctl start systemd-networkd, and netplan apply to also get stuck.

google-guest-agent.service - Google Compute Engine Guest Agent
   Loaded: loaded (/lib/systemd/system/google-guest-agent.service; enabled; vendor preset: enabled)
   Active: deactivating (stop-sigterm) (Result: timeout) since Wed 2022-08-24 16:54:26 UTC; 23min ago
 Main PID: 25708 (google_guest_ag)
    Tasks: 10 (limit: 4660)
   CGroup: /system.slice/google-guest-agent.service
           └─25708 /usr/bin/google_guest_agent

Aug 24 17:15:58 fresh systemd[1]: Starting Google Compute Engine Guest Agent...
Aug 24 17:15:58 fresh google_guest_agent[25708]: GCE Agent Started (version 20220622.00-0ubuntu2~18.04.0)
Aug 24 17:17:28 fresh systemd[1]: google-guest-agent.service: Start operation timed out. Terminating.
Aug 24 17:17:43 fresh google_guest_agent[25708]: CRITICAL main.go:298 error registering service: failed to shutdown within timeout 15s

Setup

I'm using a GCP VM with an Ubuntu 18.04 image.
The image ID (sourceImage) is projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220810.

One primary interface. No special configurations.

Steps to reproduce

systemctl stop google-guest-agent
ifconfig ens4 0
systemctl start google-guest-agent

The last command will hang and not finish, and no IP is configured on the interface.
Potentially nothing is avoiding acquiring a new IP.
When this happens, netplan apply finishes but doesn't resolve the IP.
systemctl restart systemd-networkd allows recovery after the agent failed to deactivate, but the subsequent activation succeeded.

Note

This commit made the issue occur more frequently, since starting systemd-networkd causes google-guest-agent to start.
So, if the systemd-networkd process completes successfully but the main interface does not get an IP (for whatever reason), the start of google-guest-agent gets stuck.

Update the search domain

A user has the option of setting a FQDN when creating an instance. The agent should expand the search domain by the domain provided as part of the FQDN set by the user.

The default search domain is provided by the GCE DHCP server; adding the search domain for a custom FQDN hostname should be handled by the agent.

This is distribution specific. For SUSE, /etc/sysconfig/network/config should be modified and the NETCONFIG_DNS_STATIC_SEARCHLIST setting should be expanded.
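
The distribution-independent part of this request could look roughly like the sketch below: derive the domain from the user-supplied FQDN and append it to the existing search list if it is missing. domainOf and addSearchDomain are hypothetical names, and writing the result to a distro-specific file is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// domainOf returns everything after the first label of an FQDN, or "" when
// the name has no domain part.
func domainOf(fqdn string) string {
	fqdn = strings.TrimSuffix(fqdn, ".")
	i := strings.IndexByte(fqdn, '.')
	if i < 0 || i == len(fqdn)-1 {
		return ""
	}
	return fqdn[i+1:]
}

// addSearchDomain appends domain to list unless it is empty or already
// present, keeping the DHCP-provided entries first.
func addSearchDomain(list []string, domain string) []string {
	if domain == "" {
		return list
	}
	for _, d := range list {
		if d == domain {
			return list
		}
	}
	return append(list, domain)
}

func main() {
	// Example values only; the DHCP-provided list differs per project and zone.
	search := []string{"c.my-project.internal", "google.internal"}
	search = addSearchDomain(search, domainOf("vm1.corp.example.com"))
	fmt.Println(strings.Join(search, " "))
}
```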

google-guest-agent fails to insert keys if the username includes a space

On Windows it is possible for the user name to include a space, so the metadata record can look something like this:
flast:ssh-rsa RSA-KEY-GOES-HERE FIRSTS-PC\FIRST LAST@FIRSTS-PC

Prior to commit fbdff753ac709e2dc00305eee0661ae7da6e4cf4 non_windows_accounts.go was simply taking everything before the colon as the username and everything after as the key. However that commit changed it so that a split on spaces is performed to handle optional expiry data. This results in an error that looks something like this:
ERROR non_windows_accounts.go:199 invalid character 'L' looking for beginning of value: flast:ssh-rsa
It looks like rather than parsing the above metadata as:

  • Username: flast
  • Key: RSA-KEY-GOES-HERE FIRSTS-PC\FIRST LAST@FIRSTS-PC

It's instead parsing it as:

  • Username: flast
  • Key: RSA-KEY-GOES-HERE FIRSTS-PC\FIRST
  • Expiry: LAST@FIRSTS-PC

This prevents Windows users with spaces in their username from accessing any VM with the updated google-guest-agent package.
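
A possible way to avoid the misparse is to treat the trailing field as expiry data only when the key comment is the google-ssh marker, and otherwise keep the whole remainder as the key. This sketch assumes the documented metadata format user:TYPE KEY google-ssh {"userName":...,"expireOn":...}; parseSSHKey and sshKey are hypothetical names, not the agent's code.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// sshKey is the parsed form of one metadata entry.
type sshKey struct {
	User     string
	Key      string // key type, key material and any comment, unmodified
	ExpireOn string // empty when the entry carries no expiry data
}

// parseSSHKey splits an entry at the first colon and only looks for JSON
// expiry data when the third space-separated field is "google-ssh".
func parseSSHKey(entry string) (sshKey, error) {
	i := strings.IndexByte(entry, ':')
	if i <= 0 {
		return sshKey{}, errors.New("missing username prefix")
	}
	k := sshKey{User: entry[:i], Key: entry[i+1:]}
	fields := strings.SplitN(k.Key, " ", 4)
	if len(fields) == 4 && fields[2] == "google-ssh" {
		var meta struct {
			ExpireOn string `json:"expireOn"`
		}
		if err := json.Unmarshal([]byte(fields[3]), &meta); err != nil {
			return sshKey{}, fmt.Errorf("invalid expiry data: %w", err)
		}
		k.ExpireOn = meta.ExpireOn
	}
	return k, nil
}

func main() {
	k, err := parseSSHKey(`flast:ssh-rsa RSA-KEY-GOES-HERE FIRSTS-PC\FIRST LAST@FIRSTS-PC`)
	fmt.Printf("%q %q %v\n", k.User, k.ExpireOn, err) // "flast" "" <nil>
}
```

With this rule a comment containing spaces is never handed to the JSON decoder, so the "invalid character" error above does not occur for such entries.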

Support chrony for time syncing

We currently only try to call ntpdate when syncing the clock after a migration event. However, most newer distros use chrony (or something else). Re-evaluate this code and maybe look at the default ntp client.
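
One possible shape for this re-evaluation is to probe for the available client in order of preference instead of assuming ntpdate. The sketch below is an assumption-laden illustration: the candidate list, the use of chronyc makestep, and the names syncCommands, pickSyncCommand and fakeLookPath are all hypothetical and not taken from the agent.

```go
package main

import (
	"fmt"
	"os/exec"
)

// syncCommands lists candidate clock-sync commands in order of preference.
// Both entries are assumptions chosen for illustration.
var syncCommands = [][]string{
	{"chronyc", "makestep"},
	{"ntpdate", "metadata.google.internal"},
}

// pickSyncCommand returns the first candidate whose binary lookPath can
// find, or nil when none is installed.
func pickSyncCommand(lookPath func(string) (string, error)) []string {
	for _, cmd := range syncCommands {
		if _, err := lookPath(cmd[0]); err == nil {
			return cmd
		}
	}
	return nil
}

// fakeLookPath returns a lookup function that only "finds" the named
// tools; it exists so the selection logic can be tried without them.
func fakeLookPath(installed ...string) func(string) (string, error) {
	return func(name string) (string, error) {
		for _, n := range installed {
			if n == name {
				return "/usr/bin/" + name, nil
			}
		}
		return "", exec.ErrNotFound
	}
}

func main() {
	// On a real VM, pass exec.LookPath instead of the fake.
	fmt.Println(pickSyncCommand(fakeLookPath("ntpdate")))
	fmt.Println(pickSyncCommand(exec.LookPath) != nil)
}
```

Injecting the lookup function keeps the selection logic testable on machines where neither client is installed.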

shutdown script run on update of package

I am running an Ubuntu 22.04 image and apt-daily-upgrade.timer is active. This causes packages to be updated, and when the guest agent package is updated it runs the shutdown script. Our software then performs a clean shutdown, because we expect the node to be terminated. The node is not terminated, however, so we are left with a dead node.
As a workaround, I disabled the daily apt update so that our nodes are not killed, and activated health checks in the instance group. But I don't think the shutdown script should run on an update of the guest agent package.

I do have some logs that I can send if required
