bosh-warden-cpi-release's People

Contributors

aramprice, camilalonart, codeword, cppforlife, crhntr, cunnie, dprotaso, jpalermo, karlkfi, krishicks, lnguyen, luan, lwoydziak, mariash, ragaskar, rkoster, selzoc, tjvman, xtreme-andrei-dinin, zaksoup, zhang-hua

bosh-warden-cpi-release's Issues

Not all actions expect arguments (the check below causes issues for the info call)

Hi @cppforlife

We are trying to update bosh-cli to use the Info() action to extract api_version for V2 support.

During our testing, the following issue emerged from the Info() action design: we are not passing any arguments to the CPI, which causes the CPI to error out:

CPI 'info' method responded with error: CmdError{"type":"Bosh::Clouds::CpiError","message":"Must provide 'arguments' key","ok_to_retry":false}

I kind of like this argument validation, but I'm not sure whether we should just pass a blank argument for Info() for now.

Argument checker:

https://github.com/cppforlife/bosh-warden-cpi-release/blob/master/src/github.com/cppforlife/bosh-warden-cpi/action/info.go#L13-L14

Info action:

https://github.com/cppforlife/bosh-warden-cpi-release/blob/master/src/github.com/cppforlife/bosh-cpi-go/rpc/json_dispatcher.go#L63-L65
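
For context, here is a minimal sketch of the dispatcher-side alternative (hypothetical names, not the actual bosh-cpi-go types): treat a missing "arguments" key as an empty argument list instead of rejecting the request, so a bare {"method":"info","context":{}} payload would still be accepted.

package main

import (
	"encoding/json"
	"fmt"
)

// Request mirrors the shape of a CPI JSON request; the pointer lets us tell
// a missing "arguments" key apart from an explicitly empty one.
type Request struct {
	Method    string            `json:"method"`
	Arguments *[]interface{}    `json:"arguments"`
	Context   map[string]string `json:"context"`
}

func parse(payload []byte) (Request, error) {
	var req Request
	if err := json.Unmarshal(payload, &req); err != nil {
		return Request{}, err
	}
	if req.Arguments == nil {
		// Today this is where "Must provide 'arguments' key" would be returned;
		// the alternative is to default to zero arguments for actions like info.
		empty := []interface{}{}
		req.Arguments = &empty
	}
	return req, nil
}

func main() {
	req, err := parse([]byte(`{"method":"info","context":{}}`))
	fmt.Println(req.Method, len(*req.Arguments), err) // info 0 <nil>
}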

Deploying CF to BOSH-lite on GCP

We have been using bosh-lite to deploy CF on GCP. The deployment has been working well; however, when running CATs against the deployed instances, we have been seeing extreme inconsistency: most runs have approximately 8 failures, with little consistency as to which tests fail.

We have been deploying on GCP using an n1-standard-16 (16 vCPUs and 60 GB RAM). We have tried tweaking the machine to use local SSDs for each of /var/vcap/store and /var/vcap/data, since it appeared to be an IOPS issue, but that has not noticeably altered the success rate.

We have been running CATs using this CPI on AWS (with garden-linux) quite successfully. In attempting to shift to GCP we have used both garden-linux and garden-runc, but both give similar results.

The most common failures are 502 Bad Gateway: Registered endpoint failed to handle the request, which makes us suspect an issue with container networking.

Is there an expectation that bosh-warden-cpi-release would work well with GCP?

Error deleting VM (plain `exit status 1`)

Tried to delete the deployment, got this error:

Error: CPI error 'Bosh::Clouds::CloudError' with message 'Deleting vm '{{1d178d46-ad2e-449d-662b-d0965906cc3e}}': exit status 1' in 'delete_vm' CPI method

We've seen this multiple times in CI. Using warden CPI version 37.

Debug logs attached.
warden-debug.log

Creating container: the requested IP is already allocated

Hi @cppforlife,

We started using BOSH Lite on top of SoftLayer machines a few weeks ago, and we have been seeing the following error when creating new VMs:

D, [2018-04-03T12:21:10 #166922] [task:32] DEBUG -- DirectorJobRunner: (0.000099s) BEGIN
D, [2018-04-03T12:21:10 #166922] [task:32] DEBUG -- DirectorJobRunner: (0.000477s) INSERT INTO "events" ("parent_id", "timestamp", "user", "action", "object_type", "object_name", "error", "task", "deployment", "instance", "context_json") VALUES (NULL, '2018-04-03 12:21:10.230788+0000', 'admin', 'release', 'lock', 'lock:deployment:cf', NULL, '32', 'cf', NULL, '{}') RETURNING *
D, [2018-04-03T12:21:10 #166922] [task:32] DEBUG -- DirectorJobRunner: (0.000916s) COMMIT
D, [2018-04-03T12:21:10 #166922] [task:32] DEBUG -- DirectorJobRunner: (0.000124s) BEGIN
D, [2018-04-03T12:21:10 #166922] [task:32] DEBUG -- DirectorJobRunner: (0.000309s) UPDATE "tasks" SET "event_output" = ("event_output" || '{"time":1522758070,"error":{"code":100,"message":"CPI error ''Bosh::Clouds::CloudError'' with message ''Creating VM with agent ID ''{{f06f4c97-3732-4821-9682-2ba420d9235a}}'': Creating container: the requested IP is already allocated'' in ''create_vm'' CPI method"}}
') WHERE ("id" = 32)
D, [2018-04-03T12:21:10 #166922] [task:32] DEBUG -- DirectorJobRunner: (0.000821s) COMMIT
E, [2018-04-03T12:21:10 #166922] [task:32] ERROR -- DirectorJobRunner: CPI error 'Bosh::Clouds::CloudError' with message 'Creating VM with agent ID '{{f06f4c97-3732-4821-9682-2ba420d9235a}}': Creating container: the requested IP is already allocated' in 'create_vm' CPI method

Our workaround has been to recreate the BOSH director every morning, because we keep hitting this constantly.

This happens in at least two different scenarios:

  1. When running errands without --keep-alive
  2. When the stemcell needs to be updated

We have used latest bosh-deployment and garden-runc to deploy this director.

I started debugging this problem for the errand case and found the following:

  • When an errand run succeeds, Garden is never asked to remove the container, so the container is left behind (keeping all of its bind mounts and network settings):
  bosh/0:/var/vcap/sys/log# grep -r "c7d89bc-4062-4630-6baa-e746ce392c47"
warden_cpi/cpi.stderr.log:[File System] 2018/04/04 06:32:51 DEBUG - Making dir /var/vcap/store/warden_cpi/ephemeral_bind_mounts_dir/8c7d89bc-4062-4630-6baa-e746ce392c47 with perm 0755
warden_cpi/cpi.stderr.log:[File System] 2018/04/04 06:32:51 DEBUG - Making dir /var/vcap/store/warden_cpi/persistent_bind_mounts_dir/8c7d89bc-4062-4630-6baa-e746ce392c47 with perm 0755
warden_cpi/cpi.stderr.log:[Cmd Runner] 2018/04/04 06:32:51 DEBUG - Running command 'mount --bind /var/vcap/store/warden_cpi/persistent_bind_mounts_dir/8c7d89bc-4062-4630-6baa-e746ce392c47 /var/vcap/store/warden_cpi/persistent_bind_mounts_
dir/8c7d89bc-4062-4630-6baa-e746ce392c47'
warden_cpi/cpi.stderr.log:[Cmd Runner] 2018/04/04 06:32:51 DEBUG - Running command 'mount --make-unbindable /var/vcap/store/warden_cpi/persistent_bind_mounts_dir/8c7d89bc-4062-4630-6baa-e746ce392c47'
warden_cpi/cpi.stderr.log:[Cmd Runner] 2018/04/04 06:32:51 DEBUG - Running command 'mount --make-shared /var/vcap/store/warden_cpi/persistent_bind_mounts_dir/8c7d89bc-4062-4630-6baa-e746ce392c47'
warden_cpi/cpi.stderr.log:[WardenCreator] 2018/04/04 06:32:51 DEBUG - Creating container with spec garden.ContainerSpec{Handle:"8c7d89bc-4062-4630-6baa-e746ce392c47", GraceTime:0, RootFSPath:"/var/vcap/store/warden_cpi/stemcells/79dd3606-
ed35-438a-57d9-8da15c1d1cd7", Image:garden.ImageRef{URI:"", Username:"", Password:""}, BindMounts:[]garden.BindMount{garden.BindMount{SrcPath:"/var/vcap/store/warden_cpi/ephemeral_bind_mounts_dir/8c7d89bc-4062-4630-6baa-e746ce392c47", Dst
Path:"/var/vcap/data", Mode:0x1, Origin:0x0}, garden.BindMount{SrcPath:"/var/vcap/store/warden_cpi/persistent_bind_mounts_dir/8c7d89bc-4062-4630-6baa-e746ce392c47", DstPath:"/warden-cpi-dev", Mode:0x1, Origin:0x0}}, Network:"10.244.0.142/
20", Properties:garden.Properties{}, Env:[]string(nil), Privileged:true, Limits:garden.Limits{Bandwidth:garden.BandwidthLimits{RateInBytesPerSecond:0x0, BurstRateInBytesPerSecond:0x0}, CPU:garden.CPULimits{LimitInShares:0x0}, Disk:garden.
DiskLimits{InodeSoft:0x0, InodeHard:0x0, ByteSoft:0x0, ByteHard:0x0, Scope:0x0}, Memory:garden.MemoryLimits{LimitInBytes:0x0}, Pid:garden.PidLimits{Max:0x0}}, NetOut:[]garden.NetOutRule(nil), NetIn:[]garden.NetIn(nil)}
warden_cpi/cpi.stderr.log:{{{8c7d89bc-4062-4630-6baa-e746ce392c47}} %!s(*rpc.ResponseError=<nil>) }
warden_cpi/cpi.stderr.log:{"result":"8c7d89bc-4062-4630-6baa-e746ce392c47","error":null,"log":""}
warden_cpi/cpi.stderr.log:{"method":"set_vm_metadata","arguments":["8c7d89bc-4062-4630-6baa-e746ce392c47",{"director":"bosh","deployment":"cf","id":"207f220b-81f4-47fc-979c-bca577c7725c","job":"smoke-tests","instance_group":"smoke-tests",
"index":"0","name":"smoke-tests/207f220b-81f4-47fc-979c-bca577c7725c","created_at":"2018-04-04T06:32:51Z"}],"context":{"director_uuid":"66d53a2a-b9f4-4f1a-9c77-11bcf0e97be0","request_id":"cpi-696061"}}
warden_cpi/cpi.stderr.log:{set_vm_metadata [8c7d89bc-4062-4630-6baa-e746ce392c47 map[id:207f220b-81f4-47fc-979c-bca577c7725c job:smoke-tests instance_group:smoke-tests index:0 name:smoke-tests/207f220b-81f4-47fc-979c-bca577c7725c created_
at:2018-04-04T06:32:51Z director:bosh deployment:cf]] {{"director_uuid":"66d53a2a-b9f4-4f1a-9c77-11bcf0e97be0","request_id":"cpi-696061"}}}
  • delete_vm is called only when the errand execution failed, never when it succeeded (the output below does not contain 8c7d89bc-4062-4630-6baa-e746ce392c47):
warden_cpi/cpi.stderr.log:{"method":"delete_vm","arguments":["4af526ea-01bf-472d-6d8c-3fa5029639f2"],"context":{"director_uuid":"66d53a2a-b9f4-4f1a-9c77-11bcf0e97be0","request_id":"cpi-445922"}}
warden_cpi/cpi.stderr.log:{delete_vm [4af526ea-01bf-472d-6d8c-3fa5029639f2] {{"director_uuid":"66d53a2a-b9f4-4f1a-9c77-11bcf0e97be0","request_id":"cpi-445922"}}}
warden_cpi/cpi.stderr.log:{"method":"delete_vm","arguments":["599bf324-0cad-46cc-51dd-8a0e350e105f"],"context":{"director_uuid":"66d53a2a-b9f4-4f1a-9c77-11bcf0e97be0","request_id":"cpi-295787"}}
warden_cpi/cpi.stderr.log:{delete_vm [599bf324-0cad-46cc-51dd-8a0e350e105f] {{"director_uuid":"66d53a2a-b9f4-4f1a-9c77-11bcf0e97be0","request_id":"cpi-295787"}}}
warden_cpi/cpi.stderr.log:{"method":"delete_vm","arguments":["bbf8e339-4ac6-4be3-6768-8d7bee9ab245"],"context":{"director_uuid":"66d53a2a-b9f4-4f1a-9c77-11bcf0e97be0","request_id":"cpi-948746"}}
warden_cpi/cpi.stderr.log:{delete_vm [bbf8e339-4ac6-4be3-6768-8d7bee9ab245] {{"director_uuid":"66d53a2a-b9f4-4f1a-9c77-11bcf0e97be0","request_id":"cpi-948746"}}}
warden_cpi/cpi.stderr.log:{"method":"delete_vm","arguments":["9b0c9a7f-040d-43d6-5284-6f67b0e2bce7"],"context":{"director_uuid":"66d53a2a-b9f4-4f1a-9c77-11bcf0e97be0","request_id":"cpi-987437"}}
warden_cpi/cpi.stderr.log:{delete_vm [9b0c9a7f-040d-43d6-5284-6f67b0e2bce7] {{"director_uuid":"66d53a2a-b9f4-4f1a-9c77-11bcf0e97be0","request_id":"cpi-987437"}}}
  • I see the BOSH director calling the CPI to clean up the VM, but not the CPI triggering a Garden delete (which is why I point at the warden CPI):
I, [2018-04-03T12:19:46 #150695] []  INFO -- DirectorJobRunner: Deleting vms
I, [2018-04-03T12:19:46 #150695] []  INFO -- DirectorJobRunner: Starting to delete job vms
D, [2018-04-03T12:19:46 #150695] [] DEBUG -- DirectorJobRunner: (0.000272s) SELECT * FROM "vms" WHERE (("instance_id" = 1) AND ("active" IS TRUE)) LIMIT 1
D, [2018-04-03T12:19:46 #150695] [] DEBUG -- DirectorJobRunner: (0.000312s) SELECT * FROM "vms" WHERE (("instance_id" = 1) AND ("active" IS TRUE)) LIMIT 1
D, [2018-04-03T12:19:46 #150695] [] DEBUG -- DirectorJobRunner: (0.000108s) BEGIN
D, [2018-04-03T12:19:46 #150695] [] DEBUG -- DirectorJobRunner: (0.000326s) INSERT INTO "events" ("parent_id", "timestamp", "user", "action", "object_type", "object_name", "error", "task", "deployment", "instance", "context_json") VALUES (NULL, '2018-04-03 12:19:46.446698+0000', 'admin', 'delete', 'vm', '69270cda-d2d5-4086-5b12-c086cbe187bc', NULL, '30', 'cf', 'smoke-tests/207f220b-81f4-47fc-979c-bca577c7725c', '{}') RETURNING *
D, [2018-04-03T12:19:46 #150695] [] DEBUG -- DirectorJobRunner: (0.000895s) COMMIT
I, [2018-04-03T12:19:46 #150695] []  INFO -- DirectorJobRunner: Deleting VM
D, [2018-04-03T12:19:46 #150695] [] DEBUG -- DirectorJobRunner: (0.000869s) SELECT * FROM "configs" WHERE ("id" IN (SELECT max("id") FROM "configs" WHERE ("type" = 'cpi') GROUP BY "name"))
D, [2018-04-03T12:19:46 #150695] [] DEBUG -- DirectorJobRunner: (0.000436s) SELECT * FROM "configs" WHERE ("id" IN (SELECT max("id") FROM "configs" WHERE ("type" = 'cloud') GROUP BY "name"))
D, [2018-04-03T12:19:46 #150695] [] DEBUG -- DirectorJobRunner: reserved ranges
D, [2018-04-03T12:19:46 #150695] [] DEBUG -- DirectorJobRunner: (0.000301s) SELECT * FROM "vms" WHERE (("instance_id" = 1) AND ("active" IS TRUE)) LIMIT 1
D, [2018-04-03T12:19:46 #150695] [] DEBUG -- DirectorJobRunner: (0.000087s) BEGIN
D, [2018-04-03T12:19:46 #150695] [] DEBUG -- DirectorJobRunner: (0.000325s) UPDATE "vms" SET "active" = false WHERE ("id" = 19)
D, [2018-04-03T12:19:46 #150695] [] DEBUG -- DirectorJobRunner: (0.000929s) COMMIT
D, [2018-04-03T12:19:46 #150695] [] DEBUG -- DirectorJobRunner: (0.001025s) DELETE FROM "vms" WHERE "id" = 19
D, [2018-04-03T12:19:46 #150695] [] DEBUG -- DirectorJobRunner: (0.000116s) BEGIN
D, [2018-04-03T12:19:46 #150695] [] DEBUG -- DirectorJobRunner: (0.000489s) INSERT INTO "events" ("parent_id", "timestamp", "user", "action", "object_type", "object_name", "error", "task", "deployment", "instance", "context_json") VALUES (367, '2018-04-03 12:19:46.483312+0000', 'admin', 'delete', 'vm', '69270cda-d2d5-4086-5b12-c086cbe187bc', NULL, '30', 'cf', 'smoke-tests/207f220b-81f4-47fc-979c-bca577c7725c', '{}') RETURNING *
D, [2018-04-03T12:19:46 #150695] [] DEBUG -- DirectorJobRunner: (0.000961s) COMMIT
D, [2018-04-03T12:19:46 #150695] [] DEBUG -- DirectorJobRunner: Thread is no longer needed, cleaning up
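
As a debugging aid (not part of the CPI itself), here is a rough sketch that lists the containers Garden still knows about, assuming a bosh-lite style host where the Garden API listens on 127.0.0.1:7777 as in the logs above and using the standard garden client packages; any handle that no longer maps to a VM the director knows about is a leaked container still holding its IP allocation.

package main

import (
	"fmt"

	"code.cloudfoundry.org/garden"
	gclient "code.cloudfoundry.org/garden/client"
	"code.cloudfoundry.org/garden/client/connection"
)

func main() {
	// Garden API endpoint; 127.0.0.1:7777 is the usual bosh-lite address.
	client := gclient.New(connection.New("tcp", "127.0.0.1:7777"))

	containers, err := client.Containers(garden.Properties{})
	if err != nil {
		panic(err)
	}

	for _, c := range containers {
		info, err := c.Info()
		if err != nil {
			continue
		}
		// Handles created by the warden CPI are the VM CIDs; the container IP
		// is what "the requested IP is already allocated" is complaining about.
		fmt.Printf("%s %s\n", c.Handle(), info.ContainerIP)
	}
}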

How to disable stemcell image decompression for grootfs?

I initially reported a bug in the grootfs project, but detailed debugging showed that it is a bosh-warden-cpi-release issue.

It looks like warden-cpi always decompresses a stemcell image.

I cannot create an instance because grootfs complains about the directory structure; it expects a tar archive:

Creating VM with agent ID '{{eb2931fc-1e68-44ef-8492-876b7f01647f}}': Creating container: running image plugin create: invalid base image: directory provided instead of a tar file

How can I disable the decompression feature?

Issue

Director task 4
Started unknown
Started unknown > Binding deployment. Done (00:00:00)

Started preparing deployment
Started preparing deployment > Binding releases. Done (00:00:00)
Started preparing deployment > Binding existing deployment. Done (00:00:00)
Started preparing deployment > Binding resource pools. Done (00:00:00)
Started preparing deployment > Binding stemcells. Done (00:00:00)
Started preparing deployment > Binding templates. Done (00:00:00)
Started preparing deployment > Binding properties. Done (00:00:00)
Started preparing deployment > Binding unallocated VMs. Done (00:00:00)
Started preparing deployment > Binding instance networks. Done (00:00:00)

Started preparing package compilation > Finding packages to compile. Done (00:00:00)

Started compiling packages > apache2/4fdebebaa6fcc528155afc5fabae6486131c0cee. Failed: Creating VM with agent ID '0e4a11ab-51ab-4e01-9b81-9691a7a102f8': Creating container: bad response: invalid character 'i' looking for beginning of value (00:00:01)

Error 100: Creating VM with agent ID '0e4a11ab-51ab-4e01-9b81-9691a7a102f8': Creating container: bad response: invalid character 'i' looking for beginning of value

retry on iptables race?

Task 25 | 23:25:42 | Creating missing vms: router/7527ce49-9f27-467e-a730-ae1b326a5be5 (0) (00:00:09)
                   L Error: CPI error 'Bosh::Clouds::CloudError' with message 'Creating VM with agent ID '{{7ae0345c-e2bc-4018-a106-af956d84990d}}': Forwarding host ports: Forwarding host port(s) '{%!s(int=80) %!s(int=80)}': Running command: 'iptables -w -t nat -A PREROUTING -p tcp ! -i w+ --dport 80 -j DNAT --to 10.244.0.34:80 -m comment --comment bosh-warden-cpi-3c8673a5-8c4c-483b-45bf-84c40481c80e', stdout: '', stderr: 'iptables: Resource temporarily unavailable.
': exit status 4' in 'create_vm' CPI method
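
A bounded retry, roughly sketched below (illustrative only, not the CPI's actual port-forwarding code), would likely paper over the race, since exit status 4 ('Resource temporarily unavailable') typically indicates transient contention on the iptables/xtables lock rather than a bad rule.

package main

import (
	"fmt"
	"os/exec"
	"time"
)

// runWithRetry retries a command a few times with a short pause, returning the
// last error if all attempts fail.
func runWithRetry(attempts int, name string, args ...string) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		out, err := exec.Command(name, args...).CombinedOutput()
		if err == nil {
			return nil
		}
		lastErr = fmt.Errorf("%v: %s", err, out)
		time.Sleep(500 * time.Millisecond)
	}
	return lastErr
}

func main() {
	// The same kind of DNAT rule as in the failing task above (10.244.0.34:80 is
	// the container's IP:port); requires root and iptables on the host.
	err := runWithRetry(5, "iptables",
		"-w", "-t", "nat", "-A", "PREROUTING", "-p", "tcp", "!", "-i", "w+",
		"--dport", "80", "-j", "DNAT", "--to", "10.244.0.34:80",
		"-m", "comment", "--comment", "bosh-warden-cpi-example")
	fmt.Println(err)
}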

retry on some io errors?

Error 100: CPI error 'Bosh::Clouds::CloudError' with message 'Creating VM with agent ID '2149f380-8f41-44d8-ba48-598dd5c9d383': Updating container's metadata: Saving metadata: Moving temporary file to destination '/var/vcap/bosh/warden-cpi-metadata.json': Waiting for script: connection: failed to hijack stream Stdout: dial tcp 127.0.0.1:7777: i/o timeout' in 'create_vm' CPI method

VM creation fails due to resolvconf errors when used with garden-runc 1.6.0

Running bosh-lite on stemcell 3421.11, garden-runc 1.6.0, bosh 261.4, warden-cpi 34. When trying to create VMs to compile the first deployment on that director, we're getting errors from resolvconf -u: cannot move '/run/resolvconf/resolv.conf-new-<number>' to '/run/resolvconf/resolv.conf': Device or resource busy.

Talked to @julz about this, since switching to garden-linux resolved the issue. He mentioned that garden-runc bind-mounts /etc/resolv.conf, similar to how Docker behaves. It looks like the Docker CPI works around this here: https://github.com/cppforlife/bosh-docker-cpi-release/blob/44c9261e287fe261097919591c139546283aba17/src/github.com/cppforlife/bosh-docker-cpi/vm/factory.go#L76-L86

Is something similar needed in the warden-cpi to better support garden-runc?

Mount inside ephemeral disk breaks delete_vm on openSUSE stemcell

I'm trying to understand an issue we ran into while deploying zookeeper on an openSUSE stemcell.
The deployment fails on BOSH Lite:

/var/vcap/store/warden_cpi/ephemeral_bind_mounts_dir/be060a40-d076-4be2-41cb-7e1fb1606353/sys/run is mounted in the host.

    Task 58 | 17:14:01 | Preparing deployment: Preparing deployment (00:00:00)
    Task 58 | 17:14:01 | Preparing package compilation: Finding packages to compile (00:00:00)
    Task 58 | 17:14:01 | Compiling packages: golang-1.8-linux/3eac55db0483de642b1be389966327e931db3e3f
    Task 58 | 17:14:01 | Compiling packages: java/c524e46e61b37894935ae28016973e0e8644fcde
    Task 58 | 17:14:01 | Compiling packages: zookeeper/43ee655b89f8a05cc472ca997e8c8186457241c1 (00:00:31)
    Task 58 | 17:14:49 | Compiling packages: golang-1.8-linux/3eac55db0483de642b1be389966327e931db3e3f (00:00:48)
    Task 58 | 17:14:49 | Compiling packages: smoke-tests/ec91e258c41471227a759c2749e7295cb65eff5a
    Task 58 | 17:14:50 | Compiling packages: java/c524e46e61b37894935ae28016973e0e8644fcde (00:00:49)
    Task 58 | 17:14:59 | Compiling packages: smoke-tests/ec91e258c41471227a759c2749e7295cb65eff5a (00:00:10)
    Task 58 | 17:15:30 | Error: CPI error 'Bosh::Clouds::CloudError' with message 'Deleting vm '{{c93454a4-d977-49c2-68c7-a0544fec5daf}}': Removing ephemeral bind mount: remove /var/vcap/store/warden_cpi/ephemeral_
bind_mounts_dir/c93454a4-d977-49c2-68c7-a0544fec5daf/sys/run: device or resource busy' in 'delete_vm' CPI method

    Task 58 Started  Wed Oct  4 17:14:01 UTC 2017
    Task 58 Finished Wed Oct  4 17:15:30 UTC 2017
    Task 58 Duration 00:01:29
    Task 58 error

    Updating deployment:
      Expected task '58' to succeed but state is 'error'

    Exit code 1

Unmounting 'sys/run' before deleting the ephemeral bind mount directory fixes this:

diff --git a/src/github.com/cppforlife/bosh-warden-cpi/vm/fs_host_bind_mounts.go b/src/github.com/cppforlife/bosh-warden-cpi/vm/fs_host_bind_mounts.go
index 7fe263d..a64565a 100644
--- a/src/github.com/cppforlife/bosh-warden-cpi/vm/fs_host_bind_mounts.go
+++ b/src/github.com/cppforlife/bosh-warden-cpi/vm/fs_host_bind_mounts.go
@@ -61,7 +61,13 @@ func (hbm FSHostBindMounts) MakeEphemeral(id apiv1.VMCID) (string, error) {
 func (hbm FSHostBindMounts) DeleteEphemeral(id apiv1.VMCID) error {
        path := filepath.Join(hbm.ephemeralBindMountsDir, id.AsString())

-       err := hbm.deletePath(path)
+       sysRunPath := filepath.Join(path, "sys", "run")
+       _, _, _, err := hbm.cmdRunner.RunCommand("umount", sysRunPath)
+       if err != nil && !strings.Contains(err.Error(), "not mounted") {
+               return err
+       }
+
+       err = hbm.deletePath(path)
        if err != nil {
                return bosherr.WrapError(err, "Removing ephemeral bind mount")
        }

I can't figure out why this is mounted in the first place; the Ubuntu stemcell does not behave like this.
Maybe it is related to systemd defaulting to MS_SHARED mount propagation: all host mounts (sdc1, ...) are visible from within the runc container (checked with runc exec).

error deleting vm with `device or resource busy`

Task 7 | 01:40:38 | Error: CPI error 'Bosh::Clouds::CloudError' with message 'Deleting vm '{{35c8612d-6ced-4fc9-5702-d03f45d6da98}}': Removing ephemeral bind mount: remove /var/vcap/store/warden_cpi/ephemeral_bind_mounts_dir/35c8612d-6ced-4fc9-5702-d03f45d6da98/sys/run: device or resource busy' in 'delete_vm' CPI method

persistent disks are not persistent when using garden-runc v1.1.1 or v1.2.0

Haven't tried this on other versions of garden-runc, but at least on 1.1.1 and 1.2.0, persistent disks do not persist past the lifespan of the VM/garden container.

To reproduce:

  1. Deploy a bosh-lite with garden-runc v1.1.1 or 1.2.0 as the garden backend.
  2. Deploy a BOSH deployment to the bosh-lite that requires persistent disk.
  3. Make changes to that persistent disk.
  4. bosh recreate the VM with persistent disk
  5. The persistent data is gone (expected it to persist)

Notes:
It appears that the mount point isn't being propagated into the garden container properly: the persistent-disk writes from the container go to the /var/vcap/store/warden_cpi/persistent_bind_mounts_dir/<cid>/<disk_id> directory rather than to the device that was mounted on top of that directory. When the container is destroyed and recreated, it gets a new cid and a new persistent_bind_mounts_dir directory.

My best guess is that some mount propagation that worked under garden-linux is missing under garden-runc. When I spoke to the Garden team about this, they seemed to think it was an issue with the warden CPI.
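
To check whether propagation is the culprit, a small sketch like the one below (assuming the default /var/vcap/store/warden_cpi layout) can dump the propagation flags from /proc/self/mountinfo for the persistent bind mount directories; a mount that lacks a shared:N field will generally not propagate later disk mounts into the container.

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) < 7 {
			continue
		}
		mountPoint := fields[4]
		if !strings.HasPrefix(mountPoint, "/var/vcap/store/warden_cpi") {
			continue
		}
		// Optional fields (e.g. "shared:42", "master:7") sit between the mount
		// options and the "-" separator; their absence means a private mount.
		var propagation []string
		for _, opt := range fields[6:] {
			if opt == "-" {
				break
			}
			propagation = append(propagation, opt)
		}
		fmt.Printf("%-80s %v\n", mountPoint, propagation)
	}
}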

Critical error when migrating disks

Hi Dmitriy,

When I try to migrate a persistent disk to a bigger size with the Warden CPI v40, I get the following error about the new device already being mounted.

Using environment '192.168.50.6' as client 'admin'

Using deployment 'concourse'

  instance_groups:
  - name: db
-   persistent_disk_type: 1GB
+   persistent_disk_type: 20GB
Task 83

Task 83 | 19:31:16 | Preparing deployment: Preparing deployment (00:00:02)
Task 83 | 19:31:20 | Preparing package compilation: Finding packages to compile (00:00:01)
Task 83 | 19:31:21 | Updating instance db: db/95ca9a87-276b-437a-8987-eaa68dece35f (0) (canary) (00:00:18)
                   L Error: Action Failed get_task: Task 346b01d4-8157-4f1d-4ad0-58b4cdeeb194 result: Migrating persistent disk: Remounting new disk on original mountpoint: Device /var/vcap/store/warden_cpi/disks/0559a7e7-bc72-4c9f-5ae3-c7804773d292 is already mounted to /warden-cpi-dev/0559a7e7-bc72-4c9f-5ae3-c7804773d292, can't mount to /var/vcap/store
Task 83 | 19:31:39 | Error: Action Failed get_task: Task 346b01d4-8157-4f1d-4ad0-58b4cdeeb194 result: Migrating persistent disk: Remounting new disk on original mountpoint: Device /var/vcap/store/warden_cpi/disks/0559a7e7-bc72-4c9f-5ae3-c7804773d292 is already mounted to /warden-cpi-dev/0559a7e7-bc72-4c9f-5ae3-c7804773d292, can't mount to /var/vcap/store

Task 83 Started  Wed Nov 14 19:31:16 UTC 2018
Task 83 Finished Wed Nov 14 19:31:39 UTC 2018
Task 83 Duration 00:00:23
Task 83 error

Updating deployment:
  Expected task '83' to succeed but state is 'error'

Exit code 1

When I run bosh deploy again, I get this other error about the mount point not being found.

Using environment '192.168.50.6' as client 'admin'

Task 84

Task 84 | 19:31:54 | Preparing deployment: Preparing deployment (00:00:03)
Task 84 | 19:31:59 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 84 | 19:31:59 | Updating instance db: db/95ca9a87-276b-437a-8987-eaa68dece35f (0) (canary) (00:00:12)
                   L Error: Action Failed get_task: Task 5b060f0c-0669-404d-51da-bc38a2e04a2b result: Migrating persistent disk: Remounting new disk on original mountpoint: Error finding device for mount point /var/vcap/store_migration_target: <nil cause>
Task 84 | 19:32:11 | Error: Action Failed get_task: Task 5b060f0c-0669-404d-51da-bc38a2e04a2b result: Migrating persistent disk: Remounting new disk on original mountpoint: Error finding device for mount point /var/vcap/store_migration_target: <nil cause>

Task 84 Started  Wed Nov 14 19:31:54 UTC 2018
Task 84 Finished Wed Nov 14 19:32:11 UTC 2018
Task 84 Duration 00:00:17
Task 84 error

Capturing task '84' output:
  Expected task '84' to succeed but state is 'error'

Exit code 1

From then on, any subsequent bosh deploy ends up with the second error.

I'm running BOSH v268.2.0 with Warden CPI v40 and Garden-RunC v1.16.3. My setup is BOSH-Lite from latest bosh-deployment commit 74eec90, as of 2018-10-24.

I hope you can spot the issue, because being able to migrate persistent disks without errors is critical for me!

Best,
Benjamin
