
csi-sanlock-lvm's People

Contributors

aleofreddi, mfielding

csi-sanlock-lvm's Issues

Volume expansion fails

Volume expansion fails due to a regression:

E0308 19:35:44.055829       1 grpclogger.go:33] gRPC[31984]: call /csi.v1.Node/NodeExpandVolume({"capacity_range":{"required_bytes":2147483648},"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-ffc98d75-0051-43a2-896d-5f23f6201545/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_id":"csi-v-laOElFJHKsXUYx8hy7itaD8qm4oocHj+pqJV6XUcvNU@vg_ssd","volume_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-ffc98d75-0051-43a2-896d-5f23f6201545/globalmount"}) returned error rpc error: code = Internal desc = failed to resize filesystem: rpc error: code = Internal desc = failed to resize volume /dev/vg_ssd/csi-v-laOElFJHKsXUYx8hy7itaD8qm4oocHj+pqJV6XUcvNU:

ReadWriteMany support

This is a great and neat project for k8s clusters with a single shared disk, but there is a little fly in the ointment: it lacks ReadWriteMany accessMode support, which blocks me from using it for live migrations in a KubeVirt setup.

Have you considered implementing it? As far as I know, shared LVs support a non-exclusive (shared) activation mode where multiple hosts can work with a single LV (`lvchange -asy`), as shown below.
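
For reference, shared activation is requested like this (the VG/LV name is a placeholder):

lvchange --activate sy vg_ssd/my_lv   # short form: lvchange -asy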

Pass all the csi-sanity tests

While some tests fail due to missing features, others expose real bugs. Below is the summary:

Summarizing 11 Failures:

[Fail] Node Service NodeUnpublishVolume [It] should fail when the volume is missing
/root/go/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/node.go:435

[Fail] Node Service NodeStageVolume [It] should fail when no volume capability is provided
/root/go/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/node.go:514

[Fail] Node Service [It] should be idempotent
/root/go/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/node.go:196

[Fail] ExpandVolume [Controller Server] [It] should work
/root/go/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:1850

[Fail] ListSnapshots [Controller Server] [It] should return next token when a limited number of entries are requested
/root/go/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:1534

[Fail] DeleteSnapshot [Controller Server] [It] should fail when no snapshot id is provided
/root/go/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:1592

[Fail] Controller Service [Controller Server] CreateVolume [It] should fail when requesting to create a volume with already existing name and different capacity.
/root/go/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:602

[Fail] Controller Service [Controller Server] CreateVolume [It] should fail when the volume source snapshot is not found
/root/go/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:728

[Fail] Controller Service [Controller Server] CreateVolume [It] should fail when the volume source volume is not found
/root/go/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:785

[Fail] Controller Service [Controller Server] DeleteVolume [It] should fail when no volume id is provided
/root/go/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:807

[Fail] Controller Service [Controller Server] ValidateVolumeCapabilities [It] should fail when the requested volume does not exist
/root/go/src/github.com/kubernetes-csi/csi-test/v3/pkg/sanity/controller.go:1009

Ran 55 of 74 Specs in 150.593 seconds
FAIL! -- 44 Passed | 11 Failed | 1 Pending | 18 Skipped

CreateVolume fails under heavy load due to `DeadlineExceeded`

Example:

E0714 19:55:25.171104       1 grpclogger.go:33] gRPC[3505]: call /csi.v1.Controller/CreateVolume({"accessibility_requirements":{"preferred":[{"segments":{"csi-sanlock-lvm/topology":"1805"}},{"segments":{"csi-sanlock-lvm/topology":"1804"}}],"requisite":[{"segments":{"csi-sanlock-lvm/topology":"1805"}},{"segments":{"csi-sanlock-lvm/topology":"1804"}}]},"capacity_range":{"required_bytes":34359738368},"name":"pvc-bc55dfac-2491-4642-b64c-fbb39f5bebe7","parameters":{"volumeGroup":"vg_hdd"},"volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}) returned error rpc error: code = Internal desc = failed to fetch volume csi-v-2tzFm2lfv_KJIdaQOcG5IXtgpbu7PvVa6IWkJaBwD8g@vg_hdd: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Support mount permissions

With non-root containers (for example the PostgreSQL ones installed by bitnami/postgresql), we have to resort to workarounds like volume-permissions-parameters to get the persistent volume correctly initialized.

Looking at the CSI spec, it seems that VolumeCapability.MountVolume.volume_mount_group would be the correct way to handle this directly from the driver; hence this ticket.
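
A minimal sketch of what that could look like on the node side, assuming volume_mount_group carries a numeric GID (the field is available from CSI spec v1.5; the function below is hypothetical):

import (
	"os"
	"strconv"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// applyMountGroup chowns the staged mount point to the requested group and
// makes it group-writable, with the setgid bit so new files inherit the group.
func applyMountGroup(req *csi.NodeStageVolumeRequest, target string) error {
	group := req.GetVolumeCapability().GetMount().GetVolumeMountGroup()
	if group == "" {
		return nil
	}
	gid, err := strconv.Atoi(group)
	if err != nil {
		return status.Errorf(codes.InvalidArgument, "invalid volume_mount_group %q: %v", group, err)
	}
	if err := os.Chown(target, -1, gid); err != nil {
		return err
	}
	return os.Chmod(target, 0o770|os.ModeSetgid)
}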

pvc resize fails on edge case

I've observed a case where the resize operation hung because its pod was missing.

Under these conditions, it seems that the volume is not mounted anywhere, so resize2fs fails as follows:

resize2fs 1.44.1 (24-Mar-2018)
ext2fs_check_mount_point: Can't check if filesystem is mounted due to missing mtab file while determining whether /dev/vg_ssd/csl-v-aRvD2SRI0xT9LJ4JNrVDoOS1GQlez0il6HVCX6EJkWs is mounted.

To be investigated.
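
One thing worth checking (an assumption, not verified yet): e2fsprogs relies on /etc/mtab to detect mounted filesystems, and minimal container images often lack it. A common workaround is to symlink it to the kernel's mount table inside the plugin container:

ln -sf /proc/self/mounts /etc/mtab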

ListSnapshots causes a panic

I0318 10:59:00.866137    6099 grpclogger.go:29] gRPC[6]: call /csi.v1.Controller/ListSnapshots
panic: runtime error: index out of range [1] with length 1

goroutine 36 [running]:
github.com/aleofreddi/csi-sanlock-lvm/driverd/pkg.volumeIdToVgLv(...)
	/root/csi-sanlock-lvm/driverd/pkg/controllerserver.go:574
github.com/aleofreddi/csi-sanlock-lvm/driverd/pkg.(*controllerServer).ListSnapshots(0xc00010ec90, 0xa688c0, 0xc0001e4c90, 0xc0001f6c60, 0x0, 0x0, 0x0)
	/root/csi-sanlock-lvm/driverd/pkg/controllerserver.go:532 +0x8e4
github.com/container-storage-interface/spec/lib/go/csi._Controller_ListSnapshots_Handler.func1(0xa688c0, 0xc0001e4c90, 0x9759c0, 0xc0001f6c60, 0xc0001d8aa0, 0x2, 0x2, 0xc0001e8a80)
	/root/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5326 +0x86
github.com/aleofreddi/csi-sanlock-lvm/lvmctrld/pkg.GrpcLogger(0xa688c0, 0xc0001e4c90, 0x9759c0, 0xc0001f6c60, 0xc0001e8a60, 0xc0001e8a80, 0xc0001d8b30, 0x55dab8, 0x95aee0, 0xc0001e4c90)
	/root/csi-sanlock-lvm/lvmctrld/pkg/grpclogger.go:31 +0x210
github.com/container-storage-interface/spec/lib/go/csi._Controller_ListSnapshots_Handler(0x96caa0, 0xc00010ec90, 0xa688c0, 0xc0001e4c90, 0xc0001f6c00, 0x9d32c8, 0xa688c0, 0xc0001e4c90, 0xc0001e2300, 0xf)
	/root/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5328 +0x14b
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000080c00, 0xa6cf00, 0xc000228300, 0xc0001fc300, 0xc0001424e0, 0xde9730, 0x0, 0x0, 0x0)
	/root/go/pkg/mod/github.com/grpc/[email protected]/server.go:1024 +0x4f4
google.golang.org/grpc.(*Server).handleStream(0xc000080c00, 0xa6cf00, 0xc000228300, 0xc0001fc300, 0x0)
	/root/go/pkg/mod/github.com/grpc/[email protected]/server.go:1313 +0xd97
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc0001480b0, 0xc000080c00, 0xa6cf00, 0xc000228300, 0xc0001fc300)
	/root/go/pkg/mod/github.com/grpc/[email protected]/server.go:722 +0xbb
created by google.golang.org/grpc.(*Server).serveStreams.func1
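
The panic points at volumeIdToVgLv, which suggests the @ separator in the volume ID (e.g. csi-v-...@vg_ssd) is assumed to always be present. A minimal defensive sketch (the real signature may differ):

import (
	"fmt"
	"strings"
)

// volumeIdToVgLv splits a volume id of the form <lv>@<vg>, failing
// gracefully instead of panicking on malformed input.
func volumeIdToVgLv(volumeID string) (vg, lv string, err error) {
	parts := strings.SplitN(volumeID, "@", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", fmt.Errorf("malformed volume id %q: expected <lv>@<vg>", volumeID)
	}
	return parts[1], parts[0], nil
}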

Promote from alpha to beta

I've been using this driver for some time now; I'd like to take the opportunity to do some more cleanups and move it to beta, so that upgrades are supported :)

snapshot creation fails with `Size is not a multiple of 512` error

Using maxSizePct can lead to snapshot sizes that are not a multiple of 512 bytes, which makes the snapshot fail with the following error:

Message:     Failed to check and update snapshot content: failed to take snapshot of the volume csl-v-Ep1UaCzhbQEP1ovLaYemh1bZzPXF68JIG+nqbdBJaKg@vg_ssd: "rpc error: code = Unknown desc = rpc error: code = Unknown desc = unexpected error: rc=3 stdout=\"\" stderr=\"  Size is not a multiple of 512. Try using 429496320 or 429496832.\\n  Invalid argument for --size: 429496729b\\n  Error during parsing of command line.\\n\""
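
A minimal sketch of the likely fix, assuming the snapshot size is computed in bytes before being passed to lvcreate: round it up to the next 512-byte multiple. For the size in the error above, 429496729 rounds up to 429496832, matching lvcreate's own suggestion.

// roundUpTo512 rounds a byte count up to the next multiple of 512,
// the granularity lvcreate --size expects.
func roundUpTo512(bytes int64) int64 {
	const sector = 512
	return (bytes + sector - 1) / sector * sector
}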

Support volumes backed by snapshots

LVM snapshots can be used as copy-on-write forks of their origin volume. This would come in handy for various use cases, backup in particular: having volumes backed by snapshots would help a lot, because one could snapshot the target volume, create a new (copy-on-write) volume from the snapshot, and then run their preferred backup utility on it (say, spin up a job that runs bacula-fd on the temporary volume).

As of now, one can achieve the same result by materialising the snapshot back into a new volume, but that wastes resources: you need at least twice the space of the origin volume, and you waste IOPS copying the snapshot back into a volume.

But there is a catch: on clustered LVM, snapshots can only be activated alongside their origin, so from a Kubernetes perspective a volume backed by a snapshot must be published on the same node as its origin (if any).

Quoting @saad-ali: while the CSI spec doesn't support this use case, one way you could make it work on k8s is to add a new controller to your CSI driver. This controller would be responsible for watching any PV objects created by your CSI driver and updating the VolumeNodeAffinity fields when the source volume moves. This assumes that the VolumeNodeAffinity field is mutable (I'm not sure, you'll have to double check). And to prevent races, you'll need to make sure that the response to the CreateVolume call always contains the initial topology constraints (so the PV is never in a state without a VolumeNodeAffinity).
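
A minimal sketch of that last point, assuming the origin node ID is used as the segment value for the csi-sanlock-lvm/topology key that appears in the driver's logs:

import (
	"github.com/container-storage-interface/spec/lib/go/csi"
)

// createVolumeResponse always attaches the initial topology constraint, so
// the resulting PV is never left without a VolumeNodeAffinity.
func createVolumeResponse(volumeID string, capacityBytes int64, nodeID string) *csi.CreateVolumeResponse {
	return &csi.CreateVolumeResponse{
		Volume: &csi.Volume{
			VolumeId:      volumeID,
			CapacityBytes: capacityBytes,
			AccessibleTopology: []*csi.Topology{
				{Segments: map[string]string{"csi-sanlock-lvm/topology": nodeID}},
			},
		},
	}
}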

lvmctrld crashloops because wdmd won't start

I0104 17:51:24.876377       1 main.go:40] Starting lvmctrld latest (994ed36)
I0104 17:51:24.876493       1 lock.go:45] Running wdmd with args [-D]
I0104 17:51:24.877936       1 lock.go:45] Running sanlock with args [daemon -w 1 -D]
I0104 17:51:24.885214       1 lock.go:45] Running lvmlockd with args [--host-id 1292 -f]
F0104 17:51:24.950312       1 lock.go:53] Process wdmd terminated with error: exit status 255

ext2fs_check_if_mount: Can't check if filesystem is mounted due to missing mtab file

After recovering from a resize gone wrong due to insufficient disk space, I've encountered this one (the same mtab symptom as the edge case above):

csi-sanlock-lvm-system csi-sanlock-lvm-plugin-sx6mv driverd E1223 12:01:08.808550       1 grpclogger.go:33] gRPC[8589495]: call /csi.v1.Node/NodeExpandVolume({"capacity_range":{"required_bytes":34359738368},"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-a51d19ac-54ca-4f22-a2f7-1138d9632a42/globalmount","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":1}},"volume_id":"csl-v-RPNQIvCBXgY1WCXoqW9XALZ8qx1MU79yyvMXACZ+FqY@vg_ssd","volume_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-a51d19ac-54ca-4f22-a2f7-1138d9632a42/globalmount"}) returned error rpc error: code = Internal desc = failed to resize filesystem: rpc error: code = Internal desc = failed to resize volume /dev/vg_ssd/csl-v-RPNQIvCBXgY1WCXoqW9XALZ8qx1MU79yyvMXACZ+FqY: exit status 1 ( ext2fs_check_if_mount: Can't check if filesystem is mounted due to missing mtab file while determining whether /dev/mapper/vg_ssd-csl--v--RPNQIvCBXgY1WCXoqW9XALZ8qx1MU79yyvMXACZ+FqY is mounted.

pv creation fails with PVC.spec.resources.requests.storage is not a multiple of 512

Similar to #34.

Creating a PVC whose storage request is not a multiple of 512 causes an error:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nonmul512
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: "11362347344"
  storageClassName: sharedlun
  volumeMode: Filesystem

The provisioner then reports:

  Warning  ProvisioningFailed    12s (x5 over 27s)  csi-sanlock-lvm.csi.vleo.net_csi-sanlock-lvm-provisioner-0_e5642b25-76ae-42dc-adfb-00f25f1e33c0  failed to provision volume with StorageClass "sharedlun": rpc error: code = Unknown desc = failed to create volume csl-v-oCR6aqE2yL_2b46egYSliYc2Jbc_5KgGKzSEAsL2Yew@sharedlun: rpc error: code = Unknown desc = unexpected error: rc=3 stdout="" stderr="  Size is not a multiple of 512. Try using 11362347008 or 11362347520.\n  Invalid argument for --size: 11362347344b\n  Error during parsing of command line.\n"
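
As with #34, the likely fix is to round the requested size up on the driver side (11362347344 rounds up to 11362347520, matching the second suggestion in the error). A minimal sketch, assuming the rounding happens in CreateVolume and that limit_bytes must still be honored:

import (
	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// effectiveSize rounds required_bytes up to a 512-byte multiple and fails
// with OutOfRange if the rounded size no longer fits limit_bytes.
func effectiveSize(req *csi.CreateVolumeRequest) (int64, error) {
	size := (req.GetCapacityRange().GetRequiredBytes() + 511) / 512 * 512
	if limit := req.GetCapacityRange().GetLimitBytes(); limit > 0 && size > limit {
		return 0, status.Errorf(codes.OutOfRange, "rounded size %d exceeds limit %d", size, limit)
	}
	return size, nil
}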

Failed to initialize listener: failed to update lvmctrld address: failed to update tags

I0522 16:35:07.029908       1 lvmctrldclient.go:124] Connecting to tcp://192.168.14.1:9000
I0522 16:35:07.030151       1 lvmctrldclient.go:152] Still trying, connection CONNECTING
I0522 16:35:07.031635       1 lvmctrldclient.go:149] Connected
E0522 16:35:07.170483       1 main.go:44] Failed to initialize listener: failed to update lvmctrld address: failed to update tags on vg_hdd/csi-v-7cv4sY+62TYKmDTrBhLZbPh_RZVH18k73YksT3R0F8o: rpc error: code = PermissionDenied desc = target is locked by another host

pvc stuck in pending state

Creating a PVC that requires more space than is available causes it to get stuck in the Pending state:

Warning  ProvisioningFailed    2m51s (x9 over 6m51s)  csi-sanlock-lvm.csi.vleo.net_csi-sanlock-lvm-provisioner-0_674c5717-5a48-4888-a5e6-ff851b7aa034  failed to provision volume with StorageClass "ssd-storageclass": rpc error: code = Internal desc = failed to create volume csl-v-3C+7ur310_Iusl8KjI4B4Um9+HCv4y_xffD3cwY7BNA@vg_ssd: rpc error: code = OutOfRange desc = insufficient free space

This one should fail with OutOfRange as per https://github.com/container-storage-interface/spec/blob/master/spec.md#createvolume-errors; however, it seems that the out-of-range error is wrapped into a generic one, which confuses Kubernetes.
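
A minimal sketch of a fix, assuming the wrapping happens in the controller server: inspect the inner gRPC status and preserve its code instead of flattening everything to a generic error:

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// wrapVolumeError keeps the inner status code (e.g. OutOfRange) visible to
// the CO instead of always reporting a generic error.
func wrapVolumeError(volumeID string, err error) error {
	if s, ok := status.FromError(err); ok && s.Code() != codes.Unknown {
		return status.Errorf(s.Code(), "failed to create volume %s: %s", volumeID, s.Message())
	}
	return status.Errorf(codes.Internal, "failed to create volume %s: %v", volumeID, err)
}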

Support multiple writers on the same node

As of now we only support the SINGLE_NODE_WRITER and SINGLE_NODE_READER_ONLY access modes, which makes Deployments with a PV impractical (even with a single replica, rolling updates won't work because they require two pods to coexist for some time).

So far I've been sticking to StatefulSets; however, the CSI spec now has two new modes that might help here (see the sketch after this list):

  • SINGLE_NODE_SINGLE_WRITER;
  • SINGLE_NODE_MULTI_WRITER.
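
A minimal sketch, assuming the driver bumps its container-storage-interface dependency to spec v1.5+, where these modes were introduced:

import (
	"github.com/container-storage-interface/spec/lib/go/csi"
)

// supportedAccessModes advertises the two new single-node modes alongside
// the ones the driver already supports.
var supportedAccessModes = []csi.VolumeCapability_AccessMode_Mode{
	csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER,
	csi.VolumeCapability_AccessMode_SINGLE_NODE_READER_ONLY,
	csi.VolumeCapability_AccessMode_SINGLE_NODE_SINGLE_WRITER,
	csi.VolumeCapability_AccessMode_SINGLE_NODE_MULTI_WRITER,
}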

Can't expand pvc: `didn't find a plugin capable of expanding the volume`

Lately I've observed this error on PVC resize:

Warning  ExternalExpanding  4m54s  volume_expand  Ignoring the PVC: didn't find a plugin capable of expanding the volume; waiting for an external controller to process this PVC.

It seems that csi-sanlock-lvm-resizer-0 fails to list pods:

E0305 15:51:16.194095       1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:csi-sanlock-lvm-system:csi-resizer" cannot list resource "pods" in API group "" at the cluster scope
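
A minimal RBAC sketch that would grant the missing permission (the role name is an assumption, and the exact verbs should be checked against the external-resizer version in use):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: csi-resizer-pods
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]

This ClusterRole would then be bound to the csi-resizer service account in the csi-sanlock-lvm-system namespace via a ClusterRoleBinding.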

lvmctrld stuck due to `Global lock failed: error -210`

Found an instance stuck as follows:

E1111 17:31:06.529205       1 grpclogger.go:33] gRPC[1095848]: call /LvmCtrld/Vgs({}) returned error unexpected error: rc=5 stdout="  {\n      \"report\": [\n          {\n              \"vg\": [\n              ]\n          }\n      ]\n  }\n" stderr="  Global lock failed: error -210\n"

It seems -210 maps to ENOLS (lockspace not found), which would indicate the lockspace backing the global lock was not started on that host.

resize doesn't work anymore

It seems there is a regression:

Events:
  Type     Reason             Age   From           Message
  ----     ------             ----  ----           -------
  Warning  ExternalExpanding  11m   volume_expand  Ignoring the PVC: didn't find a plugin capable of expanding the volume; waiting for an external controller to process this PVC.

protoc-gen-go requires option go_package

Got the following warning while building with libprotoc 3.11.4:

2020/04/22 20:39:52 WARNING: Missing 'go_package' option in "proto/lvmctrld.proto", please specify:
	option go_package = "proto";
A future release of protoc-gen-go will require this be specified.
See https://developers.google.com/protocol-buffers/docs/reference/go-generated#package for more information.
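
The fix suggested by the warning, sketched against proto/lvmctrld.proto (the proto package name shown here is an assumption):

syntax = "proto3";

package lvmctrld;

// Declare the Go package to silence the protoc-gen-go warning.
option go_package = "proto";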

Mailbox/diskrpc vacuum

Sending RPC calls to a stale node can cause the RPC volume to fill up, and there is currently no mechanism to time out such requests and reclaim the space.

The proposal is to add a background job that periodically checks the state of every queue, timing out and freeing messages that have been stuck for a long time (see the sketch below).
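
A minimal sketch of such a job; the Queue type and its Expire method are hypothetical stand-ins for the diskrpc structures:

import (
	"context"
	"time"
)

// Queue is a hypothetical stand-in for a diskrpc message queue.
type Queue interface {
	// Expire frees all messages enqueued before the given cutoff.
	Expire(cutoff time.Time)
}

// vacuumLoop periodically expires messages older than maxAge.
func vacuumLoop(ctx context.Context, queues []Queue, maxAge, every time.Duration) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			cutoff := time.Now().Add(-maxAge)
			for _, q := range queues {
				q.Expire(cutoff)
			}
		}
	}
}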

Some manifests use deprecated APIs

Some manifests use deprecated APIs and need to be updated:

% kubectl apply -k "https://github.com/aleofreddi/csi-sanlock-lvm/deploy/kubernetes-$kver?ref=v0.4"
...
Warning: apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
...
Warning: storage.k8s.io/v1beta1 CSIDriver is deprecated in v1.19+, unavailable in v1.22+; use storage.k8s.io/v1 CSIDriver
...
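
The warnings map directly to the replacement API versions; note that the v1 CustomResourceDefinition schema has structural requirements beyond a plain apiVersion bump:

apiVersion: apiextensions.k8s.io/v1   # was apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
---
apiVersion: storage.k8s.io/v1         # was storage.k8s.io/v1beta1
kind: CSIDriver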
