
csi-lib's Introduction

cert-manager project logo


cert-manager

cert-manager adds certificates and certificate issuers as resource types in Kubernetes clusters, and simplifies the process of obtaining, renewing and using those certificates.

It supports issuing certificates from a variety of sources, including Let's Encrypt (ACME), HashiCorp Vault, and Venafi TPP / TLS Protect Cloud, as well as local in-cluster issuance.

cert-manager also ensures certificates remain valid and up to date, attempting to renew certificates at an appropriate time before expiry to reduce the risk of outages and remove toil.

cert-manager high level overview diagram

Documentation

Documentation for cert-manager can be found at cert-manager.io.

For the common use-case of automatically issuing TLS certificates for Ingress resources, see the cert-manager nginx-ingress quick start guide.

For a more comprehensive guide to issuing your first certificate, see our getting started guide.

Installation

Installation is documented on the website, with a variety of supported methods.

Troubleshooting

If you encounter any issues whilst using cert-manager, we have a number of ways to get help:

If you believe you've found a bug and cannot find an existing issue, feel free to open a new issue! Be sure to include as much information as you can about your environment.

Community

The cert-manager-dev Google Group is used for project-wide announcements and development coordination. Anybody can join the group by visiting the group page and clicking "Join Group". A Google account is required to join the group.

Meetings

We have several public meetings which any member of our Google Group is more than welcome to join!

Check out the details on our website. Feel free to drop in and ask questions, chat with us or just to say hi!

Contributing

We welcome pull requests with open arms! There's a lot of work to do here, and we're especially concerned with ensuring the longevity and reliability of the project. The contributing guide will help you get started.

Coding Conventions

Code style guidelines are documented on the coding conventions page of the cert-manager website. Please try to follow those guidelines if you're submitting a pull request for cert-manager.

Importing cert-manager as a Module

⚠️ Please note that cert-manager does not currently provide a Go module compatibility guarantee. That means that most code under pkg/ is subject to change in a breaking way, even between minor or patch releases and even if the code is currently publicly exported.

The lack of a Go module compatibility guarantee does not affect API version guarantees under the Kubernetes Deprecation Policy.

For more details see Importing cert-manager in Go on the cert-manager website.

The import path for cert-manager versions 1.8 and later is github.com/cert-manager/cert-manager.

For all versions of cert-manager before 1.8, including minor and patch releases, the import path is github.com/jetstack/cert-manager.
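For illustration only (not part of the official docs), importing the post-1.8 module path from Go might look like the sketch below; pkg/apis/certmanager/v1 is the commonly used public API types package and, per the note above, carries no compatibility guarantee:

package main

import (
	"fmt"

	cmapi "github.com/cert-manager/cert-manager/pkg/apis/certmanager/v1"
)

func main() {
	// Versions before 1.8 would instead be imported as
	// github.com/jetstack/cert-manager/pkg/apis/certmanager/v1.
	cr := &cmapi.CertificateRequest{}
	cr.Name = "example"
	fmt.Println(cr.Name)
}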

Security Reporting

Security is the number one priority for cert-manager. If you think you've found a security vulnerability, we'd love to hear from you.

Follow the instructions in SECURITY.md to make a report.

Changelog

Every release on GitHub has a changelog, and we also publish release notes on the website.

History

cert-manager is loosely based upon the work of kube-lego and has borrowed some wisdom from other similar projects such as kube-cert-manager.

Logo design by Zoe Paterson


csi-lib's Issues

[bug] labelSelectorForVolume fails to add labels

We noticed that only one CertificateRequest object exists for a pod with two CSI volumes.
From the watch log, it seems we had two CSRs, but one got purged somehow.

The root cause is that labels.Selector.Add returns a new selector rather than mutating the receiver:

sel := labels.NewSelector()
req, err := labels.NewRequirement(...)
sel.Add(*req) // this does not add req to the selector...

A simple fix is:

sel = sel.Add(*req)
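For reference, a minimal standalone sketch of the discard-vs-assign behaviour; the label key here is made up for the example and is not the key csi-lib actually uses:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/selection"
)

func main() {
	sel := labels.NewSelector()
	// "example.io/volume-id" is an illustrative key, not csi-lib's real label.
	req, err := labels.NewRequirement("example.io/volume-id", selection.Equals, []string{"csi-123"})
	if err != nil {
		panic(err)
	}

	sel.Add(*req)             // return value discarded: selector stays empty
	fmt.Println(sel.String()) // ""

	sel = sel.Add(*req)       // fixed: keep the returned selector
	fmt.Println(sel.String()) // "example.io/volume-id=csi-123"
}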

Optional auto rotating/renewing certificates

One of the key features of csi-lib is automatically rotating/renewing certificates near expiry.
But should we consider making this feature optional?

One particular use case is a one-time, short-lived cert for an init container (for mTLS to pull some secrets).
The user container no longer needs the cert after consuming it, but since the pod is still running, csi-lib will continue the renewal logic for this short-lived cert. Following least-privilege guidance, should we disable renewal in this case?

Upon checking the code, all certificates are auto-renewed once they hit the NextIssuanceTime:
https://github.com/cert-manager/csi-lib/blob/v0.3.0/manager/manager.go#L499
A workaround is to set the NextIssuanceTime much longer than the pod lifetime, but adding an option here (passed in as a volumeAttribute) would be much cleaner; a sketch of what that could look like follows.
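A hedged sketch of such an opt-out; the attribute key below is hypothetical and does not exist in csi-lib today:

package sketch

// shouldStartRenewal is a hypothetical helper: csi-lib has no such option today,
// and the attribute key below is invented purely for illustration.
func shouldStartRenewal(volumeAttributes map[string]string) bool {
	// If the (hypothetical) opt-out attribute is set, skip scheduling renewals
	// for this volume after the initial issuance.
	return volumeAttributes["csi.cert-manager.io/disable-auto-renewal"] != "true"
}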

Drivers can create CertificateRequests for pods that don't exist in very rare edge cases

In the NodePublishVolume call, we have a defer that calls UnmanageVolume (and deletes metadata from the storage backend) if initial issuance fails:

// clean up after ourselves if provisioning fails.
// this is required because if publishing never succeeds, unpublish is not
// called which leaves files around (and we may continue to renew if so).
success := false
defer func() {
	if !success {
		ns.manager.UnmanageVolume(req.GetVolumeId())
		_ = ns.mounter.Unmount(req.GetTargetPath())
		_ = ns.store.RemoveVolume(req.GetVolumeId())
	}
}()

If the driver is stopped during this initial 30s period, and the pod is also deleted whilst the driver is stopped, then because the publish step never succeeded, the UnpublishVolume step will never be called in future.

Upon the driver starting up again, it will read the metadata.json file and then attempt to request the certificate for that pod again.

Because the pod no longer exists, the UnpublishVolume step will never be called and therefore the certificate data will never be cleaned up on disk (and the driver will continue to process renewals for the volume indefinitely, until an administrator manually cleans up the metadata file on disk and triggers a restart of the driver).

We should do whatever we can to avoid this state occurring, as it causes excessive churn on nodes, in the apiserver, and for signers.

One option would be, on startup of the driver, to look for metadata.json files that do not have a nextIssuanceTime set (which implies issuance has never succeeded), delete/clean up that data on disk, and await NodePublishVolume being called again (for the case where the pod does still exist and is waiting to start up); a sketch of this pass follows. There may be some edge cases we have not thought of, however, whereby the pod does still exist and is provisioned yet for some reason this field is not set, though I don't think that is a possible state to get into...
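A rough sketch of that startup pass; the field name and on-disk layout here are assumptions for illustration, not csi-lib's actual metadata schema:

package sketch

import (
	"encoding/json"
	"log"
	"os"
	"path/filepath"
	"time"
)

// volumeMetadata models only the field this proposal cares about; the real
// metadata.json written by csi-lib contains more fields.
type volumeMetadata struct {
	NextIssuanceTime *time.Time `json:"nextIssuanceTime,omitempty"`
}

// cleanupNeverIssued removes volume directories whose metadata has no
// nextIssuanceTime set (i.e. issuance never completed), so that a later
// NodePublishVolume call can start from a clean slate.
func cleanupNeverIssued(dataRoot string) error {
	entries, err := os.ReadDir(dataRoot)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		raw, err := os.ReadFile(filepath.Join(dataRoot, e.Name(), "metadata.json"))
		if err != nil {
			continue // missing metadata files are a separate issue (see below)
		}
		var meta volumeMetadata
		if err := json.Unmarshal(raw, &meta); err != nil || meta.NextIssuanceTime == nil {
			log.Printf("removing never-issued volume %q", e.Name())
			if err := os.RemoveAll(filepath.Join(dataRoot, e.Name())); err != nil {
				return err
			}
		}
	}
	return nil
}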

Driver won't recover if a volume does not contain a metadata file

Problem Statement

In a rare case, if a volume does not contain a metadata file, the driver won't start up (it keeps crashing) in manager.NewManagerOrDie. This is because the driver tries to ensure all volumes are accessible; if one is not, it returns an error that causes a panic.

for _, dir := range dirs {
	file, err := f.fs.Open(f.metadataPathForVolumeID(dir.Name()))
	if err != nil {
		// discovered a volume/directory that does not contain a metadata file
		// TODO: log this error to allow startup to continue
		return nil, err
	}
	// immediately close the file as we just need to verify it exists
	file.Close()
	vols = append(vols, dir.Name())
}

Here is a sample error we have seen:

panic: failed to start manager: listing existing volumes: open csi-data-dir/inmemfs/csi-04b4520253413b4a3120e28f454cc781c11a759627c25a8bbd3f536e8c1c2020/metadata.json: no such file or directory

goroutine 1 [running]:
github.com/cert-manager/csi-lib/manager.NewManagerOrDie({{0x20d1870, 0xc000490450}, 0x0, {0x20cb308, 0xc0004c5b80}, {0x20e4388, 0x3023568}, 0xc000490588, 0x0, {0x7ffdf920c895, ...}, ...})
	/go/pkg/mod/github.com/cert-manager/[email protected]/manager/manager.go:237 +0xa5

This situation could happen when: Pod is created --> NodePublishVolume called --> volume created --> OOM. Thus the following defer function would never be called:

defer func() {
	if !success {
		ns.manager.UnmanageVolume(req.GetVolumeId())
		_ = ns.mounter.Unmount(req.GetTargetPath())
		_ = ns.store.RemoveVolume(req.GetVolumeId())
	}
}()

The Fix

To fix this issue, we could simply log and/or delete the offending volume and proceed with the rest of startup; see the sketch below.
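A minimal sketch of that behaviour, reworking the listing loop to skip (rather than fail on) a directory with no metadata file; names here are simplified and not the exact csi-lib code:

package sketch

import (
	"io/fs"
	"log"
	"path"
)

// listVolumes returns the volume directories that contain a metadata.json,
// logging and skipping any that do not, instead of returning an error that
// would make NewManagerOrDie panic.
func listVolumes(fsys fs.FS, dirs []fs.DirEntry) []string {
	var vols []string
	for _, dir := range dirs {
		metaPath := path.Join(dir.Name(), "metadata.json")
		if _, err := fs.Stat(fsys, metaPath); err != nil {
			log.Printf("skipping volume %q: no metadata file: %v", dir.Name(), err)
			continue
		}
		vols = append(vols, dir.Name())
	}
	return vols
}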

Allow data-root to be an absolute path

Currently, because the data root field on the filesystem/storage struct is plumbed through the Go io/fs interface, it does not support absolute paths.

This makes it confusing for users when configuring their CSI driver deployments, as a relative path must be used (meaning they must have knowledge of the working directory configured in the container).

At the moment, because the working directory is /, this means that users have to strip the leading / from the data directory argument when starting their CSI driver.
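To illustrate the constraint (a standalone example, not csi-lib code): io/fs path names must be unrooted, so an absolute data root fails validation, and one common workaround is to root an os.DirFS at / and strip the leading slash:

package main

import (
	"fmt"
	"io/fs"
	"os"
	"strings"
)

func main() {
	fmt.Println(fs.ValidPath("csi-data-dir"))  // true
	fmt.Println(fs.ValidPath("/csi-data-dir")) // false: absolute paths are rejected

	// Workaround: root the FS at "/" and treat the configured absolute path
	// as relative to it by trimming the leading slash.
	fsys := os.DirFS("/")
	rel := strings.TrimPrefix("/csi-data-dir", "/")
	if _, err := fs.Stat(fsys, rel); err != nil {
		fmt.Println("data root not found:", err)
	}
}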

/kind bug
/priority important-soon

Failed to read metadata: not found

We deployed the csi-driver to two Kubernetes 1.19 clusters. They periodically start spamming errors every second saying that the CSI volume is not found anymore.
Example (the error repeats for every volume, every second):

E1127 11:09:00.005373 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-c62ccf515ec64bae8f06fdb36d4a2079e3f2c5aecc91c7ef14cda485836c8bb8"
E1127 11:09:00.011191 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-437664f63a3af2714f91ee6f7c28ce39bb2810d338a9ec0652d87a1b0ebf7cdf"
E1127 11:09:00.025988 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-fbca1dd6a8e7cde2140d3c8b401f0551791d42dd6a84b6536e6dc438928388ea"
E1127 11:09:00.031470 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-89ebfc0f44740cf55bcc32dc92121ac7dc9fc9d88c608343c76322fc2c4b5347"
E1127 11:09:00.042389 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-40dcbbf685f818a191215fe4b0409d296303dee1c2482a4c6855e1bcf79ad0fc"
E1127 11:09:00.052478 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-db647efd61631afe7203e2291292245b01f000ca82d6d88e8f59befd6e05df6d"
E1127 11:09:00.058340 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-47485531b1cad209a46179af6f7352e3f9abb2e45fd0de2b73ab5f3c8d9aa543"
E1127 11:09:00.062959 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-347ba22bc8bd0a6a447e652b06adb1e47bf441ea03d871471558662e1b30f6a9"
E1127 11:09:00.123721 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-82a36d963d482dcfd15293947e5a21478db708abf746c5bb5aab1bfa9bee1f5d"
E1127 11:09:00.135457 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-ff9562742144c98ee8cc34d8e2f4f65b4361523b2841f762a32a9f404807725d"
E1127 11:09:00.146480 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-494fb192617fe461bc54d6009a53f7b5182a6d234295c224a4cf6b70a890b770"
E1127 11:09:00.147540 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-86d15b62ff043b1b46dbaff93959f889a8927006dc41aa24f77aadc1f2744a67"
E1127 11:09:00.156740 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-ac8085a7d8cb7cf81d5c2a0f344844d588f170decba4ef6f8316ae428e4be092"
E1127 11:09:00.163012 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-cebc70d6bfa48aae23f68b5a7e39281092572d4c62f4570fd4853cba52ee96be"
E1127 11:09:00.207421 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-540a55e995ddd32ac13f5317e6af85b4013b96a83a07d0a82808f5b0e951f87f"
E1127 11:09:00.208836 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-363f986cf5d931c54c02296a8aada9c0b7d686acce842966e00dce64f9eda5e1"
E1127 11:09:00.210545 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-a3e712107ff31e899a44c6bde807d55ab6433f27b7c493a84bc28324451031ba"
E1127 11:09:00.235020 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-f7086201ca296691f705ffb8938c8e62f0b01a812b440933c0ce886f964e4ac3"
E1127 11:09:00.235027 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-93772e80b03d346265878f48e02aaa4630ed11bdb16d29581157511c86c4e14a"
E1127 11:09:00.238896 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-de0ef9825bb0038c962e3ac59854d5e9d3d7df2a10d0e680f6449b84b62d0283"
E1127 11:09:00.252404 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-e4fbd00c2f89bc9871053f815e5e7d7e026c33ebdfa07b3028e05c2fb8e28d01"
E1127 11:09:00.269231 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-0c3934c55cecd8e037f648c0a1050754df229e444611b6105757a9a66bf94de5"
E1127 11:09:00.275499 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-3ae19fa7ef76d689b521e2e1d1efe49c859f09df0a7b7286bb84cdc1d841f8c5"
E1127 11:09:00.276340 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-b00704b29c77b11b3478cec51dc4f5d4e55de0023d6ddfd99da14abe2da8d1b5"
E1127 11:09:00.279666 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-5a3693f3802878bfa0fb8ec3e05e40091c73af80bc0b33b19595a05c5d359f3e"
E1127 11:09:00.293034 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-d36ccbb820903a32f472d693bf3f3450acfe76c02bca74149f5281c14fa854ba"
E1127 11:09:00.297746 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-f593159ed5d2d058120e4e619525aa302adc96e26afc9a89e118c2d2f4b04374"
E1127 11:09:00.309292 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-d3a864f7401a7c30919f0f38feabf20eb77d0bd45a59867fe4975a4f2cbc21b5"
E1127 11:09:00.311891 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-c75c098855b8cc4bd41d1901602221fc3a28b17a37cf75a7e09751852c1fa71b"
E1127 11:09:00.335159 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-4e2839b802be2de062704ea22bea70ffc7545c5c256da3f857111d10cf46c373"
E1127 11:09:00.371723 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-2abae01a959c965615605d1e137632bf539deb5c97faff2745aa4f068dfb3f9c"
E1127 11:09:00.385678 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-0837b8176fe47dee37d0170d82404366154c69b4d3104d7d3fb2859e333aa8ea"
E1127 11:09:00.397648 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-9b2aa410fd23f0660d43f63a6d2145cd8357323bb453bcb179c456dbc72c6fc5"
E1127 11:09:00.431602 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-92b6762220d76af9b94906a5fb546169a82cb3175b97a6c7fb4ed788b224703a"
E1127 11:09:00.463203 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-d0b5dc1da6a63f14cc432aba61d1719f03b5910a369a2abba011e258d9f9d1b0"
E1127 11:09:00.473169 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-2116c0023d6e6119ede6db38294598eeab901634280ed79d25489ed7b74ddaee"
E1127 11:09:00.474435 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-c1358f0bae71b3f1c232f71c045ed25419c6426969b97a3f536fb4553224b5d4"
E1127 11:09:00.497268 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-dd46f10d92085e4bda8569d39df1c0c191ea25208febf8b711bba7b190baa3ab"
E1127 11:09:00.521907 1 manager.go:485] manager "msg"="Failed to read metadata" "error"="reading metadata file: not found" "volume_id"="csi-8ddcc0fa1c558237bbe59998061bb102704451488f34c8f8145673eb20df7994" 

What's interesting: both clusters start erroring around the same time after they have been restarted. I'm not sure if that's a coincidence, as they are both managed by AWS EKS, or if it's related to the drivers having been running for the same amount of time.

Uptime of each driver instance is around 5-10 days.

CertificateRequest should be in a terminal state when it is denied by an approver

Related code block:
https://github.com/cert-manager/csi-lib/blob/main/manager/manager.go#L305-L309

// Handle cases where the request has been explicitly denied
if apiutil.CertificateRequestIsDenied(updatedReq) {
  cond := apiutil.GetCertificateRequestCondition(updatedReq, cmapi.CertificateRequestConditionDenied)
  return false, fmt.Errorf("request %q has been denied by the approval plugin: %s", updatedReq.Name, cond.Message)
}

Observed behavior:
When a CertificateRequest is denied by an approver, the lib will delete it and recreate one. If the approver automatically approves/denies CSRs based on some policies, this creates an infinite loop: create -> denied -> delete -> create again -> ...

Sample logs:

....
I0823 16:00:25.368111       1 nodeserver.go:76] driver "msg"="Registered new volume with storage backend" "pod_name"="csi-64d84b77c-vfblf"
I0823 16:00:25.368202       1 nodeserver.go:86] driver "msg"="Volume registered for management" "pod_name"="csi-64d84b77c-vfblf"
I0823 16:00:25.368376       1 nodeserver.go:97] driver "msg"="Waiting for certificate to be issued..." "pod_name"="csi-64d84b77c-vfblf"
I0823 16:00:26.368756       1 manager.go:514] manager "msg"="Triggering new issuance" "volume_id"="csi-1d842b8307fb94812b128fc0dad4284e119e757ca39bbc164a6a693e8cc2cb80"
I0823 16:00:26.368811       1 manager.go:249] manager "msg"="Processing issuance" "volume_id"="csi-1d842b8307fb94812b128fc0dad4284e119e757ca39bbc164a6a693e8cc2cb80"
I0823 16:00:26.385125       1 manager.go:379] manager "msg"="Deleted CertificateRequest resource" "name"="c7722b4d-cab5-4709-99ed-1576fd4f705e" "namespace"="poc" "volume_id"="csi-1d842b8307fb94812b128fc0dad4284e119e757ca39bbc164a6a693e8cc2cb80"
I0823 16:00:26.503681       1 manager.go:287] manager "msg"="Created new CertificateRequest resource" "volume_id"="csi-1d842b8307fb94812b128fc0dad4284e119e757ca39bbc164a6a693e8cc2cb80"
E0823 16:00:27.512440       1 manager.go:516] manager "msg"="Failed to issue certificate, retrying after applying exponential backoff" "error"="waiting for request: request \"560d3fde-78c0-4d64-9333-3663a672e7a2\" has been denied by the approval plugin: issuer is unauthorized" "volume_id"="csi-1d842b8307fb94812b128fc0dad4284e119e757ca39bbc164a6a693e8cc2cb80"
I0823 16:00:29.711691       1 manager.go:514] manager "msg"="Triggering new issuance" "volume_id"="csi-1d842b8307fb94812b128fc0dad4284e119e757ca39bbc164a6a693e8cc2cb80"
I0823 16:00:29.720993       1 manager.go:249] manager "msg"="Processing issuance" "volume_id"="csi-1d842b8307fb94812b128fc0dad4284e119e757ca39bbc164a6a693e8cc2cb80"
I0823 16:00:29.731453       1 manager.go:379] manager "msg"="Deleted CertificateRequest resource" "name"="560d3fde-78c0-4d64-9333-3663a672e7a2" "namespace"="poc" "volume_id"="csi-1d842b8307fb94812b128fc0dad4284e119e757ca39bbc164a6a693e8cc2cb80"
I0823 16:00:29.807662       1 manager.go:287] manager "msg"="Created new CertificateRequest resource" "volume_id"="csi-1d842b8307fb94812b128fc0dad4284e119e757ca39bbc164a6a693e8cc2cb80"
E0823 16:00:30.808675       1 manager.go:516] manager "msg"="Failed to issue certificate, retrying after applying exponential backoff" "error"="waiting for request: request \"fba62e01-28f9-4644-9f5f-694eaa174084\" has been denied by the approval plugin: issuer is unauthorized" "volume_id"="csi-1d842b8307fb94812b128fc0dad4284e119e757ca39bbc164a6a693e8cc2cb80"
....

Expected behavior:
When a CertificateRequest is denied, it should be in a terminal state and no new object should be created, or at least there should be a flag to stop this infinite loop.

A possible solution is to return true at this line (a sketch follows): https://github.com/cert-manager/csi-lib/blob/main/manager/manager.go#L308
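For reference, a hedged sketch of that change against the snippet quoted above (not the exact upstream diff):

// Treat denial as terminal: return true so the wait completes and the
// request is not deleted and recreated on the next issuance attempt.
if apiutil.CertificateRequestIsDenied(updatedReq) {
	cond := apiutil.GetCertificateRequestCondition(updatedReq, cmapi.CertificateRequestConditionDenied)
	return true, fmt.Errorf("request %q has been denied by the approval plugin: %s", updatedReq.Name, cond.Message)
}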

@munnerz

Expect ManageVolumeImmediate() to be a blocking call that waits for the certificate to be issued

ManageVolumeImmediate() was introduced in #35. After this PR, we observed inconsistent behavior with the following settings:

  • Driver's ContinueOnNotReady is set to true
  • Manager's ReadyToRequestFunc is set as AlwaysReadyToRequest

Before

A new pod won't come up until the certificate is issued and written to the host path.
Relevant code in v0.3.0:

if isReadyToRequest || !ns.continueOnNotReady {
	log.Info("Waiting for certificate to be issued...")
	if err := wait.PollUntil(time.Second, func() (done bool, err error) {
		return ns.manager.IsVolumeReady(req.GetVolumeId()), nil
	}, ctx.Done()); err != nil {
		return nil, err
	}

Now

A new pod can come up before the certificate is issued and written to the host path.
Relevant code in main:

  1. meta.NextIssuanceTime is set to the epoch time:

     // If continueOnNotReady is enabled, set the NextIssuanceTime in the metadata file to epoch.
     // This allows the manager to start management for the volume again on restart if the first
     // issuance did not successfully complete.
     if meta.NextIssuanceTime == nil && ns.continueOnNotReady {
         epoch := time.Time{}
         meta.NextIssuanceTime = &epoch
     }

  2. ManageVolumeImmediate is called:

     if isReadyToRequest {
         log.V(4).Info("Waiting for certificate to be issued...")
         if _, err := ns.manager.ManageVolumeImmediate(ctx, req.GetVolumeId()); err != nil {
             return nil, err
         }
         log.Info("Volume registered for management")

  3. The cert is issued asynchronously in the startRenewalRoutine call (csi-lib/manager/manager.go, lines 559 to 570 in 7dfb05b):

     // Only attempt issuance immediately if there isn't already an issued certificate
     if meta.NextIssuanceTime == nil {
         // If issuance fails, immediately return without retrying so the caller can decide
         // how to proceed depending on the context this method was called within.
         if err := m.issue(ctx, volumeID); err != nil {
             return true, err
         }
     }
     if !m.startRenewalRoutine(volumeID) {
         return true, fmt.Errorf("unexpected state: renewal routine not started, please open an issue at https://github.com/cert-manager/csi-lib")
     }

Thoughts

In most cases, this won't be an issue at all, but this inconsistent behavior may break deployments that don't check for the cert files (e.g., using an init container). A sample manifest that would be impacted by this:

...
spec:
  containers:
  - name: sample-app
    ...
    volumeMounts:
    - mountPath: /path/to/file/tls.crt
      name: csi-cert
      subPath: tls.crt
  volumes:
  - name: csi-cert
    csi:
      driver: sample.csi.driver
      readOnly: true
...

Since the kubelet cannot find the file, it will throw the following error and prevent the container from starting up:

state:
      waiting:
        message: 'failed to start container "ca79b3d8da42ce7f2d92409d0e52260005b1b02a1231997ad3679db95358128f":
          Error response from daemon: failed to create shim task: OCI runtime create
          failed: runc create failed: unable to start container process: error during
          container init: error mounting "/var/lib/kubelet/pods/449f2b9e-e0c5-4781-8177-626984f324bd/volume-subpaths/csi-cert/sample-app/1"
          to rootfs at "/path/to/file/tls.crt": mount /var/lib/kubelet/pods/449f2b9e-e0c5-4781-8177-626984f324bd/volume-subpaths/csi-cert/sample-app/1:/path/to/file/tls.crt
          (via /proc/self/fd/6), flags: 0x5001: not a directory: unknown: Are you
          trying to mount a directory onto a file (or vice-versa)? Check if the specified
          host path exists and is the expected type'
        reason: RunContainerError

CertificateRequests created by the csi driver are not garbage collected due to incorrect apiVersion

When pods are deleted, the corresponding CertificateRequest resources are supposed to be cleaned up by the Kubernetes garbage collector controller in the background, based on their ownerReferences. However, the background delete by the GC isn't happening due to an incorrect apiVersion for Pods here:
https://github.com/cert-manager/csi-lib/blob/v0.3.0/manager/manager.go#L414

Kubernetes version: v1.24.4
csi-lib version: 0.3.0

The kube-controller-manager GC controller logs (anonymized) show that the GC picks up the orphan CertificateRequest but never deletes it, and the logs below keep repeating for such an orphan object:

kube-controller-manager[3877594]: I0820 11:01:22.657996 3877594 garbagecollector.go:496] "Processing object" object="default/aaaaa-d5b687789-vfb57" objectUID=43e8b254-8667-4c3a-b5a7-9ae7329b75a9 kind="Pod" virtual=true
kube-controller-manager[3877594]: I0820 11:18:02.671261 3877594 garbagecollector.go:496] "Processing object" object="default/aaaaa-d5b687789-vfb57" objectUID=43e8b254-8667-4c3a-b5a7-9ae7329b75a9 kind="Pod" virtual=true
kube-controller-manager[3877594]: I0820 11:34:42.684420 3877594 garbagecollector.go:496] "Processing object" object="default/aaaaa-d5b687789-vfb57" objectUID=43e8b254-8667-4c3a-b5a7-9ae7329b75a9 kind="Pod" virtual=true
....

I tested this by editing an orphan CertificateRequest's ownerReferences apiVersion from core/v1 to v1 and restarting the kube-controller-manager to trigger a GC graph rebuild; that orphan CR resource was cleaned up instantly in the next GC controller sync.
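For illustration, a small sketch of constructing a Pod ownerReference with the apiVersion the garbage collector expects ("v1", not "core/v1"); this is not the upstream fix itself:

package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podOwnerRef builds an ownerReference for a Pod; SchemeGroupVersion for the
// core group renders as apiVersion "v1", which is what the GC graph expects.
func podOwnerRef(pod *corev1.Pod) metav1.OwnerReference {
	return *metav1.NewControllerRef(pod, corev1.SchemeGroupVersion.WithKind("Pod"))
}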

cc @munnerz

Race condition: CertificateRequests may never be fulfilled if the issuer was overwhelmed

Problem

In a large cluster (1k+ pods, each with multiple CSI volumes for client certs), during initial startup most new pods didn't get their certs issued. When checking the driver logs, we observed the driver creating/deleting the requests over and over again. If we scale up the cluster, things get worse, as some pods never come up.

Reproduce

We could reproduce it by faking a long work queue for the CSR controller:

  1. Install cert-manager, cert-manager-csi-driver, and example-app
  2. Bring down all pods: kubectl -n sandbox scale --replicas 0 deployment my-csi-app
  3. Disable CSR controller: kubectl -n cert-manager scale --replicas 0 deployment.apps/cert-manager
  4. Bring up one pod: kubectl -n sandbox scale --replicas 1 deployment my-csi-app
    • Checking the associated cert-manager-csi-driver pod logs, we noticed that a new CR object is created every 60s, and the old CR object is deleted at the same time.
$ kubectl -n sandbox get cr --watch
NAME                                   APPROVED   DENIED   READY   ISSUER              REQUESTOR                                                    AGE
ca-issuer-mb5mz                        True                True    selfsigned-issuer   system:serviceaccount:cert-manager:cert-manager              7m42s
e21c1ea5-02b5-4628-97f5-d59dc21cada6                               ca-issuer           system:serviceaccount:cert-manager:cert-manager-csi-driver   27s
e21c1ea5-02b5-4628-97f5-d59dc21cada6                               ca-issuer           system:serviceaccount:cert-manager:cert-manager-csi-driver   64s
b1585e95-48a9-4bcf-bc7b-99bb72450362                               ca-issuer           system:serviceaccount:cert-manager:cert-manager-csi-driver   0s
b1585e95-48a9-4bcf-bc7b-99bb72450362                               ca-issuer           system:serviceaccount:cert-manager:cert-manager-csi-driver   68s
4d6bc09b-9ef2-40c3-a4a0-7f6303b1330f                               ca-issuer           system:serviceaccount:cert-manager:cert-manager-csi-driver   0s

Sample logs:

I0225 00:00:56.220170       1 nodeserver.go:83] driver "msg"="Registered new volume with storage backend" "pod_name"="my-csi-app-7777dc4568-plmlk"
I0225 00:00:56.220495       1 manager.go:302] manager "msg"="Processing issuance" "volume_id"="csi-92e6eb2c6bfd04e120305b10fe32322f6aff0537cf747360444d060bff180c0b"
I0225 00:00:56.439959       1 manager.go:340] manager "msg"="Created new CertificateRequest resource" "volume_id"="csi-92e6eb2c6bfd04e120305b10fe32322f6aff0537cf747360444d060bff180c0b"
E0225 00:01:56.229502       1 server.go:109] driver "msg"="failed processing request" "error"="waiting for request: request \"1b56a7f2-a4fc-42c1-bac1-b716249fbd89\" has not yet been approved by approval plugin" "request"={} "rpc_method"="/csi.v1.Node/NodePublishVolume"
I0225 00:01:56.819356       1 nodeserver.go:83] driver "msg"="Registered new volume with storage backend" "pod_name"="my-csi-app-7777dc4568-plmlk"
I0225 00:01:56.819546       1 manager.go:302] manager "msg"="Processing issuance" "volume_id"="csi-92e6eb2c6bfd04e120305b10fe32322f6aff0537cf747360444d060bff180c0b"
I0225 00:01:56.836447       1 manager.go:467] manager "msg"="Deleted CertificateRequest resource" "name"="1b56a7f2-a4fc-42c1-bac1-b716249fbd89" "namespace"="sandbox" "volume_id"="csi-92e6eb2c6bfd04e120305b10fe32322f6aff0537cf747360444d060bff180c0b"
I0225 00:01:57.263983       1 manager.go:340] manager "msg"="Created new CertificateRequest resource" "volume_id"="csi-92e6eb2c6bfd04e120305b10fe32322f6aff0537cf747360444d060bff180c0b"

Root Cause

Because a large number of CertificateRequests are queued in the controller and not processed in time (within 60s), the driver times out and retries, which deletes the old CR and recreates a new one for another 60s.

This is in line with the following logic in the ManageVolumeImmediate(ctx) method:

  • call issue() once and wait up to 60s, as defined in this line:
    ctx, cancel := context.WithTimeout(ctx, time.Second*60)
  • unmount the volume if issuance failed -> the NodePublishVolume() call will be retried -> ManageVolumeImmediate(ctx) is retried

To Fix

For this issue, the main problem is that the CSR controller workers can never catch up with the CertificateRequest creation/deletion rate (every 60s). An easy solution is to NOT delete stale requests if they have never been processed by any worker (i.e., they have no conditions in the status field), so that they will eventually be picked up by a worker.

To extend this idea further, should we apply it to all CRs that have transient errors or are in a pending state? In particular, if a CR has already been approved manually but is not issued within 60s (the current timeout value) for some reason (e.g., an out-of-band issuer, a long worker queue, etc.), the CR will be deleted and a new CR created, so manual approval might be required again and again. A sketch of the proposed guard follows.
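A hedged sketch of the proposed guard, using the cert-manager v1 API types; this is not csi-lib's actual implementation:

package sketch

import (
	cmapi "github.com/cert-manager/cert-manager/pkg/apis/certmanager/v1"
)

// shouldDeleteStaleRequest only allows deletion of a request that some
// controller has actually processed (it has at least one status condition);
// an untouched request is left for the backlogged controller to pick up.
func shouldDeleteStaleRequest(req *cmapi.CertificateRequest) bool {
	return len(req.Status.Conditions) > 0
}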

Exponential backoff handling does not apply to certificate renewal in pending phase

Problem

When a certificate is due for renewal, the driver generates a new CertificateRequest and waits for it to become ready. If there is an error, the driver retries with exponential backoff. This feature was introduced by #35.
However, if the CertificateRequest is in a pending state (e.g., not yet handled by the CSR controller, not approved, or approved but not yet issued), the driver waits indefinitely instead of retrying with exponential backoff. This behavior is not in line with the failure case above.

Reproduce

We could reproduce it by faking a long work queue for the CSR controller:

  1. Install cert-manager, cert-manager-csi-driver, and example-app
  2. Add csi.cert-manager.io/duration: "1m" to the csi volumeAttributes in the above example-app, to reduce the renewal interval.
    • Checking the associated cert-manager-csi-driver pod logs, you should observe a "Triggering new issuance" message every 40s (2/3 of the duration).
  3. Now run kubectl -n cert-manager scale --replicas 0 deployment.apps/cert-manager to disable the CSR controller.
    • Checking the associated cert-manager-csi-driver pod logs again, there is no additional message after "Triggering new issuance"; that's because the driver is stuck in an infinite wait.

Root Cause

In the startRenewalRoutine() method, issue() is called at renewal time, controlled by RenewalBackoffConfig:

  • each FAILED call is retried after an interval set by RenewalBackoffConfig: e.g., 10s, 20s, 40s, ...
  • each PENDING call waits until the ctx timeout is reached

However, in this case the ctx has no timeout, as we use the ctx from these lines:

csi-lib/manager/manager.go, lines 608 to 609 in 6284a0d:

// Create a context that will be cancelled when the stopCh is closed
ctx, cancel := context.WithCancel(context.Background())

Therefore, since ctx has no timeout, the issue(ctx) call waits until the CertificateRequest has a Failed or Ready condition. In our example, the CertificateRequest is never processed, so it is considered PENDING and the wait never ends.

To Fix

Each retry in startRenewalRoutine() should have a (configurable) timeout; otherwise we won't see the next retry until the previous one returns an error. A sketch of that idea follows.
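A minimal sketch of the proposed per-attempt timeout; the issue function signature and the way it would be wired into startRenewalRoutine() are assumptions, not csi-lib's actual API:

package sketch

import (
	"context"
	"time"
)

// issueWithTimeout bounds a single renewal attempt so that a CertificateRequest
// stuck in a pending state surfaces as an error and the exponential backoff
// configured via RenewalBackoffConfig can take effect.
func issueWithTimeout(parent context.Context, issue func(context.Context) error, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	return issue(ctx)
}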
