
Comments (36)

bgrant0607 commented:

Thoughts on multi-container pods:

First of all, I think restart behavior should be specified at the pod level rather than the container level. It wouldn't make sense for one container to terminate and another to restart forever, for example.

Run forever is obviously easy -- we're doing it now.

Run once is fairly easy, too, I think. As soon as one container terminates, probably all should be terminated (call this policy "any").

For run until success, we could restart individual containers until each succeeds (call this policy "all").

We should make all vs. any a separate policy from forever vs. success vs. once. Another variant people would probably want is a "leader" container, to which the other containers' lifetimes would be tied. Since we start containers in order, the leader would need to be the first one in the list. To play devil's advocate, if we had event hooks (#140), the user could probably implement "any" and "leader" policies if we only provided "all" semantics.
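
To make the forever/success/once and all/any/leader split concrete, here is a purely hypothetical sketch of how such a pod-level policy could be expressed. The field names and values are invented for illustration and are not what was eventually adopted:

# Hypothetical sketch only; these field names were never part of the API.
kind: Pod
spec:
  restartPolicy:
    mode: untilSuccess      # forever | untilSuccess | once
    scope: all              # all | any | leader (leader = first container in the list)
  containers:
  - name: leader
    image: example/main     # hypothetical image
  - name: worker
    image: example/batch    # hypothetical image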

Now, set-level behavior:

replicationController at least needs to know the conditions under which it should replace terminated/lost instances. It's hard to provide precise success semantics since containers can be lost with indeterminate exit status, but that's technically true even for a single pod. replicationController should be able to see the restart policies and termination reasons of the pods it controls. If a pod terminates and should not be restarted, I think replicationController should just automatically reduce its desired replica count by one.

I could also imagine users wanting any/all/leader behavior at the set level. However, I don't think we should do that, which leads me to believe we shouldn't do it at the pod level for now, either. If we were to provide the functionality at the set level, it shouldn't be tied to replicationController. Instead, it would be a separate policy resource associated with the pods via its own label selector. This would allow it to work at either the service level or replicationController level or over any other grouping the user desired. We should ensure that it's not too hard to implement these types of policies using event hooks.

bgrant0607 commented:

FWIW, two people have recommended the Erlang supervisor model/spec:
http://www.erlang.org/doc/design_principles/sup_princ.html
http://www.erlang.org/doc/man/supervisor.html

IIUC, Erlang calls my "any" policy "one for all" and my "all" policy "one for one".

smarterclayton commented:

How would an API client know that the individual container "succeeded"? (for a definition of success)

bgrant0607 commented:

@smarterclayton If you're asking about how Kubernetes will detect successful termination, we need machine-friendly, actionable status codes from Docker and from libcontainer (#137). Every process management and workflow orchestration system in the universe is going to need that. Normal termination with exit code 0 should indicate success.

If you're asking how Kubernetes's clients would detect termination, they could poll the currentState of the pod. We don't really have per-container status information there yet -- we'd need to add that. A Watch API would be better than polling -- that's worth an issue of its own. We could also provide a special status communication mechanism to containers and/or to event hooks (e.g., tmpfs file or environment variable). On top of these primitives, we could build a library and command-line operation to wait for termination and return the status.
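
For illustration only, using present-day kubectl syntax (which did not exist at the time of this discussion), polling a pod's per-container termination status and watching for changes look roughly like this:

# Poll the exit code of the first container, once per-container status is reported:
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}'

# Watch instead of polling:
kubectl get pod my-pod --watch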

aspyker commented:

Ran into this with Acme Air for case #3 (run-once), for initial database loader processes.

smarterclayton commented:

@bgrant0607 Probably fair to restate my question as whether you have a model in mind (based on previous experience at Google) that defines what you feel is a scalable and reliable mechanism for communicating fault and exit information to an API consumer - for instance, to implement run-once or run-at-least-once containers that are expected to exit and not restart as aspyker mentioned.

For instance, Mesos defines a protocol between master and slave that attempts to provide some level of guarantees for communicating the exit status, subject to the same limitations you noted above about not being truly deterministic. That model assumes bidirectional communication between master and slave, which Kubernetes does not assume.

Agree that watch from client->slave or client->master->slave is better than polling, although it seems more difficult to scale the master when the number of watchers+tasks grows. Do you see the master recording exit status for run-once containers in a central store, or that being a separate subsystem that could scale orthogonally to the api server / replication server and aggregate events generated by the minions? I could imagine that transient failures of containers with a "restart-always" policy would be useful for an API consumer to know - to be able to see that container X restarted at time T1, T2, and T3.

bgrant0607 commented:

@smarterclayton First, I think the master should delegate the basic restart policy to the slave: always restart, restart on failure, never restart. The master should only handle cross-node restarts directly. And, yes, the master should pull status from the slaves and store it (#156), as well as periodically check their health (#193). As scaling demands grow, that responsibility could indeed be split out to a separate component or set of components.

Reason for last termination (#137), termination message from the application (#139), time of last termination, and number of terminations should be among the information collected. State of terminated containers/pods should be kept around on the slaves long enough for the master to observe it the vast majority of the time (e.g., 10x the normal polling interval, or 2x the period after which an unresponsive node would be considered failed, anyway; explicit decoupling of stop vs. delete would also enable the master to control deletion of observed terminated containers/pods), but the master would record unobserved terminations as having failed, ideally with as much specificity as possible about what happened (node was unresponsive, node failed, etc.). A monotonically increasing count of restarts could be converted to approximate recency, sliding-window counts, rates, and other useful information by continuous observers. A means of setting or resetting the count is sometimes useful, but non-essential. Termination events could also be buffered (in a bounded-sized ring buffer with an event sequence number) and streamed off the slave and logged for debugging, but shouldn't be necessary for correctness, since events could always be lost.
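
For reference, the kind of per-container information described above eventually surfaced in pod status roughly as follows. This is an illustrative excerpt using the present-day schema, not anything that existed at the time:

status:
  containerStatuses:
  - name: loader
    restartCount: 3               # monotonically increasing restart count
    state:
      terminated:
        exitCode: 0
        reason: Completed
        finishedAt: "2015-01-01T00:00:00Z"
    lastState:
      terminated:
        exitCode: 1
        reason: Error             # reason for the previous termination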

Reasons for system-initiated container stoppages (e.g., due to liveness probe failures - #66) should be explicitly recorded (#137), but can be treated as failures with respect to restart policy. User-initiated cancellation should override the restart policy, as should user-initiated restarts (#159).

With more comprehensive failure information from libcontainer and Docker we could distinguish container setup errors from later-stage execution failures, but if in doubt, the slave (and master) should be conservative about not restarting "run once" containers that may have started execution.

Containers should have unique identifiers so the system doesn't confuse different instances or incarnations (#199).

Overall system architecture for availability, scalability, fault tolerance, etc. should be discussed elsewhere.

bgrant0607 commented:

Two relevant Docker issues being discussed:
moby/moby#26 auto-restart processes
moby/moby#1311 production-ready process monitoring

The former is converging towards supporting restarts in Docker, with the 3 modes proposed here: always, on failure, never.
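
For illustration, those three modes map onto the --restart flag that Docker later shipped (present-day syntax, shown only to make the semantics concrete):

docker run --restart=no redis            # never restart (the default)
docker run --restart=on-failure:5 redis  # restart on non-zero exit, at most 5 times
docker run --restart=always redis        # always restart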

The latter has been debating the merits of "docker exec", which would not run the process under supervision of the Docker daemon. The motivation is to facilitate management by process managers such as systemd, supervisord, upstart, monit, runit, etc. This approach is attractive for a number of reasons.

smarterclayton commented:

While not called out in the latter issue, the blocking dependency is the ability for the daemon to continue to offer a consistent API for managing containers. This was one of the inspirations for libswarm - allowing the daemon to connect to a process running in the container namespace in order to issue commands that affect the container as a unit (stop, stream logs, execute a new process). The refactored Docker engine to allow that currently exists in a branch of Michael Crosby's, but libchan and swarm are not mature enough yet to provide that behavior.

thockin commented:

All of this sounds reasonable to me, except the part about it being specified per-pod rather than per-container. I don't think it is far-fetched to have an initial loader container that runs to completion when a pod lands on a host and then exits, while the main server is in "run forever" mode.

I don't think forcing the spec to be per-pod buys any simplicity, either. Containers are the things that actually run, why would I spec the policy on the pod?

lexlapax commented:

Been following this thread. I hope my comments are welcome, as I/we are trying to figure out a way to contribute actual code, configurations etc.

The way I was looking at it, it makes sense for the restart behavior to be at the pod level rather than at the container level, keeping with the abstraction that pods expose a service (composed of one or more containers that may communicate among themselves and may share compute/network/storage resources).

For the behavior around singleton containers, you can always have a pod with just one container, which would get you the same thing.

The notion of pods as service endpoints is much more powerful than the notion of singleton containers as service endpoints.

This again deviates slightly from the original Docker intent - that a container is a service encapsulation - which is not entirely true; that's why you have Docker links, and now things like etcd- or DNS-based inter-container linkages, which start to break down when it comes to dependencies, etc.

The pod abstraction helps in that regard, and as stated, you could always have one-container pods.

thockin commented:

You don't have to sell me on pods. My concern is that attaching restarts to pods feels artificial for very little gain (it's not much simpler, really) and makes some easy-to-imagine use cases impossible.

lexlapax commented:

Simplicity-wise, how would this be different conceptually in Unix from, say, a kill signal to a group of processes (pod) vs. a kill signal to a single process (container)? Implementation-wise, it should just cascade down to individual processes.

ironcladlou commented:

Per-container policies seem the most flexible to me. Another example of a use for per-container policy would be adding a run-once container to an existing pod.

You can compose pod-level behavior using container-level policy, but the inverse is not true.

One disadvantage I can see to per-container is added complexity to the spec. Maybe defaults can help with this. Related: could a pod-scoped default for containers make sense, or would that add more cognitive overhead than it's worth?

thockin commented:

Dan, I expect the average number of containers per pod to be low - less than 5 for the vast majority of cases - so I don't think that the logic to support a pod-level default is worthwhile (yet?). It would also set a precedent for the API that we would sort of be expected to follow for other things, and that will just lead to complexity in the code.

If it turns out to be a pain point, we can always add more API later - but getting rid of API is harder.

ironcladlou commented:

@thockin Points well taken. I agree that the complexity of an additional pod-level API is premature.

pmorie commented:

I think the policy has to be configurable at the container level, but a pod-level default would be convenient to have in the spec. If the policy is only configurable at the pod level, it seems that would prevent you from being able to run a transient task (run-once) in a pod of run-forever containers.

pmorie commented:

I only read @thockin's points after posting the above comment. I accept them; I can live without a pod-level default at the moment.

lexlapax commented:

I would agree as well, as long as we're open to having a way to extend those APIs to the pod level later, when required. Thanks.

bgrant0607 commented:

@thockin @smarterclayton @ironcladlou @lexlapax @pmorie @dchen1107

Regarding per-pod vs. per-container restart policies:

Pods are not intended to be scheduling targets, and containers within a pod are not intended to be used for intra-pod workflows. We have no plans to support arbitrary execution-order dependencies between containers within a pod, for example.

The containers are part of the definition of the pod. The containers associated with a pod should only change via explicit update of the pod definition to add/remove/update its containers. When one container terminates, that container should not be removed from the definition of the pod implicitly, but should have its termination status reported.

The reason for a pod-level restart policy is because it affects the lifetime of the pod as a whole. A pod should only be terminated once all the containers within it terminate.

I have not been able to think of a single valid use case where containers within the pod should have different restart policies. It seems confusing, unnecessarily complex, and likely to promote misuse of the pod abstraction, such as with all proposed use cases in this issue.

The common case for multi-container pods is for services where all containers run forever. The common case for batch and test workloads that terminate is just one container per pod. We should allow multiple containers that terminate, but we need to implement clean semantics for this case.

One-off or periodic operations should be performed using runin (which we should expose as soon as the first-class support in Docker is completed), by forking from within a container, or with entirely new pods.

Things like initial data loaders should be triggered using event hooks (#140). Restart policy is not sufficient to make this work.

I also don't want to implement increasingly complex restart policies in the core, but instead provide hooks such that users or higher-level APIs can implement whatever policies they please. In fact, we could entirely punt on restart policies with the right hooks, by giving users the option to nuke their own containers, pods, or sets of containers upon termination, before they restart. However, a simple restart policy would be easier to use for common batch and test use cases, and would convey useful semantic/intent information to the system about the type of workload being run.
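
For reference, the simple pod-level policy that eventually landed (#1147) is a single field on the pod spec; a minimal sketch using the present-day API shape and a hypothetical image name:

apiVersion: v1
kind: Pod
metadata:
  name: db-loader
spec:
  restartPolicy: OnFailure      # Always | OnFailure | Never; applies to all containers in the pod
  containers:
  - name: loader
    image: example/db-loader    # hypothetical batch image that loads data and exits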

lavalamp commented:

> One-off or periodic operations should be performed using runin (which we should expose as soon as the first-class support in Docker is completed), by forking from within a container, or with entirely new pods.

Can you expand on this? Reusing a pod for a periodic action doesn't seem consistent with our model to me.

bgrant0607 commented:

@lavalamp I was thinking runin would be used for the one-off case, mostly, such as for debugging, data recovery, or emergency maintenance.

Continuous background and/or periodic use cases include:

  • cleanup / GC / maintenance / wipeout
  • serving data generation / aggregation / indexing / import
  • defensive analysis (spam, abuse, dos, etc.)
  • logs processing / billing / audit / report generation
  • integrity checking / validation
  • online/offline feedback / adaptation / machine learning
  • data snapshots / copies / backups
  • periodic build/push

"Cloud-native" workloads would store the data to distributed/cloud storage and launch new pods to do the processing, similar to Scheduled Tasks in AppEngine.

Legacy workloads that store data on local storage volumes would need these tasks to run locally, and/or have some way of offloading the data. Some people (e.g., https://news.ycombinator.com/item?id=7258009) argue that one should run cron inside the container to do this, but then that would require a process manager / supervisor. Instead, one could run it in a container by itself, accessing a shared volume, either files in the volume or named sockets or pipes, similar to how log management can be handled.
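
A minimal sketch of that last pattern, assuming a present-day pod spec and hypothetical image names (the sidecar's entrypoint runs cron and operates on files in the shared volume):

apiVersion: v1
kind: Pod
metadata:
  name: legacy-app
spec:
  containers:
  - name: app
    image: example/legacy-app       # hypothetical main application image
    volumeMounts:
    - name: shared-data
      mountPath: /data
  - name: cron-sidecar
    image: example/cron-tasks       # hypothetical image whose entrypoint runs cron
    volumeMounts:
    - name: shared-data
      mountPath: /data              # periodic jobs read and write the shared files
  volumes:
  - name: shared-data
    emptyDir: {}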

dchen1107 commented:

I had a long offline discussion with @bgrant0607 this morning, and I agreed with him, based on the definition that a pod is a scheduling unit, not a scheduling target. Once you accept that definition, a list of potential use cases involving intra-pod workflows can be ruled out. Enabling intra-pod workflows through hierarchical scheduling is too complicated, error-prone, and unnecessary for most use cases, if not all. The use cases that have a run_forever controller container in a pod and a bunch of run_til_succeed batch jobs listening to the controller should be handled at a higher level.

I came up with several possible use cases. One is a pre-config container that only runs once and personalizes the pod for a service. Brian pointed out that it could be handled by an event hook. I agreed that an event hook is a clean way to handle this, even if it is much harder for users to use at the beginning. Another use case is a cron-type job or a debugging process, but the run-in feature should handle that.

Beyond those, I couldn't come up with any more use cases that need different restart policies for containers in a given pod. If a service job wants to run forever, its monitoring and logging collector jobs should also run forever. If a canary version of a service wants to run once, all its helper containers only need to run once.

I actually started a PR to introduce a restart policy at the container level, based on my instinct and my past experience. But I failed to convince myself with a solid, valid use case given the pod definition. That is why I called a meeting with Brian, and he obviously convinced me on this very topic.

bgrant0607 commented:

Meanwhile: moby/moby#7226

thockin commented:

Which is, notably, per-container. I get the argument you are making, but I don't find per-container restart policies to be particularly complicated to understand or model. Is it really worth diverging from Docker here?

dchen1107 commented:

The following PRs are merged. I am closing this issue for the 0.5 release for now.

#1147 Introduce the simplest RestartPolicy and handling.
#1199 Passing pod UUID to Kubelet.
#1262 Fix PodInfo to include last terminated container info when calling.

For more advanced feature support, please file separate issues and loop me in. We are working on the v1beta3 API, which will change some of the API related to RestartPolicy.

bkeroackdsc commented:

"I have not been able to think of a single valid use case where containers within the pod should have different restart policies. It seems confusing, unnecessarily complex, and likely to promote misuse of the pod abstraction, such as with all proposed use cases in this issue."

My use case (#9836) involves an application which reads etcd for configuration--this configuration must be populated on pod startup, but will vary based upon execution environment (controlled by environment variables set in the RC spec). I have a container which runs an application that writes a templated set of configuration data to etcd and then exits.

So the alternative to this pattern is to have a "configurator" pod, which is tightly bound to another "application" pod, and must discover it via service lookup? That seems unnecessarily complex and confusing to me.

Forgive me if I'm misunderstanding the issues here. Per-container restart policy seems like the obviously correct answer.

smarterclayton commented:

The alternative is a start-hook that does all this configuration before your main container is started.

thockin commented:

I don't think this issue should be re-opened. There are a number of options for how to do this (some may be more implemented than others).

  • Wrap your real app in a loader script that fetches config, writes it, then exec()s your app
  • Implement an init-phase of a pod, wherein all init containers must exit successfully before all non-init containers can start (see the sketch after this list)
  • Run a sidecar container that writes an "OK to go" file to a volume, block your app on the existence of that file
  • Make your app wait until the etcd value you need exists, gracefully handle it not existing or being too old
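
Option 2 is essentially what later shipped as init containers; a minimal sketch using the present-day initContainers field, with hypothetical image names:

apiVersion: v1
kind: Pod
metadata:
  name: app-with-init
spec:
  restartPolicy: Always
  initContainers:                  # all must exit successfully before the app containers start
  - name: load-config
    image: example/etcd-loader     # hypothetical configurator image; writes config, then exits
  containers:
  - name: app
    image: example/app             # hypothetical application image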

I do not think that changing the restart policy to per-container is the right answer (and I argued for it earlier in this issue - I was wrong).

davidopp commented:

> Implement an init-phase of a pod, wherein all init containers must exit successfully before all non-init containers can start

So implicitly, the restart policy for the init containers is always RestartPolicyOnFailure?

> Run a sidecar container that writes an "OK to go" file to a volume, block your app on the existence of that file

What is the restart policy for the sidecar container?

> Make your app wait until the etcd value you need exists, gracefully handle it not existing or being too old

How does the etcd configuration get populated?

davidopp commented:

(I'll re-close the issue but am curious about the above.)

bkeroackdsc commented:

@thockin My hack works for now, but I'm thinking option 1 (wrap application in loader script) is probably the best way forward if per-container restart policies are off the table. The pod init phase concept is interesting, though.

thockin commented:

> Implement an init-phase of a pod, wherein all init containers must exit successfully before all non-init containers can start
>
> So implicitly, the restart policy for the init containers is always RestartPolicyOnFailure?

Pretty much.

> Run a sidecar container that writes an "OK to go" file to a volume, block your app on the existence of that file
>
> What is the restart policy for the sidecar container?

Same as the rest of the pod; idempotency is for the user to provide.

> Make your app wait until the etcd value you need exists, gracefully handle it not existing or being too old
>
> How does the etcd configuration get populated?

By a side-car loader container.

thockin commented:

Yeah, I think an init phase is a pretty nice feature.

bgrant0607 commented:

If initializing a specific volume, then #831 is the way to go.

Otherwise, a pod-level pre-start hook.

This doesn't need to be reopened -- see also #9836 and #4282.

maicohjf commented:

Create a Kubernetes Secret as follows:
  Name: super-secret
  Credential: alice or username: bob
Create a Pod named pod-secrets-via-file using the redis image, which mounts a secret named super-secret at /secrets.
Create a second Pod named pod-secrets-via-env using the redis image, which exports credential / username as TOPSECRET / CREDENTIALS.

kubectl create secret generic super-secret --from-literal=Credential=alice

apiVersion: v1
kind: Pod
metadata:
  name: pod-secrets-via-file
spec:
  containers:
  - name: pod-secrets-via-file
    image: redis
    volumeMounts:
    - name: super-secret
      mountPath: "/secrets"
  volumes:
  - name: super-secret
    secret:
      secretName: super-secret

apiVersion: v1
kind: Pod
metadata:
  name: pod-secrets-via-env
spec:
  containers:
  - name: pod-secrets-via-env
    image: redis
    env:
    - name: TOPSECRET
      valueFrom:
        secretKeyRef:
          name: super-secret
          key: Credential
