Comments (12)
Here is what worked for me, based on the above suggestion. I mounted a script in the container that I run instead of the Docker entrypoint, shown below. It still takes 30s to time out the old master, but the new master appeared to be operational within about 4 seconds of the old master shutting down.
Note: I had already made another fix that seems conceptually necessary: adding a pre-shutdown hook on the master to delay the master's termination a bit if there is not a quorum + 1 of master nodes. On a k8s rolling upgrade, a restarted master node needs to be "ready" as soon as it has opened port 9200, i.e. before it has been added to the cluster. That allows the rolling upgrade to terminate the existing master before the new master-eligible node has fully joined, so the master election might not have a quorum.
if [[ -z $NODE_MASTER || "$NODE_MASTER" = "true" ]]; then
  # Run ES as a background task, forward SIGTERM to it, then wait for it to exit
  trap 'kill $(jobs -p)' SIGTERM
  /usr/local/bin/docker-entrypoint.sh elasticsearch &
  wait
  # Now keep the pod alive for 30s after ES dies so that we will refuse connections
  # from the new master rather than them needing to time out
  sleep 30
else
  exec /usr/local/bin/docker-entrypoint.sh elasticsearch
fi
from helm-charts.
I'm going to sync with the Elasticsearch team to see how feasible they would be. I'm also making a note to test this in Elasticsearch 7 because it now uses a different discovery method which may or may not be affected by this.
It is still affected by this. The problem is caused by the new master attempting to get all the nodes in the cluster to reconnect to the old master as part of the process of winning its first election, and waiting for this to time out before proceeding is the problem described in elastic/elasticsearch#29025. There's a related discussion here. This is not really affected by the changes to how discovery works in 7.0.
In 6.x the best solution we have is to reduce the connection timeout. If your cluster is not talking to remote systems then the connect timeout can reasonably be very short (e.g. 1s). Although reducing net.ipv4.tcp_retries2 will not directly help here, it's also a good idea. Reducing discovery.zen.fd.ping_timeout is not a great idea because it makes the cluster more sensitive to a long GC, and if the cluster is under memory pressure then removing some of its nodes could well be counterproductive. A lower net.ipv4.tcp_retries2 value, by contrast, allows us to be more sensitive to a network outage without also being sensitive to GC pauses.
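As an elasticsearch.yml fragment, that 6.x connect-timeout advice might look roughly like this (untested sketch; the 1s value is the illustrative figure from the suggestion above, and assumes the cluster has no slow remote connections):

```yaml
# elasticsearch.yml — illustrative value; only safe if all peers are on a fast local network
transport.tcp.connect_timeout: 1s
```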
In 7.x the same advice is basically true (at time of writing) but there is another option too: if you want to shut the master down then you can trigger an election first by excluding the current master from the voting configuration.
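In practice, the 7.x voting-exclusions approach might look like the sketch below (untested; the host, the node name multi-master-0, and the helper function are placeholders of mine, and the path-parameter form of the API is the 7.0-era shape):

```shell
# Sketch: trigger an election before shutting down the current master,
# by excluding it from the voting configuration first.
ES_HOST="${ES_HOST:-localhost:9200}"

# Build the voting-exclusions API path for a node (helper for illustration only)
exclusions_url() {
  echo "http://${ES_HOST}/_cluster/voting_config_exclusions/${1}"
}

# Against a live cluster you would run (not executed here):
#   curl -X POST "$(exclusions_url multi-master-0)?timeout=30s"   # master steps down, election runs
#   ...restart the node...
#   curl -X DELETE "http://${ES_HOST}/_cluster/voting_config_exclusions"  # clear the exclusion list
```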
I really appreciate all the attention this issue has received!
The workaround I decided to go with was wrapping the base Elasticsearch image with an entrypoint that traps some common exit signals, and allows the execution of a handler after Elasticsearch stops (see https://github.com/qoqodev/elasticsearch-hooks-docker). In this case, the "post-stop" handler would involve just sleeping, or waiting until a new master has been elected.
However, it looks like @DaveCTurner recently closed out the upstream issue elastic/elasticsearch#29025 with PR elastic/elasticsearch#31547 that should fix the slow re-election, so no workarounds should be necessary! Whenever those changes make it into an Elasticsearch release, whether it's 6.x or 7.x, I'll be glad to test it out! 😄
This has been merged into master but not yet released. I'm leaving this open until it is released and others have confirmed that this solution resolves the issue properly.
So it doesn't look like I'm alone! I found issue helm/charts#8785 that looks identical to this issue. For that issue, PR helm/charts#10687 was proposed and also submitted to this repo in PR #41. Unfortunately, I didn't have success when deploying the multi-node example with those PRs.
Here's a summary of the problem:
1. kubectl delete pod multi-master-0 sends a SIGTERM to the pod's container.
2. Elasticsearch shuts down gracefully, broadcasting that info to the other masters, and the pod is removed.
3. Re-election starts, and the other eligible masters try to ping the old master for confirmation (at least I'm guessing that's the motivation behind the ping?). However, the old master's pod is already gone and the associated IP address doesn't exist anymore. So, rather than getting a refused connection, the eligible masters just outright cannot connect. The timeout for this attempt is 30 seconds, and once it times out, re-election continues.
Whereas issue helm/charts#8785 talks about a total cluster outage where even reads are not possible, I'm thankfully not seeing that. For example, calls to /_search?q=* still work fine while the eligible masters are stuck in step (3), so that's good. The outage I'm experiencing is for writes and the /_cat APIs, but maybe there are more unresponsive endpoints.
The easiest solution to the connection timeout would be to just keep the pod running for a while after shutting down Elasticsearch. I tried to do this with a preStop lifecycle handler by manually sending a SIGTERM to the Java process and then just sleeping. However, since the container automatically dies when its main process (the entrypoint) dies, the sleep afterwards never happens.
So, how can we actually run code after Elasticsearch stops? We can modify the container's entrypoint to send Elasticsearch to the background, trap the SIGTERM upon pod termination, forward it to Elasticsearch, and then sleep!
For example, inside of the container spec:
containers:
- name: elasticsearch-master
  image: docker.elastic.co/elasticsearch/elasticsearch:6.6.0
  command: ["bash", "-c"]
  args: ["trap 'pkill -TERM java; sleep 5' TERM; docker-entrypoint.sh eswrapper & wait"]
Deploying our master nodes like this allows the pod to sit in a terminating state for 5 seconds after Elasticsearch stops, so that other masters can properly get a refused connection rather than timing out. As a result, writes and the /_cat APIs hang for no more than the expected re-election time of 3-4 seconds. Awesome! ...Except that entrypoint is hideous!
An alternative approach is to deploy a dummy sidecar container alongside the master nodes that just waits indefinitely. To keep the pod alive for a bit after Elasticsearch stops, we add a preStop handler that sleeps and then terminates the indefinitely waiting process. Killing the process manually this way is important since Kubernetes sends the SIGTERM to the entrypoint running as PID 1, and a PID 1 process ignores signals for which it has not installed a handler.
containers:
- name: just-chillin
  image: alpine
  command: ["sh", "-c"]
  args: ["tail -f /dev/null & wait"]
  lifecycle:
    preStop:
      exec:
        command: ["sh", "-c", "sleep 5; pkill -TERM tail"]
- name: elasticsearch-master
  image: docker.elastic.co/elasticsearch/elasticsearch:6.6.0
  ...
This allows for the Elasticsearch containers to terminate before the sidecar container can, as documented in the pod termination flow. For fun, we can also avoid the lifecycle hook altogether and trap the SIGTERM like we did above:
containers:
- name: just-chillin
  image: alpine
  command: ["sh", "-c"]
  args: ["trap f TERM; f(){ pkill -TERM tail; sleep 5; }; tail -f /dev/null & wait"]
- name: elasticsearch-master
  image: docker.elastic.co/elasticsearch/elasticsearch:6.6.0
  ...
Note that in this case, the order of the sleep and the pkill isn't as important as in the workaround before it.
Apart from these workarounds, another solution might come from the currently open issue kubernetes/kubernetes#28969. If I'm understanding the sticky IPs proposal correctly, pod IPs in a StatefulSet would remain the same for each replica. In our case, the eligible masters should then see no connection timeouts to the old master, since the old master's pod would come back with the same IP address.
I haven't tried to reproduce, but it's possible that lowering discovery.zen.fd.ping_timeout (default is 30s) could help with this.
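For reference, lowering that setting would be a one-line elasticsearch.yml change (untested sketch; the 5s value is illustrative, and note the later comments in this thread advise against it because it makes the cluster more sensitive to long GC pauses):

```yaml
# elasticsearch.yml — illustrative value; trades GC tolerance for faster failure detection
discovery.zen.fd.ping_timeout: 5s
```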
If not, we should look into what's causing this problem and pair up with the Elasticsearch team to find a solution.
I spent a very short time googling for similar issues and found this forum post where Docker/docker-compose appears to exhibit similar behavior: the network interface is destroyed and the next pings wait for 30 seconds.
Firstly, thank you so much for writing up such a detailed, thorough issue!
The fact that things work properly when you directly kill it is the most interesting part here. It is very possible that kubectl delete pod does not respect the terminationGracePeriod that is set by the helm chart.
A couple of questions:
1. Can you reproduce this same issue when doing a rolling upgrade? As in letting Kubernetes restart the containers instead of a kubectl delete pod? Just setting a dummy environment variable would be a good way to trigger this for testing.
2. How long does it take for kubectl exec multi-master-2 -- kill -SIGTERM 1 to exit? The default terminationGracePeriod has been overridden to be 120 seconds in the Helm chart. Running the kill directly from inside the container would not be subject to this timeout, so if it is taking longer to kill, that could help explain why it is more graceful.
@Crazybus Rolling updates seem to have the same issue. After setting an extra env var and deploying, my requests still hang when the rolling update gets to the elected master pod.
Killing the process directly takes less than a second. After doing this, kubectl get pods shows the targeted pod was restarted once, which is why there's no outage (the pod was never deleted, so the other eligible masters were still able to ping it).
I don't think the terminationGracePeriod would affect anything. As I understand it, upon pod deletion, if the processes in a container don't gracefully terminate before the grace period, Kubernetes just sends them a SIGKILL and deletes the pod. In this case, Elasticsearch does terminate gracefully; the problem is other nodes trying to ping the old master's now-nonexistent IP address.
I think the forum post @jordansissel linked is identical to this issue. For convenience, here's David's response:
docker stop is too aggressive here - it tears down the node and its network interface before a new node is elected as master. The new node then tries to contact the old master, but because the old master has completely vanished, the connection attempt gets no response until it times out. This is much more common within Docker than without: outside Docker the old master actively and immediately rejects the new master's connection attempt because its network interface is typically still alive.
In both environments (Docker and Kubernetes), the old master's network is gone before re-election can finish.
I tried lowering discovery.zen.fd.ping_timeout and transport.tcp.connect_timeout in different combinations, and it seemed to help a bit, but not reliably. In the best case, elected-master pod deletion only caused an outage of around 15 seconds, but a 30-second outage still happened often.
Looking carefully through the previous Helm chart issue again, I found @kimxogus linked to an issue he opened, elastic/elasticsearch#36822, with the extra suggestion of tweaking the net.ipv4.tcp_retries2 kernel parameter. Lowering this value through privileged init containers on GKE v1.11.7 nodes didn't seem to make a difference for me though. That issue was closed in favor of elastic/elasticsearch#29025, which may be a more permanent solution in Elasticsearch itself, by not reconnecting to the old unresponsive master during re-election.
So far the only reliable workaround for me seems to be keeping the pod alive for a few extra seconds after Elasticsearch terminates so the other eligible masters can send their final ping.
@andreykaipov thanks again for all of the investigation and information you are adding! This is super useful and I feel like I now understand what is going on.
From an Elasticsearch point of view, things are actually working as expected. The master disappears, and the cluster waits for the 30-second ping timeout. Elasticsearch is configured by default to wait 30 seconds for a ping to time out, which is a different situation from being able to connect to the host but getting no response.
This issue is not unique to this helm chart or to Kubernetes; it will apply to any immutable-infrastructure setup where the active master (or at least its IP address) is deleted directly after stopping it. Even with a workaround in place, there is still going to be around 3 seconds of write downtime while a new master is being elected.
The ideal "perfect world" fixes to this problem:
- A SIGTERM to an active master waits for a new master to be re-elected before the Elasticsearch process exits.
- Adding a master step-down API so that a preStop hook could wait for the master to step down before stopping it.
I'm going to sync with the Elasticsearch team to see how feasible they would be. I'm also making a note to test this in Elasticsearch 7 because it now uses a different discovery method which may or may not be affected by this.
Out of all of the workarounds you suggested, I think the easiest to maintain is going to be the dummy sidecar container. Instead of doing a 5-second sleep, it could actually wait for the pod to no longer be the master when shutting down. Or even better, it could wait until a new master has been elected before allowing the pod to be deleted.
Note: None of the below is tested, just an idea of how to solve this without relying on a hardcoded sleep time.
containers:
- name: just-chillin
  lifecycle:
    preStop:
      exec:
        command:
        - sh
        - -c
        - |
          #!/usr/bin/env bash
          set -eo pipefail
          http () {
            local path="${1}"
            if [ -n "${ELASTIC_USERNAME}" ] && [ -n "${ELASTIC_PASSWORD}" ]; then
              BASIC_AUTH="-u ${ELASTIC_USERNAME}:${ELASTIC_PASSWORD}"
            else
              BASIC_AUTH=''
            fi
            curl -XGET -s -k --fail ${BASIC_AUTH} {{ .Values.protocol }}://{{ .Values.masterService }}:{{ .Values.httpPort }}${path}
          }
          until http "/_cat/master" | grep -v $(hostname) ; do
            echo "This node is still master, waiting gracefully for it to step down"
            sleep 1
          done
- name: elasticsearch-master
  image: docker.elastic.co/elasticsearch/elasticsearch:6.6.0
Indeed, we've had a few failed attempts to fix elastic/elasticsearch#29025, and this very thread prompted us to look at it again. The fix is elastic/elasticsearch#39629, which has been backported to the 7.x branch. This means that it won't be in 7.0 or earlier versions, but it is expected to be in 7.1.
I think this is the same problem as helm/charts#8785.
This has been merged and released. Thanks everyone for all of the help investigating and with contributing the fix!