Comments (12)
Here is what worked for me, based on the above suggestion. I mounted a script in the container that I run instead of the Docker entrypoint, shown below. It still takes 30s to time out the old master, but the new master appeared to be operational within about 4 seconds of the old master shutting down.
Note: I had already made another fix that seems conceptually necessary: adding a pre-shutdown hook on the master to delay the master's termination a bit if there is not a quorum + 1 of master nodes. On a k8s rolling upgrade, a restarted master node needs to be "ready" as soon as it has opened port 9200, i.e. before it has been added to the cluster. That allows the rolling upgrade to terminate the existing master before the new master-eligible node has fully joined, so the master election might not have a quorum.
if [[ -z $NODE_MASTER || "$NODE_MASTER" = "true" ]]; then
  # Run ES as a background task, forward SIGTERM to it, then wait for it to exit
  trap 'kill $(jobs -p)' SIGTERM
  /usr/local/bin/docker-entrypoint.sh elasticsearch &
  wait
  # Now keep the pod alive for 30s after ES dies so that we will refuse connections
  # from the new master rather than them needing to time out
  sleep 30
else
  exec /usr/local/bin/docker-entrypoint.sh elasticsearch
fi
from helm-charts.
I'm going to sync with the Elasticsearch team to see how feasible they would be. I'm also making a note to test this in Elasticsearch 7 because it now uses a different discovery method which may or may not be affected by this.
It is still affected by this. The problem is caused by the new master attempting to get all the nodes in the cluster to reconnect to the old master as part of the process of winning its first election, and waiting for this to time out before proceeding is the problem described in elastic/elasticsearch#29025. There's a related discussion here. This is not really affected by the changes to how discovery works in 7.0.
In 6.x the best solution we have is to reduce the connection timeout. If your cluster is not talking to remote systems then the connect timeout can reasonably be very short (e.g. 1s). Although reducing net.ipv4.tcp_retries2 will not directly help here, it's also a good idea. Reducing discovery.zen.fd.ping_timeout is not a great idea because it makes the cluster more sensitive to a long GC, and if the cluster is under memory pressure then removing some of its nodes could well be counterproductive. A lower net.ipv4.tcp_retries2 value, by contrast, allows us to be more sensitive to a network outage without also being sensitive to GC pauses.
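As an elasticsearch.yml fragment, that 6.x connect-timeout advice might look roughly like this (untested sketch; the 1s value is the illustrative figure from the suggestion above, and assumes the cluster has no slow remote connections):

```yaml
# elasticsearch.yml — illustrative value; only safe if all peers are on a fast local network
transport.tcp.connect_timeout: 1s
```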
In 7.x the same advice is basically true (at time of writing) but there is another option too: if you want to shut the master down then you can trigger an election first by excluding the current master from the voting configuration.
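In practice, the 7.x voting-exclusions approach might look like the sketch below (untested; the host, the node name multi-master-0, and the helper function are placeholders of mine, and the path-parameter form of the API is the 7.0-era shape):

```shell
# Sketch: trigger an election before shutting down the current master,
# by excluding it from the voting configuration first.
ES_HOST="${ES_HOST:-localhost:9200}"

# Build the voting-exclusions API path for a node (helper for illustration only)
exclusions_url() {
  echo "http://${ES_HOST}/_cluster/voting_config_exclusions/${1}"
}

# Against a live cluster you would run (not executed here):
#   curl -X POST "$(exclusions_url multi-master-0)?timeout=30s"   # master steps down, election runs
#   ...restart the node...
#   curl -X DELETE "http://${ES_HOST}/_cluster/voting_config_exclusions"  # clear the exclusion list
```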
I really appreciate all the attention this issue has received!
The workaround I decided to go with was wrapping the base Elasticsearch image with an entrypoint that traps some common exit signals, and allows the execution of a handler after Elasticsearch stops (see https://github.com/qoqodev/elasticsearch-hooks-docker). In this case, the "post-stop" handler would involve just sleeping, or waiting until a new master has been elected.
However, it looks like @DaveCTurner recently closed out the upstream issue elastic/elasticsearch#29025 with PR elastic/elasticsearch#31547 that should fix the slow re-election, so no workarounds should be necessary! Whenever those changes make it into an Elasticsearch release, whether it's 6.x or 7.x, I'll be glad to test it out! 😄
This has been merged into master but not yet released. I'm leaving this open until it is released and others have confirmed that this solution resolves the issue properly.
So it doesn't look like I'm alone! I found issue helm/charts#8785 that looks identical to this issue. For that issue, PR helm/charts#10687 was proposed and also submitted to this repo in PR #41. Unfortunately, I didn't have success when deploying the multi-node example with those PRs.
Here's a summary of the problem:
1. kubectl delete pod multi-master-0 sends a SIGTERM to the pod's container.
2. Elasticsearch shuts down gracefully, broadcasting that info to the other masters, and the pod is removed.
3. Re-election starts, and the other eligible masters try to ping the old master for confirmation (at least I'm guessing that's the motivation behind the ping?). However, the old master's pod is already gone and the associated IP address doesn't exist anymore. So, rather than getting a refused connection, the eligible masters just outright cannot connect. The timeout for this attempt is 30 seconds, and once it times out, re-election continues.
Whereas issue helm/charts#8785 talks about a total cluster outage where even reads are not possible, I'm thankfully not seeing that. For example, calls to /_search?q=* still work fine while the eligible masters are stuck in step (3), so that's good. The outage I'm experiencing is for writes and the /_cat APIs, but maybe there are more unresponsive endpoints.
The easiest solution to the connection timeout would be to just keep the pod running for a while after shutting down Elasticsearch. I tried to do this with a preStop lifecycle handler by manually sending a SIGTERM to the Java process and then just sleeping. However, since the container automatically dies when its main process (the entrypoint) dies, the sleep afterwards never happens.
So, how can we actually run code after Elasticsearch stops? We can modify the container's entrypoint to send Elasticsearch to the background, trap the SIGTERM upon pod termination, forward it to Elasticsearch, and then sleep!
For example, inside of the container spec:
containers:
- name: elasticsearch-master
  image: docker.elastic.co/elasticsearch/elasticsearch:6.6.0
  command: ["bash", "-c"]
  args: ["trap 'pkill -TERM java; sleep 5' TERM; docker-entrypoint.sh eswrapper & wait"]
Deploying our master nodes like this allows the pod to sit in a terminating state for 5 seconds after Elasticsearch stops, so that other masters can properly get a refused connection rather than timing out. As a result, writes and the /_cat APIs hang for no more than the expected re-election time of 3-4 seconds. Awesome! ...Except that entrypoint is hideous!
An alternative approach is to deploy a dummy sidecar container alongside the master nodes that just waits indefinitely. To keep the pod alive for a bit after Elasticsearch stops, we add a preStop handler that sleeps and then terminates the indefinitely waiting process. Killing the process manually this way is important since Kubernetes sends the SIGTERM to the entrypoint running as PID 1, and a PID 1 process ignores signals for which it has not installed a handler.
containers:
- name: just-chillin
  image: alpine
  command: ["sh", "-c"]
  args: ["tail -f /dev/null & wait"]
  lifecycle:
    preStop:
      exec:
        command: ["sh", "-c", "sleep 5; pkill -TERM tail"]
- name: elasticsearch-master
  image: docker.elastic.co/elasticsearch/elasticsearch:6.6.0
  ...
This allows for the Elasticsearch containers to terminate before the sidecar container can, as documented in the pod termination flow. For fun, we can also avoid the lifecycle hook altogether and trap the SIGTERM like we did above:
containers:
- name: just-chillin
  image: alpine
  command: ["sh", "-c"]
  args: ["trap f TERM; f(){ pkill -TERM tail; sleep 5; }; tail -f /dev/null & wait"]
- name: elasticsearch-master
  image: docker.elastic.co/elasticsearch/elasticsearch:6.6.0
  ...
Note that in this case, the order of the sleep and the pkill isn't as important as in the workaround before it.
Apart from these workarounds, another solution might come from the currently open issue kubernetes/kubernetes#28969. If I'm understanding the sticky IPs proposal correctly, pod IPs in a StatefulSet would remain the same for each replica. In our case, the eligible masters should then see no connection timeouts to the old master, since the old master's pod would come back with the same IP address.
I haven't tried to reproduce, but it's possible that lowering discovery.zen.fd.ping_timeout (default is 30s) could help with this.
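For reference, lowering that setting would be a one-line elasticsearch.yml change (untested sketch; the 5s value is illustrative, and note the later comments in this thread advise against it because it makes the cluster more sensitive to long GC pauses):

```yaml
# elasticsearch.yml — illustrative value; trades GC tolerance for faster failure detection
discovery.zen.fd.ping_timeout: 5s
```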
If not, we should look into what's causing this problem and pair up with the Elasticsearch team to find a solution.
I spent a very short time googling for similar issues and found this forum post where Docker/docker-compose appears to exhibit similar behavior: the network interface is destroyed and the next pings wait for 30 seconds.
Firstly, thank you so much for writing up such a detailed, thorough issue!
The fact that things work properly when you directly kill it is the most interesting part here. It is very possible that kubectl delete pod does not respect the terminationGracePeriod that is set by the helm chart.
A couple of questions:
1. Can you reproduce this same issue when doing a rolling upgrade? As in letting Kubernetes restart the containers instead of a kubectl delete pod? Just setting a dummy environment variable would be a good way to trigger this for testing.
2. How long does it take for kubectl exec multi-master-2 -- kill -SIGTERM 1 to exit? The default terminationGracePeriod has been overridden to be 120 seconds in the Helm chart. Running the kill directly from inside the container would not be subject to this timeout, so if it is taking longer to kill, that could help explain why it is more graceful.
@Crazybus Rolling updates seem to have the same issue. After setting an extra env var and deploying, my requests still hang when the rolling update gets to the elected master pod.
Killing the process directly takes less than a second. After doing this, kubectl get pods shows the targeted pod was restarted once, which is why there's no outage (the pod was never deleted, so the other eligible masters were still able to ping it).
I don't think the terminationGracePeriod would affect anything. As I understand it, upon pod deletion, if the processes in a container don't gracefully terminate before the grace period, Kubernetes just sends them a SIGKILL and deletes the pod. In this case, Elasticsearch does terminate gracefully; the problem is other nodes trying to ping the old master's now-nonexistent IP address.
I think the forum post @jordansissel linked is identical to this issue. For convenience, here's David's response:
docker stop is too aggressive here - it tears down the node and its network interface before a new node is elected as master. The new node then tries to contact the old master, but because the old master has completely vanished, the connection attempt gets no response until it times out. This is much more common within Docker than without: outside Docker the old master actively and immediately rejects the new master's connection attempt because its network interface is typically still alive.
In both environments (Docker and Kubernetes), the old master's network is gone before re-election can finish.
I tried lowering discovery.zen.fd.ping_timeout and transport.tcp.connect_timeout in different combinations, and it seemed to help a bit, but not reliably. In the best case, elected-master pod deletion only caused an outage of around 15 seconds, but a 30-second outage still happened often.
Looking carefully through the previous Helm chart issue again, I found @kimxogus linked to an issue he opened, elastic/elasticsearch#36822, with the extra suggestion of tweaking the net.ipv4.tcp_retries2 kernel parameter. Lowering this value through privileged init containers on GKE v1.11.7 nodes didn't seem to make a difference for me though. That issue was closed in favor of elastic/elasticsearch#29025, which may be a more permanent solution in Elasticsearch itself, by not reconnecting to the old unresponsive master during re-election.
So far the only reliable workaround for me seems to be keeping the pod alive for a few extra seconds after Elasticsearch terminates so the other eligible masters can send their final ping.
@andreykaipov thanks again for all of the investigation and information you are adding! This is super useful and I feel like I now understand what is going on.
From an Elasticsearch point of view, things are actually working as expected. The master disappears, and the cluster waits for the 30-second ping timeout. Elasticsearch is configured by default to wait 30 seconds for a ping to time out, which is a different situation from being able to connect to the host but getting no response.
This issue is not unique to this helm chart or to Kubernetes; it will apply to any immutable-infrastructure setup where the active master (or at least its IP address) is deleted directly after stopping it. Even with a workaround in place, there is still going to be around 3 seconds of write downtime while a new master is being elected.
The ideal "perfect world" fixes to this problem:
- A SIGTERM to an active master waits for a new master to be re-elected before the Elasticsearch process exits.
- Adding a master step-down API so that a preStop hook could wait for the master to step down before stopping it.
I'm going to sync with the Elasticsearch team to see how feasible they would be. I'm also making a note to test this in Elasticsearch 7 because it now uses a different discovery method which may or may not be affected by this.
Out of all of the workarounds you suggested, I think the easiest to maintain is going to be the dummy sidecar container. Instead of doing a 5-second sleep, it could actually wait for the pod to no longer be the master when shutting down. Or even better, it could wait until a new master has been elected before allowing the pod to be deleted.
Note: None of the below is tested, just an idea of how to solve this without relying on a hardcoded sleep time.
containers:
- name: just-chillin
  lifecycle:
    preStop:
      exec:
        command:
        - sh
        - -c
        - |
          #!/usr/bin/env bash
          set -eo pipefail
          http () {
            local path="${1}"
            if [ -n "${ELASTIC_USERNAME}" ] && [ -n "${ELASTIC_PASSWORD}" ]; then
              BASIC_AUTH="-u ${ELASTIC_USERNAME}:${ELASTIC_PASSWORD}"
            else
              BASIC_AUTH=''
            fi
            curl -XGET -s -k --fail ${BASIC_AUTH} {{ .Values.protocol }}://{{ .Values.masterService }}:{{ .Values.httpPort }}${path}
          }
          until http "/_cat/master" | grep -v $(hostname) ; do
            echo "This node is still master, waiting gracefully for it to step down"
            sleep 1
          done
- name: elasticsearch-master
  image: docker.elastic.co/elasticsearch/elasticsearch:6.6.0
Indeed, we've had a few failed attempts to fix elastic/elasticsearch#29025, and this very thread prompted us to look at it again. The fix is elastic/elasticsearch#39629, which has been backported to the 7.x branch. This means that it won't be in 7.0 or earlier versions, but it is expected to be in 7.1.
I think this is the same problem as helm/charts#8785.
This has been merged and released. Thanks everyone for all of the help investigating and with contributing the fix!