Comments (34)

pires commented on July 3, 2024

Can't reproduce!

I have a DC/OS 1.11 cluster with 3 private agents and 1 public agent. I installed it using this quickstart:

make gcp

### change .deploy/desired_cluster_profile accordingly

make deploy

Attention: I waited until the script tried to install the package and then interrupted it:

### (...DC/OS Open token stuff...)

/usr/local/bin/dcos package install --yes kubernetes --options=./.deploy/options.json
^CUser interrupted command with Ctrl-C

I then created an options file:

$ cat x.json
{
  "kubernetes": {
    "node_count": 3,
    "public_node_count": 1
  }
}

And proceeded to install the package with said options file:

$ dcos package install --yes kubernetes --options x.json
By Deploying, you agree to the Terms and Conditions https://mesosphere.com/catalog-terms-conditions/#certified-services
Kubernetes on DC/OS.

	Documentation: https://docs.mesosphere.com/service-docs/kubernetes
	Issues: https://github.com/mesosphere/dcos-kubernetes-quickstart/issues
Installing Marathon app for package [kubernetes] version [1.0.1-1.9.4]
Installing CLI subcommand for package [kubernetes] version [1.0.1-1.9.4]
New command available: dcos kubernetes
DC/OS Kubernetes is being installed!

Now, tests:

$ dcos kubernetes kubeconfig
kubeconfig context 'pires-tf52f7' created successfully
$ kubectl get nodes
NAME                                          STATUS    ROLES     AGE       VERSION
kube-node-0-kubelet.kubernetes.mesos          Ready     <none>    3m        v1.9.4
kube-node-1-kubelet.kubernetes.mesos          Ready     <none>    3m        v1.9.4
kube-node-2-kubelet.kubernetes.mesos          Ready     <none>    3m        v1.9.4
kube-node-public-0-kubelet.kubernetes.mesos   Ready     <none>    2m        v1.9.4
$ kubectl -n kube-system get pods
NAME                                    READY     STATUS    RESTARTS   AGE
kube-dns-754f9cd4f5-rl6sg               3/3       Running   0          1m
kube-dns-754f9cd4f5-zjk5n               3/3       Running   0          1m
kubernetes-dashboard-5cfddd7d5b-tg4km   1/1       Running   0          48s
metrics-server-54974fd587-4zcwd         1/1       Running   0          1m

We need to be able to reproduce this in order to investigate. @redpine42 am I missing something?

bmcustodio commented on July 3, 2024

None of the kube-dns pods could start successfully. Are you using the DC/OS overlay network? Also, you seem to have specified a custom service CIDR. It would be great if you could provide the exact steps you followed to set up this cluster.

redpine42 commented on July 3, 2024

I'm running against a 6-node bare-metal cluster (1 master, 2 public slaves, 3 private slaves). The custom CIDR was just a test. I'm running from the UI. It works when I start kubernetes from the default configuration. When I change the number of kubelets, I notice that two kube-dns pods start. The first usually succeeds; the second has failures and goes into a CrashLoopBackOff. Via kubectl get all, I've seen the metrics server and the dashboard fail and restart. Any time I try to access the dashboard, the dashboard pod crashes and restarts. I could access the Kubernetes dashboard when running with one kubelet.

redpine42 commented on July 3, 2024

Also, I've configured the cluster using the advanced install instructions. My configuration is really basic. Below are the contents of config.yaml. Other than that, there is nothing fancy in my setup.

cluster_name: Redpine
exhibitor_storage_backend: static
master_discovery: static
master_list:
- 192.168.80.11
resolvers:
- 192.168.80.1
- 8.8.8.8
security: disabled

bmcustodio commented on July 3, 2024

When I change the number of kubelets, I notice that two kube-dns pods start.

This is expected—we run a maximum of two kube-dns replicas, but only one if there is only one node.

The first usually succeeds

Even though the kube-dns pods are listed as Running, only two out of three containers in the pod are healthy. In the particular case of kube-dns, this usually indicates a problem with your overlay network—either not working at all, or some firewall rules blocking traffic.

The custom CIDR was just a test.

Did you also specify a custom network provider, or are you using the DC/OS overlay network?
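
In the meantime, it would help to see why the third container in the failing pod never becomes ready. Something along these lines should show it (a sketch; replace the pod name with the failing one from kubectl -n kube-system get pods — the containers in this kube-dns deployment are typically named kubedns, dnsmasq and sidecar):

kubectl -n kube-system describe pod <failing-kube-dns-pod>                # probe failures show up under Events
kubectl -n kube-system logs <failing-kube-dns-pod> -c kubedns             # logs of the kubedns container
kubectl -n kube-system logs <failing-kube-dns-pod> -c kubedns --previous  # logs of the previous, crashed instance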

redpine42 commented on July 3, 2024

I have no firewalls running in the cluster. I don't change anything in the startup except the number of kubelets. The CIDR change was just a test.

If it were a problem with a firewall or the overlay network, wouldn't a single kubelet also fail?

bmcustodio commented on July 3, 2024

It works when I start kubernetes from the default configuration.

I don't change anything in the startup except the number of kubelets.

The CIDR change was just a test.

Did you change the CIDR after installing with the default configuration (i.e., after having installed the package for the first time with the single-Kubelet, default configuration)? Changing the service CIDR after installing the package is not supported.

redpine42 commented on July 3, 2024

No. The CIDR has always been the same, except for the one test, which was after this problem had been going on for a while.

redpine42 commented on July 3, 2024

I finally got to the kube dashboard and it's reporting this:
Readiness probe failed: Get http://9.0.9.32:8081/readiness: dial tcp 9.0.9.32:8081: getsockopt: connection refused.

Is that just a readiness check on the container?
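
For reference, the probe definition and the exact failure events can be pulled with something like this (pod name taken from kubectl -n kube-system get pods):

kubectl -n kube-system describe pod <kubernetes-dashboard-pod>   # shows the readiness probe spec and the failure Events
kubectl -n kube-system logs <kubernetes-dashboard-pod>           # the dashboard's own log, in case the container itself is crashing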

redpine42 commented on July 3, 2024

Interestingly, if I start all three kubelets and select HA, both kube-dns pods come up.

Does kube-dns look for the apiserver on its own node? That would explain why most of the time I only get one kube-dns working when not running in HA, since non-HA starts only one apiserver across my three nodes. Also, sometimes no kube-dns starts, which would be when both kube-dns pods land on nodes without an apiserver.
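
One way to check the placement side of this hypothesis would be something like:

kubectl -n kube-system get pods -o wide    # the NODE column shows which kubelet each kube-dns replica landed on
dcos task | grep kube-apiserver            # shows which agent(s) the apiserver task(s) are running on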

@bmcstdio

pires commented on July 3, 2024

E0314 13:13:28.427776 1 reflector.go:201] k8s.io/dns/pkg/dns/dns.go:150: Failed to list *v1.Service: Get https://10.90.0.1:443/api/v1/services?resourceVersion=0: dial tcp 10.90.0.1:443: i/o timeout

This is not the default CIDR. And, clearly, kube-dns can't connect to the API in this case.

@bmcstdio out of curiosity, have you tried to start non-HA, scale to three nodes and scale kube-dns?
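
For reference, a quick way to cross-check which service CIDR a given install is actually using is to look at the cluster IP of the kubernetes service, which is the first address of that range:

kubectl get svc kubernetes    # CLUSTER-IP should be the first address of the configured service CIDR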

redpine42 commented on July 3, 2024

I am not using that CIDR. I am using the default. That was a one-time test out of about 100, so please ignore that I used that CIDR; I'm using the default in the other 99 tests.

pires commented on July 3, 2024

@redpine42 OK. So please provide us with a detailed way to reproduce this.

redpine42 commented on July 3, 2024

I did. Use the default settings, except initialize with 3 kubelet nodes.

@pires BTW, why is changing the CIDR an option if it won't work? Maybe some defensive programming is in order?

pires commented on July 3, 2024

Please, let's not make this more confusing. Let's try and reproduce this. So, please confirm the following steps:

  1. You have DC/OS 1.11.
  2. You installed dcos-kubernetes 1.0.1-1.9.4 (latest version) with 3 kube-nodes instead of the default 1 (the sketch right after this list shows the exact options I have in mind).
  3. kubectl -n kube-system get pods shows 2 kube-dns pods: one is running, the other is failing.
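
A minimal sketch of what I mean by step 2, assuming the same options-file format used earlier in this thread (the file name is arbitrary):

cat options.json
{
  "kubernetes": {
    "node_count": 3
  }
}

dcos package install --yes kubernetes --options options.json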

redpine42 commented on July 3, 2024

@pires Yes, DC/OS 1.11.
Yes, dcos-kubernetes 1.0.1-1.9.4 with 3 kube-nodes.
Yes, kubectl get all shows 2 kube-dns pods: one running, the other failing.
Also, running on CentOS 7 with 6 nodes: 1 master, 2 public and 3 private.

I'm starting up the failing setup. Any output you'd like?

NAME                              READY     STATUS    RESTARTS   AGE
kube-dns-754f9cd4f5-kgtcg         2/3       Running   4          2m
kube-dns-754f9cd4f5-tprcv         3/3       Running   0          2m
metrics-server-54974fd587-hj8s7   1/1       Running   0          2m

pires commented on July 3, 2024

I will try and reproduce this now.

bmcustodio commented on July 3, 2024

@bmcstdio out of curiosity, have you tried to start non-HA, scale to three nodes and scale kube-dns?

@pires yes, and everything was fine. Double-checked now, just in case, and everything is OK.

redpine42 commented on July 3, 2024

@pires Are you running AWS or bare metal? I did not set up a public node; the default is 0.

I'm launching from the DC/OS UI.

Will try from the CLI.

pires commented on July 3, 2024

@redpine42 make gcp sets up my infra on Google Cloud. Also, there's no need for a public node, so it doesn't matter.

Launching from the DC/OS UI or CLI is the same thing.

redpine42 commented on July 3, 2024

Yeah, that's what I thought. It's not quite the same test. What OS is GCP running? I know AWS is not quite CentOS 7.

Also, I upgraded from 1.10.4 to 1.11 and had a few problems with orphaned frameworks at first. Maybe the upgrade didn't work properly?

pires commented on July 3, 2024

Honestly, I can't tell. Is there any possibility for us to get access to this cluster and check it for ourselves?

redpine42 commented on July 3, 2024

Sure. Would you like VPN access?

One other thought: I've been testing Cilium and upgraded the kernel to 4.15.7-1.el7.elrepo.x86_64. Any known issues there?

pires commented on July 3, 2024

If I had access to this cluster, networking would be the first thing I'd investigate. I don't see any issues with the kernel; on the contrary, newer kernels are usually better.

If you can provide direct access, that would be great, as I could jump in for a quick look. If this is a private cluster and more complicated steps are necessary, I am very sorry, but I can't do it at this point.

redpine42 commented on July 3, 2024

@pires It's private, but it's my system and network. I do testing on it for a client running commercial DC/OS. I just opened SSH to the master. I can send you a private key.

pires commented on July 3, 2024

Ping me in the DC/OS community slack, #kubernetes channel.

redpine42 commented on July 3, 2024

One thing I noticed is that sometimes traceroute would work on the overlay network for a node and sometimes it wouldn't, the only intervening events being stopping kubernetes in DC/OS and starting it back up. So I ran the following test:

  • Reboot all private nodes.
  • Install kubernetes.
  • Uninstall kubernetes.
  • Reinstall kubernetes.
dcos package install --yes kubernetes --options kube.json          
kubectl get pods
NAME                                    READY     STATUS    RESTARTS   AGE
kube-dns-754f9cd4f5-r2q6b               3/3       Running   0          1m
kube-dns-754f9cd4f5-xn62t               3/3       Running   0          1m
kubernetes-dashboard-5cfddd7d5b-6kw4n   1/1       Running   0          32s
metrics-server-54974fd587-r76j2         1/1       Running   0          56s

dcos package uninstall kubernetes

dcos package install --yes kubernetes --options kube.json                                                                                                                                                            

kubectl get pods
NAME                                    READY     STATUS    RESTARTS   AGE
kube-dns-754f9cd4f5-s288c               1/3       Error     2          2m
kube-dns-754f9cd4f5-xhsjf               3/3       Running   0          2m
kubernetes-dashboard-5cfddd7d5b-7vxwk   1/1       Running   0          1m
metrics-server-54974fd587-k78tm         1/1       Running   2          1m

I've run this test a few times with the same results. After a fresh reboot of the private nodes, kubernetes comes up clean. After shutting kubernetes down and starting it again, usually one kube-dns fails and sometimes everything fails.

I've also rebooted the servers where I had pods fail (without shutting down kubernetes), and they came back up clean with no failures.

cat kube.json
{
  "kubernetes": {
    "node_count": 3
  }
}
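
In case it helps anyone trying to reproduce this, the loop above can be scripted roughly as follows (a sketch; the sleeps are arbitrary and the uninstall can take a few minutes to finish cleaning up):

#!/usr/bin/env bash
# Rough repro loop: install, check, uninstall, reinstall, check again.
set -euo pipefail

dcos package install --yes kubernetes --options kube.json
sleep 600                                  # wait for the first deployment to settle
kubectl -n kube-system get pods            # everything Running here

dcos package uninstall kubernetes
sleep 600                                  # give the framework time to tear everything down

dcos package install --yes kubernetes --options kube.json
sleep 600
dcos kubernetes kubeconfig                 # refresh the kubeconfig for the new install, if needed
kubectl -n kube-system get pods            # on the affected cluster, one kube-dns usually fails here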

redpine42 commented on July 3, 2024

Also, I am seeing this in the logs for each of the node proxies, e.g. kube-node-2-kube-proxy:

W0323 09:34:39.781117 10 proxier.go:468] clusterCIDR not specified, unable to distinguish between internal and external traffic

pires commented on July 3, 2024

That warning is unrelated and harmless.

pires commented on July 3, 2024

I am still unable to reproduce this.

redpine42 commented on July 3, 2024

Still digging. I can reach the 44.128.0.x gateways for some of the nodes, but not for others.

route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.80.1    0.0.0.0         UG    100    0        0 eth1
9.0.0.0         44.128.0.1      255.255.255.0   UG    0      0        0 vtep1024
9.0.1.0         44.128.0.2      255.255.255.0   UG    0      0        0 vtep1024
9.0.2.0         44.128.0.3      255.255.255.0   UG    0      0        0 vtep1024
9.0.3.0         44.128.0.4      255.255.255.0   UG    0      0        0 vtep1024
9.0.4.0         44.128.0.5      255.255.255.0   UG    0      0        0 vtep1024
9.0.5.0         44.128.0.6      255.255.255.0   UG    0      0        0 vtep1024
9.0.5.128       0.0.0.0         255.255.255.128 U     0      0        0 d-dcos
9.0.6.0         44.128.0.7      255.255.255.0   UG    0      0        0 vtep1024
44.128.0.0      0.0.0.0         255.255.240.0   U     0      0        0 vtep1024
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
172.18.0.0      0.0.0.0         255.255.0.0     U     0      0        0 d-dcos6
192.168.80.0    0.0.0.0         255.255.255.0   U     0      0        0 eth1
192.168.80.0    0.0.0.0         255.255.255.0   U     100    0        0 eth1
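
A quick way to see which of those VTEP gateways are reachable from a given node (a sketch based on the routing table above):

# run on each private agent
for gw in 44.128.0.1 44.128.0.2 44.128.0.3 44.128.0.4 44.128.0.5 44.128.0.6 44.128.0.7; do
  ping -c 1 -W 1 "$gw" > /dev/null && echo "$gw reachable" || echo "$gw UNREACHABLE"
done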

redpine42 commented on July 3, 2024

Another error message on the node, relevant to the DC/OS overlay:

Mar 23 17:54:00 tau.redpine.com dcos-net-env[2508]: 17:54:00.950 [error] Supervisor dcos_overlay_sup had child dcos_overlay_lashup_kv_listener started with dcos_overlay_lashup_kv_listener:start_link() at <0.8663.12> exit with reason no match of right hand value {error,3942645759,[]} in dcos_overlay_configure:configure_overlay_entry/4 line 112 in context child_terminated
Mar 23 17:54:06 tau.redpine.com dcos-net-env[2508]: 17:54:06.229 [error] Error in process <0.8797.12> on node '[email protected]' with exit value:
Mar 23 17:54:06 tau.redpine.com dcos-net-env[2508]: {{badmatch,{error,3942645759,[]}},[{dcos_overlay_configure,configure_overlay_entry,4,[{file,"/pkg/src/dcos-net/_build/prod/lib/dcos_overlay/src/dcos_overlay_configure.erl"},{line,112}]},{lists,map,2,[{file,"lists.erl"},{line,1239}]},{lists,map,2,[{file,"lists.erl"},{line,1239}]},{dcos_overlay_configure,maybe_configure,2,[{file,"/pkg/src/dcos-net/_build/prod/lib/dcos_overlay/src/dcos_overlay_configure.erl"},{line,38}]}]}
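
For reference, these entries can be pulled straight from the agent's journal with something like:

sudo journalctl -u dcos-net --since "1 hour ago" | grep -i overlay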

redpine42 commented on July 3, 2024

Based upon the above error, I followed some advice on how to correct it. I ran the following steps on all my nodes and my problems have gone away:

sudo systemctl stop dcos-net
sudo rm -rf /var/lib/dcos/navstar/
sudo systemctl start dcos-net

This fixed both the pod startup issues on restarts of just kubernetes in DC/OS and access to the dashboard from kubernetes-proxy.
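
After restarting dcos-net on every agent, recovery can be verified roughly like this:

route -n | grep vtep1024           # the per-agent 9.0.x.0/24 overlay routes should be back
kubectl -n kube-system get pods    # kube-dns pods should reach 3/3 Running and stay there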

pires commented on July 3, 2024

This is great feedback, @redpine42! Thank you very much. @shaneutt, maybe you are interested in adding this to your troubleshooting resources?
