Comments (34)
Can't reproduce!
I have a DC/OS 1.11 cluster with 3 private agents and 1 public agent. I installed it using this quickstart:
make gcp
### change .deploy/desired_cluster_profile accordingly
make deploy
Attention: I waited until the script tried to install the package, then interrupted it:
### (...DC/OS Open token stuff...)
/usr/local/bin/dcos package install --yes kubernetes --options=./.deploy/options.json
^CUser interrupted command with Ctrl-C
I then created an options file:
$ cat x.json
{
"kubernetes": {
"node_count": 3,
"public_node_count": 1
}
}
And proceeded to install the package with said options file:
$ dcos package install --yes kubernetes --options x.json
By Deploying, you agree to the Terms and Conditions https://mesosphere.com/catalog-terms-conditions/#certified-services
Kubernetes on DC/OS.
Documentation: https://docs.mesosphere.com/service-docs/kubernetes
Issues: https://github.com/mesosphere/dcos-kubernetes-quickstart/issues
Installing Marathon app for package [kubernetes] version [1.0.1-1.9.4]
Installing CLI subcommand for package [kubernetes] version [1.0.1-1.9.4]
New command available: dcos kubernetes
DC/OS Kubernetes is being installed!
Now, tests:
$ dcos kubernetes kubeconfig
kubeconfig context 'pires-tf52f7' created successfully
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
kube-node-0-kubelet.kubernetes.mesos Ready <none> 3m v1.9.4
kube-node-1-kubelet.kubernetes.mesos Ready <none> 3m v1.9.4
kube-node-2-kubelet.kubernetes.mesos Ready <none> 3m v1.9.4
kube-node-public-0-kubelet.kubernetes.mesos Ready <none> 2m v1.9.4
$ kubectl -n kube-system get pods
NAME READY STATUS RESTARTS AGE
kube-dns-754f9cd4f5-rl6sg 3/3 Running 0 1m
kube-dns-754f9cd4f5-zjk5n 3/3 Running 0 1m
kubernetes-dashboard-5cfddd7d5b-tg4km 1/1 Running 0 48s
metrics-server-54974fd587-4zcwd 1/1 Running 0 1m
We need to be able to reproduce this in order to investigate. @redpine42 am I missing something?
from dcos-kubernetes-quickstart.
None of the kube-dns pods could start successfully. Are you using the DC/OS overlay network? Also, you seem to have specified a custom service CIDR. It would be great if you could provide the exact steps you followed to set up this cluster.
I'm running against a 6-node bare-metal cluster (1 master, 2 public slaves, 3 private slaves). The custom CIDR was just a test. I'm running from the UI. It works when I start Kubernetes from the default configuration. When I change the number of kubelets, I notice that two kube-dns pods start. The first usually succeeds; the second has failures and goes into a CrashLoopBackOff. Via kubectl get all, I've seen the metrics server and dashboard fail and restart. Anytime I try to access the dashboard, the dashboard pod crashes and restarts. I could access the Kubernetes dashboard when running with one kubelet.
Also, I've configured the cluster using the advanced install instructions. My configuration is really basic. Below are the contents of config.yaml. Other than that, there is nothing fancy in my setup.
cluster_name: Redpine
exhibitor_storage_backend: static
master_discovery: static
master_list:
- 192.168.80.11
resolvers:
- 192.168.80.1
- 8.8.8.8
security: disabled
When I change the number of kublets, I notice that two kube-dns pods start.
This is expected: we run a maximum of two kube-dns replicas, but only one if there is only one node.
The first usually succeeds
Even though the kube-dns pods are listed as Running, only two out of three containers in the pod are healthy. In the particular case of kube-dns, this usually indicates a problem with your overlay network: either it's not working at all, or some firewall rules are blocking traffic.
The custom CIDR was just a test.
Did you also specify a custom network provider, or are you using the DC/OS overlay network?
I have no firewalls running in the cluster. I don't change anything in the startup except the number of kubelets. The CIDR change was just a test.
If it were a problem with a firewall or the overlay network, wouldn't a single kubelet also fail?
It works when I start kubernetes from the default configuration.
I don't change anything in the startup, but the number of kublets.
The CIDR change was just a test.
Did you change the CIDR after installing with the default configuration (i.e., after having installed the package for the first time with the single-Kubelet, default configuration)? Changing the service CIDR after installing the package is not supported.
No. The CIDR has always been the same except for the one test, which was after this problem had been going on for a while.
I finally got to the kube dashboard and it's reporting this:
Readiness probe failed: Get http://9.0.9.32:8081/readiness: dial tcp 9.0.9.32:8081: getsockopt: connection refused.
Is that just a readiness check on the container?
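One way to follow up on a message like that is to pull the address out of the probe error and test reachability from the node itself. A sketch, assuming the message format shown above; probe_addr is a hypothetical helper, not part of any DC/OS or Kubernetes tooling:

```shell
#!/bin/sh
# Sketch: extract the host:port from a readiness-probe failure message so
# reachability can be tested from the node running the pod.
probe_addr() {
  # $1: the probe failure message; prints e.g. "9.0.9.32:8081".
  printf '%s\n' "$1" | grep -o 'dial tcp [0-9.]*:[0-9]*' | awk '{ print $3 }'
}

# Usage from the node running the failing pod (curl call is illustrative):
# msg='Readiness probe failed: Get http://9.0.9.32:8081/readiness: dial tcp 9.0.9.32:8081: getsockopt: connection refused'
# curl -sf "http://$(probe_addr "$msg")/readiness" || echo "probe endpoint unreachable"
```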
Interestingly, if I start all three kubelets and select HA, both kube-dns pods come up.
Does kube-dns look for the apiserver on its own node? That would explain why most of the time I only get one kube-dns pod working when not running in HA, since non-HA starts one apiserver across my three nodes. Also, sometimes no kube-dns pod starts, which would be when both kube-dns pods land on nodes without an apiserver.
@bmcstdio
E0314 13:13:28.427776 1 reflector.go:201] k8s.io/dns/pkg/dns/dns.go:150: Failed to list *v1.Service: Get https://10.90.0.1:443/api/v1/services?resourceVersion=0: dial tcp 10.90.0.1:443: i/o timeout
This is not the default CIDR. And, clearly, kube-dns can't connect to the API in this case.
@bmcstdio out of curiosity, have you tried to start non-HA, scale to three nodes and scale kube-dns?
I am not using that CIDR. I am using the default. That was a one time test out of about 100. So please ignore that I used that CIDR. I'm using the default for the other 99 tests.
@redpine42 ok. so, please, provide us with a detailed way to reproduce this.
I did. Use the default settings, except initialize with 3 kubelet nodes.
@pires BTW, why is changing the CIDR an option if it won't work? Maybe some defensive programming is in order?
Please, let's not make this more confusing. Let's try and reproduce this. So, please confirm the following steps:
- you have DC/OS 1.11.
- install dcos-kubernetes 1.0.1-1.9.4 (latest version) with 3 kube-nodes instead of the default 1.
- kubectl -n kube-system get pods shows 2 kube-dns pods: one is running, the other is failing.
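The steps above can be sketched as commands; the options filename and the write_options helper are illustrative, not part of the quickstart tooling:

```shell
#!/bin/sh
# Sketch of the reproduction steps: install dcos-kubernetes with 3 kube-nodes
# instead of the default 1, then check the kube-system pods.
write_options() {
  # Writes a package options file requesting 3 kube-nodes.
  cat > "$1" <<'EOF'
{
  "kubernetes": {
    "node_count": 3
  }
}
EOF
}

# Usage (requires a configured dcos CLI and kubectl):
# write_options options.json
# dcos package install --yes kubernetes --options=options.json
# dcos kubernetes kubeconfig
# kubectl -n kube-system get pods
```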
@pires Yes DC/OS 1.11
Yes, dcos-kubernetes 1.0.1-1.9.4 with 3 kube-nodes.
kubectl get all shows 2 kube-dns pods: one running, the other failing?
Also, running on CentOS 7 with 6 nodes: 1 master, 2 public, and 3 private.
I'm starting up the failing setup. Any output you'd like?
NAME READY STATUS RESTARTS AGE
kube-dns-754f9cd4f5-kgtcg 2/3 Running 4 2m
kube-dns-754f9cd4f5-tprcv 3/3 Running 0 2m
metrics-server-54974fd587-hj8s7 1/1 Running 0 2m
I will try and reproduce this now.
@bmcstdio out of curiosity, have you tried to start non-HA, scale to three nodes and scale kube-dns?
@pires yes, and everything was fine. Double-checked now, just in case, and everything is OK.
@pires You running AWS or bare metal? I did not setup a public node. The default is 0.
I'm launching from the DC/OS UI.
Will try from the cli.
@redpine42 make gcp sets up my infra on Google Cloud. Also, there's no need for a public node, so it doesn't matter.
Launching from the DC/OS UI or CLI is the same thing.
Ya. That's what I thought. It's not quite the same test. What OS is GCP running? I know AWS is not quite CentOS 7.
Also, I upgraded from 1.10.4 to 1.11. Had a few problems with orphaned frameworks at first. Maybe the upgrade doesn't work properly?
Honestly, I can't tell. Is there any possibility for us to get access to this cluster and check it for ourselves?
Sure. Would you like vpn access?
One other thought: I've been testing Cilium and upgraded the kernel to 4.15.7-1.el7.elrepo.x86_64. Any known issues there?
If I had access to this cluster, networking would be the first thing I'd investigate. I don't see any issues with the kernel; on the contrary, newer kernels are usually better.
If you can provide direct access, that would be great, as I could jump in for a quick look. If this is a private cluster and more complicated steps are necessary, I am very sorry, but I can't do it at this point.
@pires It's private, but it's my system and network. I do testing on it for a client running commercial DC/OS. I just opened ssh to the master. I can send you a private key.
Ping me in the DC/OS community slack, #kubernetes channel.
One thing I noticed is that sometimes traceroute would work on the overlay network for a node and sometimes it wouldn't, the only intervening events being stopping Kubernetes in DC/OS and starting it back up. So I ran the following test:
- Reboot all private nodes.
- Install kubernetes.
- Uninstall kubernetes.
- Reinstall kubernetes.
dcos package install --yes kubernetes --options kube.json
kubectl get pods
NAME READY STATUS RESTARTS AGE
kube-dns-754f9cd4f5-r2q6b 3/3 Running 0 1m
kube-dns-754f9cd4f5-xn62t 3/3 Running 0 1m
kubernetes-dashboard-5cfddd7d5b-6kw4n 1/1 Running 0 32s
metrics-server-54974fd587-r76j2 1/1 Running 0 56s
dcos package uninstall kubernetes
dcos package install --yes kubernetes --options kube.json
kubectl get pods
NAME READY STATUS RESTARTS AGE
kube-dns-754f9cd4f5-s288c 1/3 Error 2 2m
kube-dns-754f9cd4f5-xhsjf 3/3 Running 0 2m
kubernetes-dashboard-5cfddd7d5b-7vxwk 1/1 Running 0 1m
metrics-server-54974fd587-k78tm 1/1 Running 2 1m
I've run this test a few times with the same results. After a fresh reboot of the private nodes, Kubernetes comes up clean. After shutting down Kubernetes and then starting it again, usually one kube-dns pod fails, and sometimes everything fails.
I've also rebooted the servers (without shutting down Kubernetes) where I had pods fail, and they come back up clean with no failures.
cat kube.json
{
"kubernetes": {
"node_count": 3
}
}
Also, I am seeing this in the logs for each of the node proxies, i.e., kube-node-2-kube-proxy:
W0323 09:34:39.781117 10 proxier.go:468] clusterCIDR not specified, unable to distinguish between internal and external traffic
That warning is unrelated and harmless.
I am still unable to reproduce this.
Still digging. From some of the nodes I can reach the 44.128.0.x gateways, and from some I can't.
route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 192.168.80.1 0.0.0.0 UG 100 0 0 eth1
9.0.0.0 44.128.0.1 255.255.255.0 UG 0 0 0 vtep1024
9.0.1.0 44.128.0.2 255.255.255.0 UG 0 0 0 vtep1024
9.0.2.0 44.128.0.3 255.255.255.0 UG 0 0 0 vtep1024
9.0.3.0 44.128.0.4 255.255.255.0 UG 0 0 0 vtep1024
9.0.4.0 44.128.0.5 255.255.255.0 UG 0 0 0 vtep1024
9.0.5.0 44.128.0.6 255.255.255.0 UG 0 0 0 vtep1024
9.0.5.128 0.0.0.0 255.255.255.128 U 0 0 0 d-dcos
9.0.6.0 44.128.0.7 255.255.255.0 UG 0 0 0 vtep1024
44.128.0.0 0.0.0.0 255.255.240.0 U 0 0 0 vtep1024
172.17.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
172.18.0.0 0.0.0.0 255.255.0.0 U 0 0 0 d-dcos6
192.168.80.0 0.0.0.0 255.255.255.0 U 0 0 0 eth1
192.168.80.0 0.0.0.0 255.255.255.0 U 100 0 0 eth1
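To see which overlay gateways respond, the vtep1024 routes above can be probed in a loop. A sketch, not from the thread; vtep_gateways is a hypothetical helper, and the ping flags assume Linux iputils:

```shell
#!/bin/sh
# Sketch: extract the vtep1024 gateway addresses from `route -n` output,
# skipping directly-connected (0.0.0.0) entries, so each one can be pinged.
vtep_gateways() {
  # $1: output of `route -n`; prints the gateway column for vtep1024 routes.
  printf '%s\n' "$1" | awk '$8 == "vtep1024" && $2 != "0.0.0.0" { print $2 }' | sort -u
}

# Usage on a node (requires the overlay to be up):
# for gw in $(vtep_gateways "$(route -n)"); do
#   ping -c1 -W1 "$gw" >/dev/null 2>&1 && echo "$gw ok" || echo "$gw unreachable"
# done
```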
Another error message on the node, relevant to the DC/OS overlay:
Mar 23 17:54:00 tau.redpine.com dcos-net-env[2508]: 17:54:00.950 [error] Supervisor dcos_overlay_sup had child dcos_overlay_lashup_kv_listener started with dcos_overlay_lashup_kv_listener:start_link() at <0.8663.12> exit with reason no match of right hand value {error,3942645759,[]} in dcos_overlay_configure:configure_overlay_entry/4 line 112 in context child_terminated
Mar 23 17:54:06 tau.redpine.com dcos-net-env[2508]: 17:54:06.229 [error] Error in process <0.8797.12> on node '[email protected]' with exit value:
Mar 23 17:54:06 tau.redpine.com dcos-net-env[2508]: {{badmatch,{error,3942645759,[]}},[{dcos_overlay_configure,configure_overlay_entry,4,[{file,"/pkg/src/dcos-net/_build/prod/lib/dcos_overlay/src/dcos_overlay_configure.erl"},{line,112}]},{lists,map,2,[{file,"lists.erl"},{line,1239}]},{lists,map,2,[{file,"lists.erl"},{line,1239}]},{dcos_overlay_configure,maybe_configure,2,[{file,"/pkg/src/dcos-net/_build/prod/lib/dcos_overlay/src/dcos_overlay_configure.erl"},{line,38}]}]}
Based upon the above error, I followed some advice on how to correct it. I ran the following steps on all my nodes and my problems have gone away:
sudo systemctl stop dcos-net
sudo rm -rf /var/lib/dcos/navstar/
sudo systemctl start dcos-net
This fixed both the pod startup issues on restarts of just kubernetes in DC/OS and access to the dashboard from kubernetes-proxy.
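For anyone wanting to apply the same reset across a cluster, a sketch follows; it assumes SSH access with passwordless sudo on each agent, and both the reset_dcos_net_cmd helper and the node IPs are illustrative:

```shell
#!/bin/sh
# Sketch: apply the dcos-net state reset from the comment above to every node.
reset_dcos_net_cmd() {
  # Prints the remote command sequence; factored out so it can be inspected
  # before running it against real machines.
  echo 'sudo systemctl stop dcos-net && sudo rm -rf /var/lib/dcos/navstar/ && sudo systemctl start dcos-net'
}

# Usage (node list is an example; substitute your agents):
# for node in 192.168.80.21 192.168.80.22 192.168.80.23; do
#   ssh "$node" "$(reset_dcos_net_cmd)"
# done
```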
This is great feedback @redpine42! Thank you very much. @shaneutt maybe you are interested in adding this to your troubleshooting resources?