
glenselle avatar glenselle commented on June 28, 2024

Not sure if this may be related: https://github.com/coreos/bugs/issues/825

from corekube.

metral avatar metral commented on June 28, 2024

Thank you for filing an issue.

From the logs it seems that the master is not able to reach the nodes in
the cluster advertised by the discovery service in the discovery node
(which runs etcd). Flannel too depends on etcd so the discovery node being
down or encountering issues seems like the source of the problem.

Check the discovery node to see whether the etcd container is in fact
running, and restart its service if it is not.

Assuming etcd is operational, run fleetctl list-machines on the master
to see whether those nodes are visible. If they are, restart the flannel
service and check whether the overlord has what it needs to provision the
cluster.
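The checks above, sketched as a shell helper. This is an assumption-laden sketch, not a drop-in script: the container name (`discovery`) and unit name (`flannel.service`) are taken from the logs in this thread, and each command runs on the node named in the comment.

```shell
# Sketch of the recovery checks described above. Assumes the names used in
# this thread: a container called "discovery" on the discovery node, and a
# "flannel.service" unit on the master.
recover_checks() {
  # On the discovery node: is the etcd container up?
  docker ps --filter name=discovery
  # Restart it only if it is down:
  # docker restart discovery

  # On the master: did the nodes register with fleet?
  fleetctl list-machines

  # If they are visible, bounce flannel and re-read its recent logs.
  sudo systemctl restart flannel.service
  journalctl -u flannel.service --no-pager -n 20
}
```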

If none of this works, or it seems daunting and you just want to get
something working, simply create a new cluster from scratch - there are a
lot of moving parts in the project, it isn't battle-hardened, and
obscurities in cloud provisioning aren't uncommon.

Hope this helps

On Thursday, February 4, 2016, Glen Selle [email protected] wrote:

Not sure if this may be related: coreos/bugs#825


Reply to this email directly or view it on GitHub
#23 (comment).

-Mike Metral


glenselle avatar glenselle commented on June 28, 2024

Thanks for the response. First, this is not a production cluster. I'm literally copy-pasting the template into the orchestration service and creating the stack, with no success. Can you verify it works for you?

Here are some more logs from kubernetes-master:

kubernetes-master ~ # systemctl status flannel -l
● flannel.service
   Loaded: loaded (/etc/systemd/system/flannel.service; static; vendor preset: disabled)
   Active: active (running) since Thu 2016-02-04 17:03:25 UTC; 11min ago
  Process: 1314 ExecStartPre=/usr/bin/etcdctl mk /coreos.com/network/config {"Network":"10.244.0.0/15", "Backend": {"Type": "vxlan"}} (code=exited, status=0/SUCCESS)
 Main PID: 1321 (flanneld)
   CGroup: /system.slice/flannel.service
           └─1321 /opt/bin/flanneld -iface=eth2

Feb 04 17:03:25 kubernetes-master flanneld[1321]: I0204 17:03:25.871428 01321 main.go:275] Installing signal handlers
Feb 04 17:03:25 kubernetes-master flanneld[1321]: I0204 17:03:25.880686 01321 main.go:188] Using 192.168.3.4 as external interface
Feb 04 17:03:25 kubernetes-master flanneld[1321]: I0204 17:03:25.885072 01321 main.go:189] Using 192.168.3.4 as external endpoint
Feb 04 17:03:25 kubernetes-master flanneld[1321]: I0204 17:03:25.906731 01321 etcd.go:204] Picking subnet in range 10.244.1.0 ... 10.245.255.0
Feb 04 17:03:25 kubernetes-master flanneld[1321]: I0204 17:03:25.913410 01321 etcd.go:84] Subnet lease acquired: 10.244.5.0/24
Feb 04 17:03:25 kubernetes-master flanneld[1321]: I0204 17:03:25.917238 01321 vxlan.go:153] Watching for L3 misses
Feb 04 17:03:25 kubernetes-master flanneld[1321]: I0204 17:03:25.918510 01321 vxlan.go:159] Watching for new subnet leases
Feb 04 17:03:25 kubernetes-master flanneld[1321]: I0204 17:03:25.919716 01321 vxlan.go:273] Handling initial subnet events
Feb 04 17:03:25 kubernetes-master flanneld[1321]: I0204 17:03:25.919731 01321 device.go:159] calling GetL2List() dev.link.Index: 5
Feb 04 17:13:26 kubernetes-master flanneld[1321]: E0204 17:13:26.596561 01321 watch.go:41] Watch subnets: client: etcd cluster is unavailable or misconfigured
kubernetes-master ~ # systemctl status etcd -l
● etcd.service - etcd
   Loaded: loaded (/usr/lib64/systemd/system/etcd.service; static; vendor preset: disabled)
  Drop-In: /run/systemd/system/etcd.service.d
           └─10-oem.conf
   Active: active (running) since Thu 2016-02-04 17:03:25 UTC; 11min ago
 Main PID: 1291 (etcd)
   CGroup: /system.slice/etcd.service
           └─1291 /usr/bin/etcd

Feb 04 17:03:25 kubernetes-master systemd[1]: Started etcd.
Feb 04 17:03:25 kubernetes-master etcd[1291]: [etcd] Feb  4 17:03:25.269 INFO      | b45fd7970b9d4961847d5de0d36224be is starting a new cluster
Feb 04 17:03:25 kubernetes-master etcd[1291]: [etcd] Feb  4 17:03:25.274 INFO      | etcd server [name b45fd7970b9d4961847d5de0d36224be, listen on :4001, advertised url http://127.0.0.1:4001]
Feb 04 17:03:25 kubernetes-master etcd[1291]: [etcd] Feb  4 17:03:25.277 INFO      | peer server [name b45fd7970b9d4961847d5de0d36224be, listen on :7001, advertised url http://127.0.0.1:7001]
Feb 04 17:03:25 kubernetes-master etcd[1291]: [etcd] Feb  4 17:03:25.279 INFO      | b45fd7970b9d4961847d5de0d36224be starting in peer mode
Feb 04 17:03:25 kubernetes-master etcd[1291]: [etcd] Feb  4 17:03:25.281 INFO      | b45fd7970b9d4961847d5de0d36224be: state changed from 'initialized' to 'follower'.
Feb 04 17:03:25 kubernetes-master etcd[1291]: [etcd] Feb  4 17:03:25.282 INFO      | b45fd7970b9d4961847d5de0d36224be: state changed from 'follower' to 'leader'.
Feb 04 17:03:25 kubernetes-master etcd[1291]: [etcd] Feb  4 17:03:25.282 INFO      | b45fd7970b9d4961847d5de0d36224be: leader changed from '' to 'b45fd7970b9d4961847d5de0d36224be'.

The master machine shows this when fleetctl list-machines:

kubernetes-master ~ # fleetctl list-machines
MACHINE     IP      METADATA
b45fd797... 10.209.69.190   kubernetes_role=master

The discovery machine has its container up and running:

discovery ~ # docker ps -a
CONTAINER ID        IMAGE                        COMMAND                  CREATED             STATUS              PORTS               NAMES
faa50d1b8fd9        quay.io/coreos/etcd:v2.2.2   "/etcd -name discover"   22 minutes ago      Up 22 minutes                           discovery

And when I try to dump the logs from the discovery machine I get this:

 discovery ~ # docker logs discovery 
"logs" command is supported only for "json-file" logging driver (got: journald)
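(For reference, a hedged workaround for that error: under Docker's journald log driver, each container's output is written to the systemd journal tagged with a CONTAINER_NAME field, so it can be read back with journalctl instead of docker logs. A sketch, assuming the container is named `discovery` as above:)

```shell
# Workaround sketch: with the journald log driver, query the journal for
# the container's output by its CONTAINER_NAME field.
discovery_logs() {
  journalctl CONTAINER_NAME=discovery --no-pager
}
```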

I swapped in a newer CoreOS version so I could dump the logs from the discovery container, but that throws all the other dependencies out of whack. Is there something special I have to do? And can you verify that using the Rackspace Control Panel orchestration should work fine? I'm not using any of the OpenStack CLI tools.


glenselle avatar glenselle commented on June 28, 2024

Running the template with CoreOS 835.9.0 yields the following logs for the discovery machine:

discovery ~ # docker logs discovery
2016-02-04 17:29:15.573550 I | etcdmain: etcd Version: 2.2.2
2016-02-04 17:29:15.573668 I | etcdmain: Git SHA: b4bddf6
2016-02-04 17:29:15.573776 I | etcdmain: Go Version: go1.5.1
2016-02-04 17:29:15.573821 I | etcdmain: Go OS/Arch: linux/amd64
2016-02-04 17:29:15.573867 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2016-02-04 17:29:15.573910 W | etcdmain: no data-dir provided, using default data-dir ./discovery.etcd
2016-02-04 17:29:15.574084 I | etcdmain: listening for peers on http://10.210.198.131:2380
2016-02-04 17:29:15.574198 I | etcdmain: listening for peers on http://10.210.198.131:7001
2016-02-04 17:29:15.574286 I | etcdmain: listening for client requests on http://0.0.0.0:2379
2016-02-04 17:29:15.574364 I | etcdmain: listening for client requests on http://0.0.0.0:4001
2016-02-04 17:29:15.574648 I | etcdserver: name = discovery
2016-02-04 17:29:15.574696 I | etcdserver: data dir = discovery.etcd
2016-02-04 17:29:15.574732 I | etcdserver: member dir = discovery.etcd/member
2016-02-04 17:29:15.574768 I | etcdserver: heartbeat = 100ms
2016-02-04 17:29:15.574822 I | etcdserver: election = 1000ms
2016-02-04 17:29:15.574871 I | etcdserver: snapshot count = 10000
2016-02-04 17:29:15.574912 I | etcdserver: advertise client URLs = http://10.210.198.131:2379,http://10.210.198.131:4001
2016-02-04 17:29:15.574950 I | etcdserver: initial advertise peer URLs = http://10.210.198.131:2380,http://10.210.198.131:7001
2016-02-04 17:29:15.575003 I | etcdserver: initial cluster = discovery=http://10.210.198.131:2380,discovery=http://10.210.198.131:7001
2016-02-04 17:29:15.578138 I | etcdserver: starting member 6c08b731013f69ff in cluster 5191546481e15bf6
2016-02-04 17:29:15.578256 I | raft: 6c08b731013f69ff became follower at term 0
2016-02-04 17:29:15.578354 I | raft: newRaft 6c08b731013f69ff [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2016-02-04 17:29:15.578424 I | raft: 6c08b731013f69ff became follower at term 1
2016-02-04 17:29:15.578778 I | etcdserver: starting server... [version: 2.2.2, cluster version: to_be_decided]
2016-02-04 17:29:15.580272 N | etcdserver: added local member 6c08b731013f69ff [http://10.210.198.131:2380 http://10.210.198.131:7001] to cluster 5191546481e15bf6
2016-02-04 17:29:16.079065 I | raft: 6c08b731013f69ff is starting a new election at term 1
2016-02-04 17:29:16.079281 I | raft: 6c08b731013f69ff became candidate at term 2
2016-02-04 17:29:16.079402 I | raft: 6c08b731013f69ff received vote from 6c08b731013f69ff at term 2
2016-02-04 17:29:16.079500 I | raft: 6c08b731013f69ff became leader at term 2
2016-02-04 17:29:16.079612 I | raft: raft.node: 6c08b731013f69ff elected leader 6c08b731013f69ff at term 2
2016-02-04 17:29:16.080383 I | etcdserver: published {Name:discovery ClientURLs:[http://10.210.198.131:2379 http://10.210.198.131:4001]} to cluster 5191546481e15bf6
2016-02-04 17:29:16.080541 I | etcdserver: setting up the initial cluster version to 2.2
2016-02-04 17:29:16.084480 N | etcdserver: set the initial cluster version to 2.2

The overlord logs show:

overlord ~ # docker logs overlord
2016/02/04 17:32:47 ------------------------------------------------------------
2016/02/04 17:32:47 Current # of machines seen/deployed to: (0)
2016/02/04 17:32:47 ------------------------------------------------------------
2016/02/04 17:32:47 Current # of machines discovered: (0)
2016/02/04 17:32:48 ------------------------------------------------------------
2016/02/04 17:32:48 Current # of machines seen/deployed to: (0)
2016/02/04 17:32:48 ------------------------------------------------------------
2016/02/04 17:32:48 Current # of machines discovered: (0)
2016/02/04 17:32:49 ------------------------------------------------------------
2016/02/04 17:32:49 Current # of machines seen/deployed to: (0)
2016/02/04 17:32:49 ------------------------------------------------------------

The kubernetes master node is showing the following:

kubernetes-master ~ # etcdctl cluster-health 
cluster may be unhealthy: failed to list members
Error:  unexpected status code 404
kubernetes-master ~ # journalctl -u etcd.service
-- Logs begin at Thu 2016-02-04 17:29:20 UTC, end at Thu 2016-02-04 17:37:02 UTC. --
Feb 04 17:29:34 kubernetes-master systemd[1]: Started etcd.
Feb 04 17:29:34 kubernetes-master etcd[1246]: [etcd] Feb  4 17:29:34.752 INFO      | d2bc042a0f624cdda5df4f2ac15375c6 is starting a new cluster
Feb 04 17:29:34 kubernetes-master etcd[1246]: [etcd] Feb  4 17:29:34.754 INFO      | etcd server [name d2bc042a0f624cdda5df4f2ac15375c6, listen on :4001, advertised url http://127.0
Feb 04 17:29:34 kubernetes-master etcd[1246]: [etcd] Feb  4 17:29:34.756 INFO      | peer server [name d2bc042a0f624cdda5df4f2ac15375c6, listen on :7001, advertised url http://127.0
Feb 04 17:29:34 kubernetes-master etcd[1246]: [etcd] Feb  4 17:29:34.757 INFO      | d2bc042a0f624cdda5df4f2ac15375c6 starting in peer mode
Feb 04 17:29:34 kubernetes-master etcd[1246]: [etcd] Feb  4 17:29:34.758 INFO      | d2bc042a0f624cdda5df4f2ac15375c6: state changed from 'initialized' to 'follower'.
Feb 04 17:29:34 kubernetes-master etcd[1246]: [etcd] Feb  4 17:29:34.758 INFO      | d2bc042a0f624cdda5df4f2ac15375c6: state changed from 'follower' to 'leader'.
Feb 04 17:29:34 kubernetes-master etcd[1246]: [etcd] Feb  4 17:29:34.759 INFO      | d2bc042a0f624cdda5df4f2ac15375c6: leader changed from '' to 'd2bc042a0f624cdda5df4f2ac15375c6'.
kubernetes-master ~ # journalctl -u flannel.service
-- Logs begin at Thu 2016-02-04 17:29:20 UTC, end at Thu 2016-02-04 17:37:07 UTC. --
Feb 04 17:29:41 kubernetes-master systemd[1]: Starting flannel.service...
Feb 04 17:29:41 kubernetes-master etcdctl[1277]: Error:  unexpected status code 404
Feb 04 17:29:41 kubernetes-master systemd[1]: flannel.service: Control process exited, code=exited status=4
Feb 04 17:29:41 kubernetes-master systemd[1]: Failed to start flannel.service.
Feb 04 17:29:41 kubernetes-master systemd[1]: flannel.service: Unit entered failed state.
Feb 04 17:29:41 kubernetes-master systemd[1]: flannel.service: Failed with result 'exit-code'.
Feb 04 17:29:46 kubernetes-master systemd[1]: flannel.service: Service hold-off time over, scheduling restart.
Feb 04 17:29:46 kubernetes-master systemd[1]: Stopped flannel.service.
Feb 04 17:29:46 kubernetes-master systemd[1]: Starting flannel.service...
Feb 04 17:29:46 kubernetes-master etcdctl[1284]: Error:  unexpected status code 404
Feb 04 17:29:46 kubernetes-master systemd[1]: flannel.service: Control process exited, code=exited status=4
Feb 04 17:29:46 kubernetes-master systemd[1]: Failed to start flannel.service.
Feb 04 17:29:46 kubernetes-master systemd[1]: flannel.service: Unit entered failed state.
Feb 04 17:29:46 kubernetes-master systemd[1]: flannel.service: Failed with result 'exit-code'.
Feb 04 17:29:52 kubernetes-master systemd[1]: flannel.service: Service hold-off time over, scheduling restart.
Feb 04 17:29:52 kubernetes-master systemd[1]: Stopped flannel.service.
Feb 04 17:29:52 kubernetes-master systemd[1]: Starting flannel.service...
Feb 04 17:29:52 kubernetes-master etcdctl[1295]: Error:  unexpected status code 404
Feb 04 17:29:52 kubernetes-master systemd[1]: flannel.service: Control process exited, code=exited status=4
Feb 04 17:29:52 kubernetes-master systemd[1]: Failed to start flannel.service.


metral avatar metral commented on June 28, 2024

I just copied and pasted the template https://github.com/metral/corekube/blob/master/corekube-cloudservers.yaml into the Rackspace Control Panel Orchestration/Stacks in the IAD region, and I can confirm that it is in fact working as is.


metral avatar metral commented on June 28, 2024

The master & discovery look fine in your first go-around.

On the second attempt, you shouldn't have to update the CoreOS version to get it working, but it looks like none of the nodes are finding the discovery service - it's very odd that you're hitting this twice.

What region are you operating in? Have you tried switching it? Regions sometimes undergo maintenance or upgrades, which could explain the inconsistencies we're seeing.


glenselle avatar glenselle commented on June 28, 2024

I've been doing everything in ORD. Interestingly, running the template in IAD yields much better results:

CoreOS stable (808.0.0)
Update Strategy: No Reboots

overlord ~ # etcdctl cluster-health 
unexpected status code 404

overlord ~ # fleetctl list-machines
MACHINE     IP      METADATA
6c98cbbe... 10.176.135.181  kubernetes_role=minion
7dd8042e... 10.176.199.143  kubernetes_role=minion
9251561e... 10.176.132.211  kubernetes_role=minion
c5778c8b... 10.176.133.64   kubernetes_role=master
eb2aa136... 10.209.68.196   kubernetes_role=overlord

overlord ~ # journalctl -u etcd.service
-- Logs begin at Thu 2016-02-04 19:03:11 UTC, end at Thu 2016-02-04 19:04:32 UTC. --
Feb 04 19:03:24 overlord systemd[1]: Started etcd.
Feb 04 19:03:24 overlord etcd[1197]: [etcd] Feb  4 19:03:24.947 INFO      | Discovery via http://10.176.130.52:2379 using prefix discovery/2z9RJugdpvAtNG1Qpgo2t361YMb7tBLN.
Feb 04 19:03:26 overlord etcd[1197]: [etcd] Feb  4 19:03:26.157 INFO      | Discovery found peers [http://10.176.133.64:7001]
Feb 04 19:03:26 overlord etcd[1197]: [etcd] Feb  4 19:03:26.157 INFO      | Discovery fetched back peer list: [10.176.133.64:7001]
Feb 04 19:03:27 overlord etcd[1197]: [etcd] Feb  4 19:03:27.167 INFO      | Send Join Request to http://10.176.133.64:7001/join
Feb 04 19:03:28 overlord etcd[1197]: [etcd] Feb  4 19:03:28.259 INFO      | overlord joined the cluster via peer 10.176.133.64:7001
Feb 04 19:03:28 overlord etcd[1197]: [etcd] Feb  4 19:03:28.263 INFO      | etcd server [name overlord, listen on :4001, advertised url http://10.209.68.196:4001]
Feb 04 19:03:28 overlord etcd[1197]: [etcd] Feb  4 19:03:28.263 INFO      | peer server [name overlord, listen on :7001, advertised url http://10.209.68.196:7001]
Feb 04 19:03:28 overlord etcd[1197]: [etcd] Feb  4 19:03:28.263 INFO      | overlord starting in peer mode
Feb 04 19:03:28 overlord etcd[1197]: [etcd] Feb  4 19:03:28.263 INFO      | overlord: state changed from 'initialized' to 'follower'.
Feb 04 19:03:28 overlord etcd[1197]: [etcd] Feb  4 19:03:28.272 INFO      | overlord: peer added: 'kubernetes_master'
Feb 04 19:03:41 overlord etcd[1197]: [etcd] Feb  4 19:03:41.065 INFO      | overlord: peer added: 'kubernetes_minion_0'
Feb 04 19:03:41 overlord etcd[1197]: [etcd] Feb  4 19:03:41.316 INFO      | overlord: peer added: 'kubernetes_minion_1'
Feb 04 19:03:46 overlord etcd[1197]: [etcd] Feb  4 19:03:46.008 INFO      | overlord: peer added: 'kubernetes_minion_2'
Feb 04 19:03:57 overlord etcd[1197]: [etcd] Feb  4 19:03:57.175 INFO      | overlord: warning: heartbeat near election timeout: 1.131447065s

Is that 404 from etcdctl an issue, given that everything else looks fine?


glenselle avatar glenselle commented on June 28, 2024

Also, is there a reason the template defaults to a 4 GB General Purpose v1 machine? Am I going to cause issues if I drop down to a 2GB or even a 1GB machine?


metral avatar metral commented on June 28, 2024

No, the 404 is not an issue - you'll get that from any node that isn't itself the discovery node, because etcdctl's endpoint defaults to 127.0.0.1:4001. To get the command working, point etcdctl at the discovery node directly using its ServiceNet IP and port:

etcdctl --endpoint="<SERVICE_NET_IP>:4001" cluster-health
--> member 61dbeb358f6ab43c is healthy: got healthy result from http://<SERVICE_NET_IP>:2379
--> cluster is healthy


metral avatar metral commented on June 28, 2024

As far as flavor sizes go, 4GB was the minimum required to get the k8s guestbook example's Redis server to work - I was hitting RAM issues in the containers with anything less.


glenselle avatar glenselle commented on June 28, 2024

Ah. I'm guessing my issues were caused by the CoreOS 835.9.0 image I was using. Thanks for the pointers. Closing, since this was my issue.

