cloudnativelabs / kube-router Goto Github PK

View Code? Open in Web Editor NEW

2.2K 2.2K 463.0 76.09 MB

Kube-router, a turnkey solution for Kubernetes networking.

Home Page: https://kube-router.io

License: Apache License 2.0

Makefile 1.24% Go 97.88% Shell 0.54% Dockerfile 0.33% Vim Script 0.01%

bgp docker gobgp iptables ipvs kubernetes kubernetes-networking kubernetes-service

kube-router's People

Contributors

Stargazers

Watchers

Forkers

rimusz-lab awesome-golang awesome-docker zheng-yang wearefair dd-repo toecuta keontang markjacksonfishing cacalote yangchuansheng bzub diannaowa zjj2wry felipejfc thoro nuaays sxauyhz goooogs alexzheng0000 ryarnyah qinzhao168 jeffchelko heidsoft pilsy digitalocean dl00 sun363587351 andrewsykim dapengjiao lzbgt guangyouyu martinyunify mavenugo drobinson123 cjbrigato cvhbsk saurabhtmba2005 hzxuzhonghu barthoda cmuh youngdou guoyouzhang zhengweisk jeremypogue x82423990 ivanthelad sejeff sysbot heyibo2003 cherry-hyx chuangtse hongjunni dougmack xlogin tvl2386 gcloudnative rootzhanggb murali-reddy huyipow sergeylanzman jpds etsangsplk alebsack cmingxu gruizesteban ut0mt8 jjo lavajnv markleckie chinakevinguo dlamotte hetaixiang killcity farcaller jsrz xanonid yinwenqin itc3-devops itc3admin jengo bretagne-peiqi jetlwx johanot asa-taka adhipati-blambangan raghu999 muhazzz steigr rmenn zhenghc es lubinsz dongjun666 vadorovsky higreg marrobi mwiesmueller andor44 jdconti

kube-router's Issues

Create a kube-router Helm Chart

To increase testing/adoption and further automate and standardize configuration/deployment. It should probably live in the official Chart repository for maximum exposure and help with issues, so it won't be added to this repository in that case. I will work on this and close this ticket once it's added to the official Chart repo.

References

Helm

Bootkube example documentation

I've successfully created a new Kubernetes cluster with Bootkube using kube-router instead of kube-proxy/flannel. Creating this issue to remind me to document how it was done and share with this project. Thanks!

Full-mesh BGP requires restart of kube-router

I created a brand new kubernetes cluster with kube-router managing pod-to-pod networking, service proxy, and namespace firewall. Everything appears OK with the first node (node1-dev), however I get different results when adding additional nodes (node2-dev), specifically with full BGP mesh setup.

IPAM works, service discovery/IPVS seem to work. But communication on non-publically routable IPs (pod/service cidr) between nodes does not work.

I see in the kube-router logs for node1-dev it seems to detect a peering attempt with node2-dev, however I see the following error:

time="2017-05-25T01:40:54Z" level=info msg="Can't find configuration for a new passive connection from:10.10.3.2" Topic=Peer

10.10.3.2 is the IP of node2-dev.

To hopefully help with troubleshooting, here's some ip/gobgp output pertaining to both nodes.

node1-dev ip route and ip addr

default via 10.10.10.1 dev enp0s25  proto static
10.2.0.0/24 dev kube-bridge  proto kernel  scope link  src 10.2.0.1
10.10.0.0/16 dev enp0s25  proto kernel  scope link  src 10.10.3.1
blackhole 10.10.150.1
blackhole 10.10.250.1
172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.0.1


1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet 10.10.250.1/32 brd 10.10.250.1 scope global lo
       valid_lft forever preferred_lft forever
    inet 10.10.150.1/32 brd 10.10.150.1 scope global lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp0s25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:1e:4f:92:b2:38 brd ff:ff:ff:ff:ff:ff
    inet 10.10.3.1/16 brd 10.10.255.255 scope global enp0s25
       valid_lft forever preferred_lft forever
    inet6 fe80::21e:4fff:fe92:b238/64 scope link
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:6b:0c:bc:c8 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever
4: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 92:dd:c8:bc:f2:53 brd ff:ff:ff:ff:ff:ff
5: kube-dummy-if: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
    link/ether 3e:de:1f:dd:7f:28 brd ff:ff:ff:ff:ff:ff
    inet 10.3.0.10/32 scope link kube-dummy-if
       valid_lft forever preferred_lft forever
    inet 10.3.0.1/32 scope link kube-dummy-if
       valid_lft forever preferred_lft forever
    inet6 fe80::3cde:1fff:fedd:7f28/64 scope link
       valid_lft forever preferred_lft forever
6: kube-bridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 0a:58:0a:02:00:01 brd ff:ff:ff:ff:ff:ff
    inet 10.2.0.1/24 scope global kube-bridge
       valid_lft forever preferred_lft forever
    inet6 fe80::f4a2:f4ff:feea:2a18/64 scope link
       valid_lft forever preferred_lft forever
7: veth9683a0b6@docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master kube-bridge state UP group default
    link/ether e6:a5:1d:ef:e7:8d brd ff:ff:ff:ff:ff:ff
    inet6 fe80::e4a5:1dff:feef:e78d/64 scope link
       valid_lft forever preferred_lft forever
8: vethdbacf2a7@docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master kube-bridge state UP group default
    link/ether 96:37:ff:9c:91:7e brd ff:ff:ff:ff:ff:ff
    inet6 fe80::9437:ffff:fe9c:917e/64 scope link
       valid_lft forever preferred_lft forever
9: veth06603cee@docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master kube-bridge state UP group default
    link/ether 02:42:98:1e:6f:35 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::42:98ff:fe1e:6f35/64 scope link
       valid_lft forever preferred_lft forever
10: vetha1ffb2fd@docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master kube-bridge state UP group default
    link/ether c2:d9:d8:93:af:ff brd ff:ff:ff:ff:ff:ff
    inet6 fe80::c0d9:d8ff:fe93:afff/64 scope link
       valid_lft forever preferred_lft forever
11: vethbc1e38cc@docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master kube-bridge state UP group default
    link/ether 16:37:4f:d2:43:61 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1437:4fff:fed2:4361/64 scope link
       valid_lft forever preferred_lft forever

node2-dev ip route and ip addr

default via 10.10.10.1 dev enp0s25  proto static
10.10.0.0/16 dev enp0s25  proto kernel  scope link  src 10.10.3.2
blackhole 10.10.150.1
blackhole 10.10.250.1
172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.0.1


1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet 10.10.250.1/32 brd 10.10.250.1 scope global lo
       valid_lft forever preferred_lft forever
    inet 10.10.150.1/32 brd 10.10.150.1 scope global lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp0s25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:1e:4f:92:ad:ef brd ff:ff:ff:ff:ff:ff
    inet 10.10.3.2/16 brd 10.10.255.255 scope global enp0s25
       valid_lft forever preferred_lft forever
    inet6 fe80::21e:4fff:fe92:adef/64 scope link
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:e7:ce:9d:9b brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever
4: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 4a:bd:98:93:1b:19 brd ff:ff:ff:ff:ff:ff
5: kube-dummy-if: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
    link/ether c6:31:ca:49:e8:2a brd ff:ff:ff:ff:ff:ff
    inet 10.3.0.10/32 scope link kube-dummy-if
       valid_lft forever preferred_lft forever
    inet 10.3.0.1/32 scope link kube-dummy-if
       valid_lft forever preferred_lft forever
    inet6 fe80::c431:caff:fe49:e82a/64 scope link
       valid_lft forever preferred_lft forever

gobgp neighbor for both nodes:

$GOPATH/bin/gobgp -u node1-dev.zbrbdl neighbor; echo "---"; $GOPATH/bin/gobgp -u node2-dev.zbrbdl neighbor
Peer AS Up/Down State       |#Received  Accepted
---
Peer         AS Up/Down State       |#Received  Accepted
10.10.3.1 64512   never Active      |        0         0

pod networking between the nodes in different subnets

Current model host gateway based routing of Kube-router assumes nodes are L2 adjacent or in same subnet. this should work fine for most cases. But if we have a cluster with nodes in different subnets, i.e nodes are not in L2 adjacent., then host routing based approach for cross node pod-to-pod connectivity does not work.

Perhaps easy way to address is to use VXLAN encapsulation. So routes are set such that if destination node is in not in same subnet then use the interface to VXLAN encap.

metrics: Add port configuration and Service definition

For #66 - Phase 1

Add a --metrics-port option to change the port.
- --metrics-port=0 means disabled. Default port should be uncommon.
Provide a k8s Service definition for kube-router metrics port to our example manifests

Dynamic load balancing

IPVS has this nice LB method http://kb.linuxvirtualserver.org/wiki/Dynamic_Feedback_Load_Balancing_Scheduling

Which is all the more relevent in case of distrubuted load balancing requirements of 'ClusterIP' and NodePort service types. Each node doing load balancing in round robin fashion has below limitations.

if connections are mixture of short lived and long lived then we will end up in imbalance
each node is acting on the local connection information it has. Due to distrubuted nature each node ends up doing load balancing without taking account of other nodes or global state

Here is the proposal:

Implement monitor daemon that collects connection per endpoint from the node using contrack
use in-memory distrubuted key-value store to persist the data.
use the global view of connections to dynamically adjust the weights on each node

update kops documentation, dont need user setting src-dest check to false

https://github.com/kubernetes/kops/blob/master/docs/networking.md#kube-router-example-for-cni-ipvs-based-service-proxy-and-network-policy-enforcer

requiremnet to set source-destination check to false is not required. Kube-router automatically does it #35 . Raise PR in KOPS once we make new release of kube-router.

Advertise the cluster IP when service is added to peers through GoBgp when service is added

In some use-cases its desirable for external (outside cluster) access for the cluster IP's. While NodePort can be used its not convenient to use non standard node ports. Its more familiar to use some thing like cluster-ip:80 thank node-ip:node-port.

Add a flag, when true, add a route to RIB that GoBgp can advertise to its peers. Ofcourse we will endup every node advertising the cluster IP. Upstream routers can use ECMP to load balance.

ERROR: logging before flag.Parse

With glog updates we're running into this bug, kubernetes/kubernetes#17162

There's a workaround in there that I'll try out.

Support Services with externalIPs

When a Service has externalIPs defined, kube-proxy binds the service ports to those IPs if they exist on the node it's running on. Although with iBGP peering a network admin is able to expose service IPs, it may still be beneficial to users to specify an IP that's not in the service IP CIDR. Adding this support to kube-router would also streamline transitioning away from kube-proxy.

References

Kubernetes user documentation: service/#external-ips
kube-proxy implementation: proxier.go

on BGP peer down, advertised route from peer is not cleaned-up in local routing table

@rmb938 reported that on BGP peer disconnect, advertised route from peer is not cleanup in local routing table. Kube-router should listen to BGP disconnect and remove the routes.

Document how to contribute to the project (CONTRIBUTING.md)

runtime: program exceeds 10000-thread limit

Threads seem to be piling up on my kube-router pods and it is eventually killed and restarted.

I0607 14:14:54.708042       1 network_services_controller.go:407] ipvs service 10.3.0.15:tcp:2379 already exists so returning
I0607 14:14:54.885319       1 network_services_controller.go:448] ipvs destination 10.10.2.1:2379 already exists in the ipvs service 10.3.0.15:tcp:2379 so not adding destination
I0607 14:14:55.074909       1 network_services_controller.go:448] ipvs destination 10.10.2.2:2379 already exists in the ipvs service 10.3.0.15:tcp:2379 so not adding destination
I0607 14:14:55.249060       1 network_services_controller.go:448] ipvs destination 10.10.2.3:2379 already exists in the ipvs service 10.3.0.15:tcp:2379 so not adding destination
I0607 14:14:55.341417       1 network_policy_controller.go:94] Performing periodic syn of the iptables to reflect network policies
I0607 14:14:55.342273       1 network_services_controller.go:103] Performing periodic syn of the ipvs services and server to reflect desired state of kubernetes services and endpoints
runtime: program exceeds 10000-thread limit
fatal error: thread exhaustion

runtime stack:
runtime.throw(0x182a6f1, 0x11)
        /usr/local/go/src/runtime/panic.go:566 +0x95
runtime.checkmcount()
        /usr/local/go/src/runtime/proc.go:486 +0xa4
runtime.mcommoninit(0xc44146bc00)
        /usr/local/go/src/runtime/proc.go:506 +0xd5
runtime.allocm(0xc42001f500, 0x190e5c0, 0xc400000001)
        /usr/local/go/src/runtime/proc.go:1286 +0x9b
runtime.newm(0x190e5c0, 0xc42001f500)
        /usr/local/go/src/runtime/proc.go:1555 +0x39
runtime.startm(0xc42001f500, 0x100000001)
        /usr/local/go/src/runtime/proc.go:1642 +0x181
runtime.wakep()
        /usr/local/go/src/runtime/proc.go:1723 +0x57
runtime.resetspinning()
        /usr/local/go/src/runtime/proc.go:2039 +0x8b
runtime.schedule()
        /usr/local/go/src/runtime/proc.go:2127 +0x136
runtime.mstart1()
        /usr/local/go/src/runtime/proc.go:1136 +0xd8
runtime.mstart()
        /usr/local/go/src/runtime/proc.go:1096 +0x64

goroutine 1 [chan receive, 27 minutes]:
github.com/cloudnativelabs/kube-router/app.(*KubeRouter).Run(0xc420402560, 0xc420402560, 0x0)
        /home/kube/go/src/github.com/cloudnativelabs/kube-router/app/server.go:152 +0x228
main.main()
        /home/kube/go/src/github.com/cloudnativelabs/kube-router/kube-router.go:37 +0x13c

goroutine 17 [syscall, 27 minutes, locked to thread]:
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:2086 +0x1

goroutine 5 [syscall, 27 minutes]:
os/signal.signal_recv(0x0)
        /usr/local/go/src/runtime/sigqueue.go:116 +0x157
os/signal.loop()
        /usr/local/go/src/os/signal/signal_unix.go:22 +0x22
created by os/signal.init.1
        /usr/local/go/src/os/signal/signal_unix.go:28 +0x41

goroutine 6 [chan receive]:
github.com/cloudnativelabs/kube-router/vendor/github.com/golang/glog.(*loggingT).flushDaemon(0x246f740)
        /home/kube/go/src/github.com/cloudnativelabs/kube-router/vendor/github.com/golang/glog/glog.go:879 +0x7a
created by github.com/cloudnativelabs/kube-router/vendor/github.com/golang/glog.init.1
        /home/kube/go/src/github.com/cloudnativelabs/kube-router/vendor/github.com/golang/glog/glog.go:410 +0x21d

I've attached kube-router.log.gz which is the full log the snippet above came from.

On one of the nodes I can see the threads increasing slowly this way:

core@node1 ~ $ ps aux|grep kube-router
root     1834843  8.9  0.2 26928172 242768 ?     Ssl  14:14   0:31 /kube-router --run-router=true --run-firewall=true --run-service-proxy=true --cluster-cidr=10.2.0.0/16 --advertise-cluster-ip --cluster-asn=64512 --peer-asn=64512 --peer-router=10.10.10.33 --kubeconfig=/etc/kubernetes/kubeconfig
core     1839688  0.0  0.0   6736   936 pts/0    S+   14:20   0:00 grep --colour=auto kube-router
core@node1 ~ $ ps huH p 1834843|wc -l
2365
core@node1 ~ $ ps huH p 1834843|wc -l
2367
core@node1 ~ $ ps huH p 1834843|wc -l
2371
core@node1 ~ $ ps huH p 1834843|wc -l
2388

Use service annotations to choose IPVS load balancing method

Since Service manifest does not have support for choosing load balancing method, use service annotations to add meta data to service to specify load balancing method. Use this method details to configure ipvs service.

On side note how useful it is needs to be analyzed. Given that each node is making decision on the knowledge of connections it is aware of. Even if a node performs least connection load balancing it does not necessarily mean endpoint has least connection. As there can be connetions that are load balanced from across the cluster nodes. This is nature of distributed load balancer.

add support for Kube-roter in kops

See if kops can get support kube-router
kubernetes/kops#2606

restrict advertising all cluster IP's

Even though advertising cluster IP's is optional (with --advertise-cluster-ip). Once flag is enabled all the cluster IP's are advertised which is not desirable. For e.g Db. so use an annotation to selectivley advertise the service which user request to.

Set goreleaser to create a draft release

So we can edit the release before it's published.

Have test framework in place for CI

Quality Assurance: Have a basic test framework in place that can find some issues/regressions without manual testing. Using and passing kubernetes conformance tests would also help alleviate fears of new software.

If service manifest has "SessionAffinity" set, then configure IPVS to provide session persistence

If service manifest has "SessionAffinity" set, then configure IPVS to provide session persistence. Only persistence type supported by manifest is "ClientIP" so we just nees to provide client ip based session persistence.

AWS instances source-destination checks disabled

For AWS ec2 instances to send and recieve traffic from/to pods we need to disable source-destination check. Currently this is manual step, which can be automated.

In cluster deployers like KOPS, pod running on the master has access to EC2 API. So kube-router can detect when running on AWS, and program disabling source-destination checks if it has access to EC2 API.

This is one action item pending with KOPS integration.

Explore kube-router as ingress controller

From the TODO:

explore integration of an ingress controller so Kube-router will be one complete solution for both east-west and north-south traffic

Add release and commit tags to container images

cloudnativelabs/kube-router:latest tag should point to the latest release version.
cloudnativelabs/kube-router:master tag should point to the latest commit version.

All previous releases and commits should be available in the registry separately.

I will have to investigate how to do this automatically with every release/commit. I've done it before with quay.io registry but I haven't used Docker Hub yet.

panic on invalid net.beta.kubernetes.io/network-policy annotation

I set the net.beta.kubernetes.io/network-policy annotation to an invalid (json) value:

kubectl annotate ns test "net.beta.kubernetes.io/network-policy={\"ingress\": {\"isolation\": \"DefaultDeny\"}}"- --overwrite

This caused kube-router to panic and crashloop, but only on the nodes that had pods in the namespace the policy was applied to. Fixing the json in the annotation cleared things up.

I think I've found an unhandled Error that will lead me to a fix. I'll submit a PR soon if I find it.

Log:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x4eec82]

goroutine 205 [running]:
panic(0x1645840, 0xc420018050)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/cloudnativelabs/kube-router/app/controllers.(*NetworkPolicyController).syncPodFirewallChains(0xc42027a100, 0xc42089e7e0, 0x0, 0x0)
        /home/kube/go/src/github.com/cloudnativelabs/kube-router/app/controllers/network_policy_controller.go:323 +0x112
github.com/cloudnativelabs/kube-router/app/controllers.(*NetworkPolicyController).Sync(0xc42027a100)
        /home/kube/go/src/github.com/cloudnativelabs/kube-router/app/controllers/network_policy_controller.go:160 +0x2ac
github.com/cloudnativelabs/kube-router/app/controllers.(*NetworkPolicyController).OnPodUpdate(0xc42027a100, 0xc4203981f0)
        /home/kube/go/src/github.com/cloudnativelabs/kube-router/app/controllers/network_policy_controller.go:112 +0x17a
github.com/cloudnativelabs/kube-router/app/watchers.(*podWatcher).RegisterHandler.func1(0x1540e00, 0xc4203981f0)
        /home/kube/go/src/github.com/cloudnativelabs/kube-router/app/watchers/pods_watcher.go:65 +0x4a
github.com/cloudnativelabs/kube-router/utils.ListenerFunc.OnUpdate(0xc42066b800, 0x1540e00, 0xc4203981f0)
        /home/kube/go/src/github.com/cloudnativelabs/kube-router/utils/utils.go:14 +0x3a
created by github.com/cloudnativelabs/kube-router/utils.(*Broadcaster).Notify
        /home/kube/go/src/github.com/cloudnativelabs/kube-router/utils/utils.go:37 +0xa3

Expose node as prometheus endpoint

Lots of inforamtion is readily available (Information from conntrack, ipvs, iptables dropped packets etc) to Kube-router that can be exposed as prometheus metric.

Add capability to present metrics as prometheus endpoint.

AWS: node name mismatch preventing kube-proxy and kube-router to function properly

Both kops and bootkube seems to deploy cluster on AWS with --hostname-override flag for kubelet resulting in nodes to be registered to master with FQDN

± kubectl get nodes
NAME STATUS AGE VERSION
ip-172-20-51-2.us-west-2.compute.internal Ready,node 1h v1.6.2
ip-172-20-55-216.us-west-2.compute.internal Ready,node 1h v1.6.2
ip-172-20-61-204.us-west-2.compute.internal Ready,master 1h v1.6.2

Where as kube-proxy and kube-router just use hostname when retrieving node info from master. This mismatch results in failures of kube-proxy and kube-router.

W0528 22:32:42.648040 6 server.go:469] Failed to retrieve node info: Get https://api.internal.mycluster.aws.cloudnativelabs.net/api/v1/nodes/ip-172-20-51-2: dial tcp 203.0.113.123:443: i/o timeout

Atleast in kube-router could be changed to use safe option. First try with os.Hostname first then in case it fails try with full FQDN.

Support more than a single AS per cluster to allow AS per rack design

In specific, this project calico document shows a really common design that network engineers really like for these sorts of things. I'd like to use kube-router, but can't due to it not supporting this.

This involves each rack using a different AS

Explore using IPVS direct routing mode

Issue opened to track DR mode feature.

Initial research to put in this issue:

DR mode benefits over NAT mode
Challenges DR mode raises in different kubernetes environments

gobgp client in kube-router docker image does not show peer information

@thoro reported this issue. I tested it out and could reproduce it.

~ # gobgp neighbor
Peer    AS  Up/Down State       |#Received  Accepted
 64512 00:30:53 Establ      |        1         1
 64512 00:31:05 Establ      |        1         1
 64512 00:30:57 Establ      |        1         1

Problem seems to be with neighbor command only. Global rib shows up fine.

~ # gobgp global rib
   Network              Next Hop             AS_PATH                  Age        Attrs
*> 100.96.0.0/24        172.20.41.250        4000 400000 300000 40001 00:30:59   [{Origin: i} {LocalPref: 100}]
*> 100.96.1.0/24        172.20.47.91         4000 400000 300000 40001 00:31:11   [{Origin: i} {LocalPref: 100}]
*> 100.96.2.0/24        172.20.61.45         4000 400000 300000 40001 00:31:03   [{Origin: i} {LocalPref: 100}]
*> 100.96.3.0/24        172.20.33.233        4000 400000 300000 40001 00:00:13   [{Origin: i}]
~ #

From local gobgp client i am able too see the information fine.

/home/kube/go/bin/gobgp neighbor -u 52.36.44.2
Peer             AS  Up/Down State       |#Received  Accepted
172.20.33.233 64512 00:19:22 Establ      |        1         1
172.20.41.250 64512 00:19:24 Establ      |        1         1
172.20.47.91  64512 00:19:32 Establ      |        1         1
kube@kube-master:~$ /home/kube/go/bin/gobgp global rib  -u 52.36.44.2
    Network             Next Hop             AS_PATH                  Age        Attrs
*>  100.96.0.0/24       172.20.41.250        4000 400000 300000 40001 00:29:36   [{Origin: i} {LocalPref: 100}]
*>  100.96.1.0/24       172.20.47.91         4000 400000 300000 40001 00:29:44   [{Origin: i} {LocalPref: 100}]
*>  100.96.2.0/24       172.20.61.45         4000 400000 300000 40001 00:00:51   [{Origin: i}]
*>  100.96.3.0/24       172.20.33.233        4000 400000 300000 40001 00:29:34   [{Origin: i} {LocalPref: 100}]

Test and document Minikube integration

Minikube is great for rapid testing of software coupled to kubernetes like kube-router. It could be used for development (#27) and CI (#28) purposes.

Test running minikube with kube-router
Document minikube/kube-router integrated quickstart

integreate with travis CI

on pre-merge checks, do travis CI build, and run tests

also through travis we can build and push docker image. Can be used for #31

For e.g. something like https://github.com/coreos/flannel/blob/master/.travis.yml

pod outbound traffic blocked in KOPS provisioned cluster on AWS

iptable masqurade rule in POSTROUTING chain of NAT table is failing to get added with error

E0608 09:36:44.979263 1 network_routes_controller.go:91] Failed to add iptable rule to masqurade outbound traffic from pods due to exit status 2: iptables v1.6.0: invalid mask 10"' specified Try iptables -h' or 'iptables --help' for more information.

KOPS uses 100.64.0.0/10 subnet by default for pod cidr, below command is failing on default OS kops uses (debian jesse)
iptables -t nat -C POSTROUTING -s "100.64.0.0/10" ! -d "100.64.0.0/10" -j MASQUERADE --wait

support for network policy GA

very likely network policy is going to be GA in 1.7 [1]

there are some semantic changes not compatible with v1beta1 so existing implementation in kube-router does not work for GA/1.7 network policies. Refactor code so that it works for both GA and v1beta1 semantics of network policies

[1] kubernetes/kubernetes#39164

Masquerade external traffic based on the 'clusterCIDR'

Use 'clusterCIDR' similar to the flag Kube-proxy has to distinguish between internal and external traffic. For the external traffic hitting the node port, we need to ensure traffic goes through the node in reverse path (when using IPVS nat mode). Pod will try to send directly to the source if we dont maqurade the traffic.

Allow multiple peer routers

For redundancy purposes having the ability to specify multiple peer-router/peer-asn pairs would be useful.

Hairpin nat does not work

Accessing cluste IP from a pod, which is end point of the service fails, when traffic gets load balanced to same pod as the pod from which request originated.

We need a way to have hair pin nat , so that this scenario works.

Support Service type LoadBalancer

In cloud environments kube-router users should be able to define a LoadBalancer service that:

Creates an external load balancer
Creates an external VIP for said load balancer
Creates a Kubernetes Service to handle traffic from said load balancer and distributes that traffic between Service Endpoint pods.

Should be tested on AWS and GKE. h/t @drobinson123

IPVS services periodic sync panics due to modprobe failure

This is stack trace, need further investigation.

panic: Running modprobe ip_vs failed with message: ``, error: fork/exec /sbin/modprobe: too many open files

goroutine 95 [running]:
panic(0x15458e0, 0xc42118d990)
/usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/cloudnativelabs/kube-router/app/controllers.ipvsAddService(0xc42118c820, 0x10, 0x10, 0x101bb0006, 0xc42118d501, 0x4, 0x0)
/home/kube/go/src/github.com/cloudnativelabs/kube-router/app/controllers/network_services_controller.go:399 +0x964
github.com/cloudnativelabs/kube-router/app/controllers.(*NetworkServicesController).syncIpvsServices(0xc4203924d0, 0xc420a7ade0, 0xc420a7aed0)
/home/kube/go/src/github.com/cloudnativelabs/kube-router/app/controllers/network_services_controller.go:205 +0x3cb
github.com/cloudnativelabs/kube-router/app/controllers.(*NetworkServicesController).sync(0xc4203924d0)
/home/kube/go/src/github.com/cloudnativelabs/kube-router/app/controllers/network_services_controller.go:126 +0xbf
github.com/cloudnativelabs/kube-router/app/controllers.(*NetworkServicesController).Run(0xc4203924d0, 0xc42033a4e0, 0xc4202e27b0)
/home/kube/go/src/github.com/cloudnativelabs/kube-router/app/controllers/network_services_controller.go:106 +0x1d2
created by github.com/cloudnativelabs/kube-router/app.(*KubeRouter).Run
/home/kube/go/src/github.com/cloudnativelabs/kube-router/app/server.go:146 +0x4ec

panic with advertise cluster IP and headless service

I'm not at my computer so I can get logs of the panic later. I believe there's an issue when kube-router tries to advertise the IP of a headless service, which has no service IP but instead sets up DNS records going directly to the POD IPs for the service. The log mentioned advertising service IP [] (empty slice I believe), and then the panic was a memory access error, probably accessing a nonexistent element of the empty slice.

Reference: https://kubernetes.io/docs/concepts/services-networking/service/#headless-services

Provide a ServiceMonitor definition to support prometheus-operator users

For #66 - Phase 1

Provide a ServiceMonitor definition to support prometheus-operator users

Create a debugging toolbox container

For people like me that use CoreOS or other minimal/immutable operating systems, there are very few tools easily available for troubleshooting and advanced configuration. A common practice for Kubernetes applications is to make available a toolbox container that comes with software and configuration ready to perform these tasks.

For kube-router the toolbox should include:

gobgp client
ipvsadm
iproute2 suite
other? open for input

Configuration:

gobgp config automatically populated to talk to kube-router nodes
gobgp CLI completion configured
other? open for input

Incorrect generation of unique service key when building service and endpoint maps

Network service controller in Kube-router need to generate a unique service key ( combination of namespace, service name, and spec.ports.name) that will be used to index service info map, and end points info map.

Current key generation is flawed, where it fails if there is mismatch in port opened by the service and port opened by endpoint.

Correct way is to use spec.ports.name, which is internally copied by the API server into endpoint API object as well. This is what Kube-proxy uses as well.

Panic when kubelet has --hostname-override not set to FQDN

Here's the log:

ubuntu@osh-sh-ci-01:~$ kubectl logs  kube-router-hkf2d -n kube-system
panic: nodes "kubernetes" not found

goroutine 1 [running]:
panic(0x1596120, 0xc42040a450)
	/usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/cloudnativelabs/kube-router/app/controllers.NewNetworkPolicyController(0xc420315540, 0xc420314960, 0x0, 0x0, 0x0)
	/home/kube/go/src/github.com/cloudnativelabs/kube-router/app/controllers/network_policy_controller.go:785 +0x413
github.com/cloudnativelabs/kube-router/app.(*KubeRouter).Run(0xc4203c1660, 0xc4203c1660, 0x0)
	/home/kube/go/src/github.com/cloudnativelabs/kube-router/app/server.go:120 +0x710
main.main()
	/home/kube/go/src/github.com/cloudnativelabs/kube-router/kube-router.go:37 +0x13c
ubuntu@osh-sh-ci-01:~$

In this case kubelet's --hostname-override is set to the node's IP address. I believe that kube-router/app/controllers/network_policy_controller.go around L785 will only succeed if --hostname-override is omitted, or is set to the real FQDN on the host. It should probably not assume anything about the node's name, and pull the node name from the API server. I'll try to find how that's done and submit a PR if time permits. Thanks!

metrics: Feature planning, umbrella issue

I got excited about metrics since #65 landed, so this issue is just for mapping out short and long-term goals for kube-router metrics. Once we agree on the the abstract goals I will create issues to track their implementation. I will edit this issue with any changes we discuss.

Phase 1 (short term goals)

Configuration

Add a --metrics-port option to change the port.
- --metrics-port=0 means disabled. Default port should be uncommon.
Provide a k8s Service definition for kube-router metrics port to our example manifests
Provide a ServiceMonitor definition to support prometheus-operator users

Phase 2 (Expanded Metrics)

Additional Metrics (Basic)

BGP basics. See: gobgp monitor
- Peering/Neighbor status changes
- RIB changes
kube-router app metrics
- error counter
- iptable rules counter

Phase 3

Additional Metrics (Advanced/Debugging)

Could use or draw inspiration from k8sconntrack

Low level conntrack
TCP/UDP
iptable stats
other /proc metrics within the scope of Nodes, Services, Endpoints and Pod networking

Features and Configuration

Create a Grafana dashboard to quickly visualize metrics
- Related to network visualization proposal in #12
Support annotations to enable metrics per-service
- This is to support adding CPU intensive metrics and debugging scenarios, to be implemented in another issue.

bump up the client-go version used by kube-router in order for supporting 1.7GA

There does not seems to be a release version of client-go that supports Kubernetes 1.7 so far. So test out the latest 3.0 beta with Kube-router and if it works with out any hiccups, vendor the latest client-go.

this is needed for #16

setup masquerade rule for traffic destined for outside of cluster and pod network

We need a masquerade rule so that pod can reach to external network. A pod can reach any pod in the cluster, cluster IP's and nodes. Any traffic not destined for any of (other pod, cluster ip, node) then masquerade traffic.

provide service visualization

If we have to track connections for #10 and #5 anyway, we very well expose http endpoint on Kube-router to visuaize dynamic state of services. Kube-router runs a daemonset so we have pod running on the node, we can expose a service as node port type that can give a light weight service visualization.

Support for external BGP peer info to be passed to kube-router

Add support for configuring a BGP peer information. Kube-router on each node will peer with provide external peer. Once peered, external peer will know how to route traffic to the pods with in the cluster.

This will enable use-case where external access to the pods are required.

Also cluster IP for the servces can be advertised, so external access to the services through the cluster IP can be achived.

Nodeport service does not get created in IPVS

Though logs indicate service is created Successfully added service: 172.20.59.228:tcp:31044, actual service is not created

ERROR: logging before flag.Parse: I0713 05:35:46.085555       1 network_services_controller.go:109] Performing periodic syn of the ipvs services and server to reflect desired state of kubernetes services and endpoints
ERROR: logging before flag.Parse: I0713 05:35:46.085621       1 network_services_controller.go:419] No hairpin-mode enabled services found -- no hairpin rules created
ERROR: logging before flag.Parse: I0713 05:35:46.088221       1 network_services_controller.go:639] Successfully added service: 100.67.46.129:tcp:80
ERROR: logging before flag.Parse: I0713 05:35:46.088733       1 network_services_controller.go:639] Successfully added service: 172.20.59.228:tcp:31044
ERROR: logging before flag.Parse: I0713 05:35:46.088933       1 network_services_controller.go:648] Successfully added destination 100.96.1.10:80 to the service 100.67.46.129:tcp:80
ERROR: logging before flag.Parse: I0713 05:35:46.089081       1 network_services_controller.go:648] Successfully added destination 100.96.1.10:80 to the service 172.20.59.228:tcp:31044
ERROR: logging before flag.Parse: I0713 05:35:46.089267       1 network_services_controller.go:648] Successfully added destination 100.96.1.8:80 to the service 100.67.46.129:tcp:80
ERROR: logging before flag.Parse: I0713 05:35:46.089410       1 network_services_controller.go:648] Successfully added destination 100.96.1.8:80 to the service 172.20.59.228:tcp:31044
ERROR: logging before flag.Parse: I0713 05:35:46.089613       1 network_services_controller.go:648] Successfully added destination 100.96.1.9:80 to the service 100.67.46.129:tcp:80
ERROR: logging before flag.Parse: I0713 05:35:46.089741       1 network_services_controller.go:648] Successfully added destination 100.96.1.9:80 to the service 172.20.59.228:tcp:31044
ERROR: logging before flag.Parse: I0713 05:35:46.090123       1 network_services_controller.go:639] Successfully added service: 100.64.0.1:tcp:443
ERROR: logging before flag.Parse: I0713 05:35:46.090316       1 network_services_controller.go:648] Successfully added destination 172.20.33.155:443 to the service 100.64.0.1:tcp:443
ERROR: logging before flag.Parse: I0713 05:35:46.090752       1 network_services_controller.go:639] Successfully added service: 100.64.0.10:udp:53
ERROR: logging before flag.Parse: I0713 05:35:46.090998       1 network_services_controller.go:648] Successfully added destination 100.96.1.2:53 to the service 100.64.0.10:udp:53
ERROR: logging before flag.Parse: I0713 05:35:46.091167       1 network_services_controller.go:648] Successfully added destination 100.96.1.4:53 to the service 100.64.0.10:udp:53
ERROR: logging before flag.Parse: I0713 05:35:46.094938       1 network_services_controller.go:639] Successfully added service: 100.64.0.10:tcp:53
ERROR: logging before flag.Parse: I0713 05:35:46.095102       1 network_services_controller.go:648] Successfully added destination 100.96.1.2:53 to the service 100.64.0.10:tcp:53
ERROR: logging before flag.Parse: I0713 05:35:46.095279       1 network_services_controller.go:648] Successfully added destination 100.96.1.4:53 to the service 100.64.0.10:tcp:53
ERROR: logging before flag.Parse: I0713 05:35:46.095638       1 network_services_controller.go:639] Successfully added service: 100.70.67.153:tcp:6379
ERROR: logging before flag.Parse: I0713 05:35:46.095781       1 network_services_controller.go:648] Successfully added destination 100.96.1.5:6379 to the service 100.70.67.153:tcp:6379
ERROR: logging before flag.Parse: I0713 05:35:46.096280       1 network_services_controller.go:639] Successfully added service: 100.66.191.116:tcp:6379
ERROR: logging before flag.Parse: I0713 05:35:46.096445       1 network_services_controller.go:648] Successfully added destination 100.96.1.6:6379 to the service 100.66.191.116:tcp:6379
ERROR: logging before flag.Parse: I0713 05:35:46.096632       1 network_services_controller.go:648] Successfully added destination 100.96.1.7:6379 to the service 100.66.191.116:tcp:6379

~ # ipvsadm
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  100.64.0.1:https rr persistent 10800 mask 0.0.0.0
  -> ip-172-20-33-155.us-west-2.c Masq    1      3          0
TCP  100.64.0.10:domain rr
  -> 100.96.1.2:domain            Masq    1      0          0
  -> 100.96.1.4:domain            Masq    1      0          0
TCP  100.66.191.116:6379 rr
  -> 100.96.1.6:6379              Masq    1      0          0
  -> 100.96.1.7:6379              Masq    1      0          0
TCP  100.67.46.129:http rr
  -> 100.96.1.8:http              Masq    1      0          0
  -> 100.96.1.9:http              Masq    1      0          0
  -> 100.96.1.10:http             Masq    1      0          0
TCP  100.70.67.153:6379 rr
  -> 100.96.1.5:6379              Masq    1      2          2
TCP  ip-172-20-59-228.us-west-2.c rr
  -> 100.96.1.8:http              Masq    1      0          0
  -> 100.96.1.9:http              Masq    1      0          0
  -> 100.96.1.10:http             Masq    1      0          0
UDP  100.64.0.10:domain rr
  -> 100.96.1.2:domain            Masq    1      0          1
  -> 100.96.1.4:domain            Masq    1      0          1

Also logs shows ERROR for INFO

sub-optimal routes on BGP global peer

Thanks @rmb938 @thoro for reporting this issue over gitter.

When nodes in cluster running iBGP advertises routes to the global peer, next hop to subnet is not the node IP corresponding to the subnet. But they all point to single node

For e.g, in 3 node cluster with IP 192.168.1.100, 192.168.1.101, 192.168.1.102 globally peering with 192.168.1.98

root@kube-master:~# gobgp neighbor -u 192.168.1.100
Peer             AS  Up/Down State       |#Received  Accepted
192.168.1.98  64513 00:00:19 Establ      |        3         0
192.168.1.101 64512 00:00:21 Establ      |        1         1
192.168.1.102 64512 00:00:25 Establ      |        1         1
root@kube-master:~# gobgp neighbor -u 192.168.1.102
Peer             AS  Up/Down State       |#Received  Accepted
192.168.1.98  64513 00:00:37 Establ      |        1         0
192.168.1.100 64512 00:00:29 Establ      |        1         1
192.168.1.101 64512 00:00:21 Establ      |        1         1
root@kube-master:~# gobgp neighbor -u 192.168.1.101
Peer             AS  Up/Down State       |#Received  Accepted
192.168.1.98  64513 00:00:37 Establ      |        2         0
192.168.1.100 64512 00:00:26 Establ      |        1         1
192.168.1.102 64512 00:00:22 Establ      |        1         1
root@kube-master:~# gobgp global rib  -u 192.168.1.100
    Network             Next Hop             AS_PATH              Age        Attrs
*>  10.1.0.0/24         192.168.1.100                             00:00:00   [{Origin: i}]
*>  10.1.1.0/24         192.168.1.101                             00:00:47   [{Origin: i} {LocalPref: 100}]
*>  10.1.2.0/24         192.168.1.102                             00:00:51   [{Origin: i} {LocalPref: 100}]
root@kube-master:~# gobgp global rib  -u 192.168.1.101
    Network             Next Hop             AS_PATH              Age        Attrs
*>  10.1.0.0/24         192.168.1.100                             00:00:49   [{Origin: i} {LocalPref: 100}]
*>  10.1.1.0/24         192.168.1.101                             00:00:03   [{Origin: i}]
*>  10.1.2.0/24         192.168.1.102                             00:00:45   [{Origin: i} {LocalPref: 100}]
root@kube-master:~# gobgp global rib  -u 192.168.1.102
    Network             Next Hop             AS_PATH              Age        Attrs
*>  10.1.0.0/24         192.168.1.100                             00:00:54   [{Origin: i} {LocalPref: 100}]
*>  10.1.1.0/24         192.168.1.101                             00:00:46   [{Origin: i} {LocalPref: 100}]
*>  10.1.2.0/24         192.168.1.102                             00:00:05   [{Origin: i}]

routes all point next hop as 192.168.1.102 on BGP global peer

root@router:~# ip route
default via 192.168.1.1 dev ens33 onlink
10.1.0.0/24 via 192.168.1.102 dev ens33  proto zebra
10.1.1.0/24 via 192.168.1.102 dev ens33  proto zebra
10.1.2.0/24 via 192.168.1.102 dev ens33  proto zebra
192.168.1.0/24 dev ens33  proto kernel  scope link  src 192.168.1.98

root@router:~# ip route
default via 192.168.1.1 dev ens33 onlink
10.1.0.0/24 via 192.168.1.102 dev ens33  proto zebra
10.1.1.0/24 via 192.168.1.101 dev ens33  proto zebra
10.1.2.0/24 via 192.168.1.102 dev ens33  proto zebra
192.168.1.0/24 dev ens33  proto kernel  scope link  src 192.168.1.98
root@router:~# ping 10.1.0.70
PING 10.1.0.70 (10.1.0.70) 56(84) bytes of data.
64 bytes from 10.1.0.70: icmp_seq=1 ttl=63 time=2.46 ms
From 192.168.1.102: icmp_seq=2 Redirect Host(New nexthop: 192.168.1.100)
64 bytes from 10.1.0.70: icmp_seq=2 ttl=63 time=0.589 ms
From 192.168.1.102: icmp_seq=3 Redirect Host(New nexthop: 192.168.1.100)
64 bytes from 10.1.0.70: icmp_seq=3 ttl=63 time=0.958 ms
root@router:~# traceroute 10.1.0.70
traceroute to 10.1.0.70 (10.1.0.70), 30 hops max, 60 byte packets
 1  192.168.1.102 (192.168.1.102)  0.307 ms  0.298 ms  0.308 ms
 2  192.168.1.100 (192.168.1.100)  0.648 ms  0.573 ms  0.537 ms
 3  10.1.0.70 (10.1.0.70)  3.238 ms  3.847 ms  3.775 ms

Test and document kubeadm integration

Kubeadm is one of the most popular and officially supported methods of deploying Kubernetes. Need to test and document it for those users.

User @rmb938 has provided initial RBAC related definitions for this environment.