nats-io / nats-operator

NATS Operator

Home Page: https://nats-io.github.io/k8s/

License: Apache License 2.0

Go 96.74% Shell 2.26% Dockerfile 0.38% Makefile 0.62%

nats kubernetes operator cluster message-queue pubsub

nats-operator's Issues

NATS Streaming support

nats-streaming is getting clustering support and so it's about time to support it.

Help with generating SSL Certificates

Are there documented steps for generating the ca.pem, route-key.pem, route.pem, server-key.pem, and server.pem files?

When is the next planned release?

Move from TPR to CRD

Read https://coreos.com/blog/custom-resource-kubernetes-v17.

Exposing the NATS server with an External IP

In my use case, subscribers can come from outside the K8s cluster, and thus I need an external IP for the messaging port. What would be the right design to accomplish this in the chart as a configurable parameter?

Helm chart: support permissions configuration

This is a note for my future self, for when PR #62 is merged. Based on this discussion, we want to be able to configure NATS like this:

{
  "users": [
    { "username": "user1", "password": "secret1" },
    { "username": "user2", "password": "secret2",
      "permissions": {
        "publish": ["hello.*"],
        "subscribe": ["hello.world"]
      }
    }
  ],
  "default_permissions": {
    "publish": ["SANDBOX.*"],
    "subscribe": ["PUBLIC.>"]
  }
}

As of now, we only support creating multiple users.
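The configuration above could be modeled in Go roughly as follows. This is a minimal sketch: the type names AuthConfig, User, and Permissions and the helpers renderAuth, parseAuth, and exampleAuth are hypothetical and are not taken from nats-operator; only the JSON field names come from the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Permissions lists the subjects a user may publish or subscribe to.
type Permissions struct {
	Publish   []string `json:"publish,omitempty"`
	Subscribe []string `json:"subscribe,omitempty"`
}

// User is one entry of the "users" array (hypothetical type).
type User struct {
	Username    string       `json:"username"`
	Password    string       `json:"password"`
	Permissions *Permissions `json:"permissions,omitempty"`
}

// AuthConfig mirrors the top-level JSON object (hypothetical type).
type AuthConfig struct {
	Users              []User       `json:"users"`
	DefaultPermissions *Permissions `json:"default_permissions,omitempty"`
}

// renderAuth serializes the configuration as indented JSON.
func renderAuth(cfg AuthConfig) (string, error) {
	b, err := json.MarshalIndent(cfg, "", "  ")
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// parseAuth is the inverse of renderAuth.
func parseAuth(s string) (AuthConfig, error) {
	var cfg AuthConfig
	err := json.Unmarshal([]byte(s), &cfg)
	return cfg, err
}

// exampleAuth builds the configuration shown in the issue.
func exampleAuth() AuthConfig {
	return AuthConfig{
		Users: []User{
			{Username: "user1", Password: "secret1"},
			{Username: "user2", Password: "secret2", Permissions: &Permissions{
				Publish:   []string{"hello.*"},
				Subscribe: []string{"hello.world"},
			}},
		},
		DefaultPermissions: &Permissions{
			Publish:   []string{"SANDBOX.*"},
			Subscribe: []string{"PUBLIC.>"},
		},
	}
}

func main() {
	out, err := renderAuth(exampleAuth())
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Using a pointer for the per-user permissions lets a user without explicit permissions fall back to default_permissions, because the field is then omitted from the output.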

Support NATS Streaming

At first let me say that this is super cool, love the concept of operators & thanks for doing this with NATS.

I was wondering whether there are any plans to support NATS Streaming. I would like to see it.

Add Cluster/Client Advertise to cluster config

Currently, when TLS is set up for clients and routes, NATS clients receive the routes as IPs via auto discovery, so hostname verification fails when they reconnect to one of these IPs. If the pods instead advertised the A record that they get in the cluster, clients would be able to fail over to an available node right away.
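As a rough illustration, the address a pod advertises could be derived from its name, its headless service, and its namespace. The sketch assumes the `<pod>.<service>.<namespace>.svc` naming pattern; the function advertiseAddr and the sample names are hypothetical, and how the value is passed to the server is left out.

```go
package main

import "fmt"

// advertiseAddr returns the DNS-based host:port a pod could advertise
// instead of its IP. The <pod>.<service>.<namespace>.svc pattern is an
// assumption about how the pods are addressed inside the cluster.
func advertiseAddr(pod, service, namespace string, port int) string {
	return fmt.Sprintf("%s.%s.%s.svc:%d", pod, service, namespace, port)
}

func main() {
	// 4222 is the client port and 6222 the route port.
	fmt.Println(advertiseAddr("nats-1", "nats-mgmt", "nats-io", 4222))
	fmt.Println(advertiseAddr("nats-1", "nats-mgmt", "nats-io", 6222))
}
```

Because the advertised value is a host name, a certificate with a matching DNS name would pass hostname verification on reconnect.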

https://github.com/nats-io/gnatsd/blob/163ba3f6a7521300a4f578eb90b9ff5d40d658c4/main.go#L24

https://github.com/nats-io/gnatsd/blob/163ba3f6a7521300a4f578eb90b9ff5d40d658c4/main.go#L51

Deploy as DaemonSet so that a Nats Client Runs on Each Node

Is there a way to deploy the operator as a DaemonSet, or to use anti-affinity, so that each node runs one or two servers and clients connect to the server running on their own machine? Thanks!

prometheus support in nats-operator

When nats-io/prometheus-nats-exporter#48 is completed, it would be great if the option to add prometheus exporter was added to nats-operator.

TPR name unclear

I'm very excited about this operator and giving it a try. In https://github.com/pires/nats-operator/blob/master/pkg/controller/controller.go#L38 the TPR name is set to management.nats.io which also seems to be the TPR that's created when I run the operator.

The example however defines the kind as NatsCluster https://github.com/pires/nats-operator/blob/master/example/example-nats-cluster.yaml

However, there is no natscluster TPR.

What am I missing?

No clusters created after CRD creation. RBAC enabled.

Hey,

I'm using a local cluster based on https://github.com/Mirantis/kubeadm-dind-cluster, running Kubernetes 1.11 with 3 nodes:

NAME          STATUS    ROLES     AGE       VERSION  
kube-master   Ready     master    30m       v1.11.0
kube-node-1   Ready     <none>    29m       v1.11.0
kube-node-2   Ready     <none>    29m       v1.11.0
kube-node-3   Ready     <none>    29m       v1.11.0

RBAC is enabled, so I executed:
kubectl apply -f https://raw.githubusercontent.com/nats-io/nats-operator/master/example/deployment-rbac.yaml
which resulted in:

$ kubectl -n nats-io logs deployment/nats-operator
time="2018-08-24T11:20:15Z" level=info msg="nats-operator Version: 0.2.3-v1alpha2+git"
time="2018-08-24T11:20:15Z" level=info msg="Git SHA: d88048a"
time="2018-08-24T11:20:15Z" level=info msg="Go Version: go1.9"
time="2018-08-24T11:20:15Z" level=info msg="Go OS/Arch: linux/amd64"
time="2018-08-24T11:20:33Z" level=info msg="Event(v1.ObjectReference{Kind:\"Endpoints\", Namespace:\"nats-io\", Name:\"nats-operator\", UID:\"fd436814-a78d-11e8-9920-e20744c33dad\", APIVersion:\"v1\", ResourceVersion:\"3701\", FieldPath:\"\"}): type: 'Normal' reason: 'LeaderElection' nats-operator-f44c5854d-ph6cg became leader"
time="2018-08-24T11:20:33Z" level=info msg="finding existing clusters..." pkg=controller
time="2018-08-24T11:20:33Z" level=info msg="starts running from watch version: 3702" pkg=controller
time="2018-08-24T11:20:33Z" level=info msg="start watching at 3702" pkg=controller
time="2018-08-24T11:22:16Z" level=info msg="apiserver closed watch stream, retrying after 5s..." pkg=controller
time="2018-08-24T11:22:21Z" level=info msg="start watching at 3702" pkg=controller
time="2018-08-24T11:23:52Z" level=info msg="apiserver closed watch stream, retrying after 5s..." pkg=controller
time="2018-08-24T11:23:57Z" level=info msg="start watching at 3702" pkg=controller

So far so good, but then:

$ kubectl apply -f https://raw.githubusercontent.com/nats-io/nats-operator/master/example/example-nats-cluster.yaml
natscluster.nats.io/example-nats-1 created

and then:

$ kubectl get natsclusters.nats.io
NAME             AGE                                                         
example-nats-1   2s

but sadly

$ kubectl -n nats-io get pods -l nats_cluster=example-nats-1
No resources found.

There are no new logs in the operator and no other pods whatsoever. I tried to check what it is actually trying to do, but it just seems to be stalled. I also recreated the deployment, with the same result.

Operator exits on disconnect from API server

We should change the operator so that it retries and reconnects instead of restarting:

nats-operator/cmd/operator/main.go

Lines 147 to 160 in 641d0b5

leaderelection.RunOrDie(context.TODO(), leaderelection.LeaderElectionConfig{
	Lock:          rl,
	LeaseDuration: 15 * time.Second,
	RenewDeadline: 10 * time.Second,
	RetryPeriod:   2 * time.Second,
	Callbacks: leaderelection.LeaderCallbacks{
		OnStartedLeading: func(ctx context.Context) {
			run(ctx, kubeCfg, kubeClient)
		},
		OnStoppedLeading: func() {
			logrus.Fatalf("leader election lost")
		},
	},
})
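One possible direction, instead of letting the process exit, is to wrap the long-running step in a retry loop with capped backoff. The sketch below uses only the standard library; runWithRetry, demo, the error text, and the delays are hypothetical and are not part of nats-operator or client-go.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runWithRetry calls run until it succeeds, the context is cancelled, or
// maxAttempts calls have failed. The wait between attempts starts at delay
// and doubles after each failure, capped at maxDelay. It returns the number
// of attempts made and the last error, if any.
func runWithRetry(ctx context.Context, maxAttempts int, delay, maxDelay time.Duration,
	run func(context.Context) error) (int, error) {
	if maxAttempts < 1 {
		return 0, errors.New("maxAttempts must be at least 1")
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = run(ctx); err == nil {
			return attempt, nil
		}
		if attempt == maxAttempts {
			break
		}
		select {
		case <-ctx.Done():
			return attempt, ctx.Err()
		case <-time.After(delay):
		}
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return maxAttempts, err
}

// demo simulates a step that fails the first `failures` times it is called.
func demo(failures, maxAttempts int) (int, bool) {
	calls := 0
	run := func(context.Context) error {
		calls++
		if calls <= failures {
			return errors.New("connection to API server lost")
		}
		return nil
	}
	attempts, err := runWithRetry(context.Background(), maxAttempts,
		time.Millisecond, 4*time.Millisecond, run)
	return attempts, err == nil
}

func main() {
	attempts, ok := demo(2, 5)
	fmt.Printf("attempts=%d ok=%v\n", attempts, ok) // attempts=3 ok=true
}
```

Whether a lost leader election should be retried in-process or handled by a pod restart is a design decision this sketch does not settle.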

Helm chart

First of all, thank you for your amazing work on Nats and particularly nats-operator.

What do you think of creating a Helm chart for nats-operator? That would be nice I think because we could do something like this:

helm install nats-operator

I have some experience creating Helm charts, but my overall understanding of NATS is limited, so I can help by creating a base chart for a minimal setup. Edge cases and more specific uses of NATS could be supported later. What do you think?

How to update server config?

Hello,

I've been trying to figure out how to set a custom configuration for the NATS Cluster, and I couldn't find anything in the docs or looking for uses of ServerConfig in the code.

Any leads or clues on how to set it?
My specific need is to find out how to increase max_payload_size.

Thanks!

Support installing the CRD from k8s manifests instead of from the operator

Instead of creating the CRDs from the code, I think it would make sense to create it from k8s manifests while installing the operator itself. This could be done for the helm chart or for all installs.

This would allow:

using the crd-install helm hook in helm charts
removing the rule in the Role definitions that, when RBAC is used, is required only to install a CRD:

- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs: ["*"]

Implementation:

support a command line flag to toggle crd install from the operator
in the helm chart, disable crd installs from the operator
in the helm chart, remove the rule for CRDs CRUD
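The first implementation step could look roughly like this. It is a sketch under assumptions: the flag name create-crd, the options type, and the functions parseOptions and initCRDs are hypothetical, and register stands in for whatever call actually creates the CRDs.

```go
package main

import (
	"flag"
	"fmt"
)

// options holds the operator settings relevant to this sketch.
type options struct {
	createCRD bool
}

// parseOptions parses the command line with a dedicated FlagSet.
// The flag name "create-crd" is an assumption, not an existing option.
func parseOptions(args []string) (options, error) {
	fs := flag.NewFlagSet("nats-operator", flag.ContinueOnError)
	createCRD := fs.Bool("create-crd", true, "register the CRDs when the operator starts")
	if err := fs.Parse(args); err != nil {
		return options{}, err
	}
	return options{createCRD: *createCRD}, nil
}

// initCRDs runs register only when CRD creation is enabled. With the flag
// set to false, the CRDs are expected to come from Kubernetes manifests.
func initCRDs(opts options, register func() error) error {
	if !opts.createCRD {
		return nil
	}
	return register()
}

func main() {
	opts, err := parseOptions([]string{"--create-crd=false"})
	if err != nil {
		panic(err)
	}
	err = initCRDs(opts, func() error {
		fmt.Println("registering CRDs")
		return nil
	})
	fmt.Printf("createCRD=%v err=%v\n", opts.createCRD, err)
}
```

Keeping the default at true would preserve the current behavior for installs that do not use the chart.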

Support customizing Logging options

Add debug/trace to the NatsCluster spec
https://github.com/nats-io/nats-operator/blob/master/pkg/conf/natsconf.go#L16-L17

Unable to recover from node crash

I was testing this locally until the node crashed (minikube). Now I get this constantly:

time="2018-12-05T16:34:51Z" level=info msg="Cluster size needs reconciling: expected 3, has 0" cluster-name=nats pkg=cluster
time="2018-12-05T16:34:51Z" level=error msg="failed to reconcile: pods \"nats-1\" already exists" cluster-name=nats pkg=cluster

The kubectl describe output for nats-1 (it is permanently terminated):

Name:           nats-1
Namespace:      nats-io
Node:           minikube/10.0.2.15
Start Time:     Wed, 05 Dec 2018 11:46:42 +0100
Labels:         app=nats
                nats_cluster=nats
                nats_version=1.3.0
Annotations:    nats.version=1.3.0
Status:         Failed
IP:             
Controlled By:  NatsCluster/nats
Containers:
  nats:
    Container ID:  docker://b0892b87b0ac1d25ff64e197a53e138b3e635f76f74589924d16eafa3a55d50e
    Image:         nats:1.3.0
    Image ID:      docker-pullable://nats@sha256:5e99caf7ca7b2e4a242e741328bde393bbd7a529a2cfdd19b84870da87ad6ca1
    Ports:         6222/TCP, 4222/TCP, 8222/TCP
    Command:
      /gnatsd
      -c
      /etc/nats-config/nats.conf
      -P
      /var/run/nats/gnatsd.pid
    State:          Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Wed, 05 Dec 2018 11:46:49 +0100
      Finished:     Wed, 05 Dec 2018 17:22:54 +0100
    Ready:          False
    Restart Count:  0
    Liveness:       http-get http://:8222/ delay=10s timeout=10s period=60s #success=1 #failure=3
    Environment:
      SVC:    nats-mgmt
      EXTRA:  --http_port=8222
    Mounts:
      /etc/nats-config from nats-config (rw)
      /var/run/nats from pid (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-pzlst (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          False 
  PodScheduled   True 
Volumes:
  nats-config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nats
    Optional:    false
  pid:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:  
  default-token-pzlst:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-pzlst
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>

Support Cluster-Wide Operator

I would love for the nats-operator to function in a similar way to Istio, where there is a management namespace and the operator can watch all namespaces (with opt-out). Our main use case for this is that we use namespaces at the environment level.

My ideal setup, showing where each resource is defined, would be as follows:

Cluster
  ClusterRole
  CRDs

ns: nats-system
  ServiceAccount
  Deployment
  ClusterRoleBinding

ns: dev
  NatsCluster (size: 1)

ns: staging
  NatsCluster (size: 3)

ns: production
  NatsCluster (size: 3)

Missing client authentication enabled check in natscluster.yaml helm template

Currently the auth section in natscluster.yaml does not check if client authentication is set to enabled in values.yaml.

It is missing the {{- if .Values.auth.enabled }} guard. With the guard, the section would read:

  {{- if .Values.auth.enabled }}
  auth:
    enableServiceAccounts: {{ .Values.auth.enableServiceAccounts }}
    clientsAuthSecret: {{ template "nats.fullname" . }}-clients-auth
    clientsAuthTimeout: 5
  {{- end }}
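Helm templates are built on Go's text/template, so the effect of the guard can be reproduced with the standard library alone. The sketch below is a reduced stand-in, not the chart's real template: the template text, the render function, and the values map are assumptions, and Helm-specific helpers such as template "nats.fullname" are left out.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// natsClusterTmpl is a reduced stand-in for natscluster.yaml. The auth
// section is emitted only when .Values.auth.enabled is true.
const natsClusterTmpl = `spec:
  size: 3
{{- if .Values.auth.enabled }}
  auth:
    clientsAuthTimeout: 5
{{- end }}
`

// render executes the template with a values tree shaped like values.yaml.
func render(authEnabled bool) (string, error) {
	tmpl, err := template.New("natscluster").Parse(natsClusterTmpl)
	if err != nil {
		return "", err
	}
	data := map[string]interface{}{
		"Values": map[string]interface{}{
			"auth": map[string]interface{}{"enabled": authEnabled},
		},
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	for _, enabled := range []bool{false, true} {
		out, err := render(enabled)
		if err != nil {
			panic(err)
		}
		fmt.Printf("auth.enabled=%v\n%s\n", enabled, out)
	}
}
```

With auth.enabled set to false, the rendered spec contains no auth section at all, which is the behavior the issue asks for.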

Failed to re-create cluster

I created a sample cluster with

kc apply -f example-nats-cluster.yaml -n nats-io

After playing with the cluster a while, I did

kc delete -f example-nats-cluster.yaml -n nats-io

and then

kc apply -f example-nats-cluster.yaml -n nats-io

again. But it failed this time, describe shows the following reason:

cluster create: fail to create shared config map: configmaps "example-nats-1" already exists

Official docker images

Would be great to provide official or nats supported docker images to use for the nats-operator and reloader.

can't lock NatsCluster to certain nodes

I am trying to run all the NATS-related pods on particular nodes. I have everything working except the NatsCluster pods:

---
apiVersion: "nats.io/v1alpha2"
kind: "NatsCluster"
metadata:
  name: "my-nats"
spec:
  size: 3
  template:
    spec:
      nodeSelector:
        my.pool: nats

I have the nodeSelector in place for NatsStreamingCluster and it works.
But for NatsCluster, the pods are placed on arbitrary nodes.

Readiness probe never passes when updating nats-operator deployment

When I make changes to the nats-operator deployment, the pod in the new ReplicaSet created by the updated deployment never passes its readiness probe until the previous nats-operator pod is manually deleted.

Example

For example, I apply the updated deployment (with a newer nats-operator image version):

kubectl apply -f ./nats-operator/ --namespace my-namespace

I observe a new pod is created:

kubectl get pods -l app=nats-operator --namespace=my-namespace
NAME                             READY   STATUS    RESTARTS   AGE
nats-operator-5fc7849b6-tlghf      1/1     Running   0          4h17m
nats-operator-68856b7bb6-ntpwj     0/1     Running   0          15m

But the pod is failing the readiness probe:

kubectl describe pod nats-operator-68856b7bb6-ntpwj --namespace=my-namespace

Name:           nats-operator-68856b7bb6-ntpwj
Namespace:      my-namespace
…
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
…
Warning  Unhealthy              3m29s (x149 over 28m)  kubelet, gke-my-cluster-default-pool-xxx  Readiness probe failed: HTTP probe failed with statuscode: 500

If I delete the old pod:

kubectl delete pod nats-operator-5fc7849b6-tlghf --namespace=my-namespace

The new one becomes Ready:

kubectl get pods -l app=nats-operator --namespace=my-namespace
NAME                             READY   STATUS    RESTARTS   AGE
nats-operator-68856b7bb6-ntpwj   1/1     Running   0          33m

Potential Solution

Change the nats-operator deployment to use the Recreate strategy (the default is RollingUpdate) so that the old pod is deleted first:
https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy

But this would mean that nats-operator might not exist for a short period. Is that actually a problem? The NATS cluster should still be in place and functional during that time.

Unable to scale or upgrade cluster

Caused by kubernetes/kubernetes#35993.

TLS Route Certificates need IP SANs when Deployed to same IP

I know this is not a NATS issue, but I wanted to point out an issue we are experiencing due to the nature of k8s deployments and the way the operator is currently configured.

Also, I'm not sure if this is related to #32

I'm using a wildcard CN *.nats-mgmt.nats-io.svc for the routes and when a cluster node is deployed with the same IP Address as another node, I receive the following error:

[1] 2018/10/12 18:46:42.988808 [ERR] 172.28.219.187:6222 - rid:4 - TLS route handshake error: x509: cannot validate certificate for 172.28.219.187 because it doesn't contain any IP SANs

I could add IP SANs to my certificate, but that seems crazy since the IP is unpredictable (and IP SANs can't use wildcards AFAIK).

Is there a strategy I'm missing? Perhaps we need an option to skip hostname verification? Or better yet, force deployment to different IPs?
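The mismatch can be reproduced with Go's crypto/x509 hostname check. This is a sketch under assumptions: the certificate is built in memory, is not signed, and carries the wildcard as a DNS SAN rather than as a CN; routeCert and matches are hypothetical helpers, and the pod host name is made up.

```go
package main

import (
	"crypto/x509"
	"fmt"
)

// routeCert returns an in-memory certificate whose only SAN is a wildcard
// DNS name. It is unsigned and only useful for hostname matching.
func routeCert() *x509.Certificate {
	return &x509.Certificate{DNSNames: []string{"*.nats-mgmt.nats-io.svc"}}
}

// matches reports whether the certificate is valid for host. IP addresses
// are compared against IP SANs only, never against DNS names.
func matches(cert *x509.Certificate, host string) bool {
	return cert.VerifyHostname(host) == nil
}

func main() {
	cert := routeCert()
	hosts := []string{"nats-1.nats-mgmt.nats-io.svc", "172.28.219.187"}
	for _, host := range hosts {
		fmt.Printf("%s valid=%v\n", host, matches(cert, host))
	}
}
```

The wildcard covers the pod's DNS name but never a bare IP, so verification can only succeed if routes are dialed or advertised by host name, or if the certificate lists the IPs explicitly.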

can't set NatsCluster pods to specific nodes

I am trying to run all the NATS-related pods on particular nodes. I have everything working except the NatsCluster pods:

---
apiVersion: "nats.io/v1alpha2"
kind: "NatsCluster"
metadata:
  name: "my-nats"
spec:
  size: 3
  template:
    spec:
      nodeSelector:
        my.pool: nats

I have the nodeSelector in place for NatsStreamingCluster and it works.
But for NatsCluster, the pods are placed on arbitrary nodes.

Deploy in namespace without using a ClusterRole

I want nats operator to be fully self-contained in a namespace, without requiring cluster level permissions. To that end I modified the yamls to use Role instead of ClusterRole. But I get this error from nats operator:

time="2018-11-30T18:40:50Z" level=error msg="initialization failed: fail to create CRD: customresourcedefinitions.apiextensions.k8s.io is forbidden: User \"system:serviceaccount:molly-dev:nats-operator\" cannot create resource \"customresourcedefinitions\" in API group \"apiextensions.k8s.io\" at the cluster scope" pkg=controller

Why does nats-operator need to create customresourcedefinitions?

I've been googling the issue, and the closest relevant thing I can find is the etcd-operator: https://github.com/coreos/etcd-operator/blob/master/doc/user/rbac.md#role-vs-clusterrole

--create-crd=false Creates a CR without first creating a CRD.
In this mode the operator can be run with just a Role without the permission to create a CRD.

Maybe nats operator needs a similar option?

Here's my full yaml file:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: nats-operator
rules:
# Allow creating CRDs
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs: ["*"]
# Allow all actions on NatsClusters
- apiGroups:
  - nats.io
  resources:
  - natsclusters
  - natsserviceroles
  verbs: ["*"]
# Allow actions on basic Kubernetes objects
- apiGroups: [""]
  resources:
  - configmaps
  - secrets
  - pods
  - services
  - serviceaccounts
  - serviceaccounts/token
  - endpoints
  - events
  verbs: ["*"]

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nats-operator

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: nats-operator-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: nats-operator
subjects:
- kind: ServiceAccount
  name: nats-operator

---
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: natsclusters.nats.io
spec:
  group: nats.io
  names:
    kind: NatsCluster
    listKind: NatsClusterList
    plural: natsclusters
    singular: natscluster
    shortNames:
    - nats
  scope: Namespaced
  version: v1alpha2

---
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: natsserviceroles.nats.io
spec:
  group: nats.io
  names:
    kind: NatsServiceRole
    listKind: NatsServiceRoleList
    plural: natsserviceroles
    singular: natsservicerole
  scope: Namespaced
  version: v1alpha2

---
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: nats-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      name: nats-operator
  template:
    metadata:
      labels:
        name: nats-operator
    spec:
      serviceAccountName: nats-operator
      containers:
      - name: nats-operator
        image: connecteverything/nats-operator:0.3.0-v1alpha2
        imagePullPolicy: Always
        ports:
        - name: readyz
          containerPort: 8080
        env:
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        readinessProbe:
          httpGet:
            path: /readyz
            port: readyz
          initialDelaySeconds: 15
          timeoutSeconds: 3

Add development documentation

AKS TLS

Has anyone gotten the nats-operator running in AKS using TLS?

Documentation used to generate .pem files for cert / server-cert / ca:
https://docs.docker.com/engine/security/https/

Nats:
apiVersion: "nats.io/v1alpha2"
kind: "NatsCluster"
metadata:
  name: "nats-example"
spec:
  size: 3

  auth:
    # Definition in JSON of the users permissions
    clientsAuthSecret: "nats-clients-auth"

  tls:
    # Certificates to secure the NATS client connections:
    serverSecret: "nats-clients-tls"

The pod starts:
[1] 2019/02/01 01:30:10.672066 [INF] Starting nats-server version 1.3.0
[1] 2019/02/01 01:30:10.672096 [INF] Git commit [eed4fbc]
[1] 2019/02/01 01:30:10.672214 [INF] Starting http monitor on 0.0.0.0:8222
[1] 2019/02/01 01:30:10.672239 [INF] Listening for client connections on 0.0.0.0:4222
[1] 2019/02/01 01:30:10.672266 [INF] TLS required for client connections
[1] 2019/02/01 01:30:10.672269 [INF] Server is ready
[1] 2019/02/01 01:30:10.672400 [INF] Listening for route connections on 0.0.0.0:6222
[1] 2019/02/01 01:30:10.678703 [INF] 10.244.1.22:6222 - rid:1 - Route connection created
[1] 2019/02/01 01:30:10.678811 [INF] 10.244.2.19:6222 - rid:2 - Route connection created
[1] 2019/02/01 01:30:10.680400 [INF] 10.244.2.19:54850 - rid:3 - Route connection created

But then I immediately start receiving these errors:
[1] 2019/02/01 01:31:28.071952 [ERR] 10.240.0.4:56227 - cid:31 - TLS handshake error: read tcp 10.244.2.19:4222->10.240.0.4:56227: i/o timeout
[1] 2019/02/01 01:31:28.072088 [ERR] 10.240.0.4:56227 - cid:31 - TLS handshake timeout

RBAC support

Currently, it is not possible to run the provided resources when RBAC is enabled.

Allow custom annotations for Prometheus metrics

I am using prometheus.app annotations to scrape metrics. Can we have an option for custom annotations?

nats-operator/pkg/util/kubernetes/kubernetes.go

Line 668 in d6d163e

// Add pod annotations for promethues metrics

Remove adhoc natsconf config generation

In the latest release of the server, v1.1.0, JSON is supported for configuring the server: nats-io/nats-server#653
This means that we can remove the current config generation and just convert to JSON:

nats-operator/pkg/conf/natsconf.go

Lines 52 to 73 in bb45bba

func Marshal(conf *ServerConfig) ([]byte, error) {
	js, err := json.MarshalIndent(conf, "", " ")
	if err != nil {
		return nil, err
	}
	if len(js) < 1 || len(js)-1 <= 1 {
		return nil, ErrInvalidConfig
	}

	// Slice the initial and final brackets from the
	// resulting JSON configuration so gnatsd config parsers
	// almost treats it as valid config.
	js = js[1:]
	js = js[:len(js)-1]

	// Replacing all commas with line breaks still keeps
	// arrays valid and makes the top level configuration
	// be able to be parsed as gnatsd config.
	result := bytes.Replace(js, []byte(","), []byte("\n"), -1)

	return result, nil
}
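If the server accepts plain JSON, the conversion could shrink to a direct json.MarshalIndent call with no brace slicing or comma replacement. The following is a sketch, not the operator's code: ServerConfig is reduced to three illustrative fields, and the nil check is an added assumption.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ErrInvalidConfig is returned when there is nothing to serialize.
var ErrInvalidConfig = errors.New("conf: invalid config")

// ServerConfig is a reduced, hypothetical version of the operator's type.
type ServerConfig struct {
	Port     int  `json:"port"`
	HTTPPort int  `json:"http_port"`
	Debug    bool `json:"debug"`
}

// Marshal returns the configuration as indented JSON, leaving the outer
// braces and the commas untouched.
func Marshal(conf *ServerConfig) ([]byte, error) {
	if conf == nil {
		return nil, ErrInvalidConfig
	}
	return json.MarshalIndent(conf, "", "  ")
}

func main() {
	out, err := Marshal(&ServerConfig{Port: 4222, HTTPPort: 8222})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Besides being shorter, this avoids rewriting commas that appear inside string values, which the blanket replacement in the current code would also turn into line breaks.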

Allow setting custom image for the NATS server

It should be possible to set a custom image for the server, for example:

apiVersion: "nats.io/v1alpha2"
kind: "NatsCluster"
metadata:
  name: "example-nats-1"
spec:
  size: 3
  # version "1.3.0"
  image: "nats:1.3.0"

error viewing pod in gui

Everything works just fine, but if I try to look at a pod from the GUI I get this:

RBAC is enabled and I've installed the RBAC YAML.

Support config reload

When certain updates are applied to the configmap, it would be convenient to trigger a reload to the NATS server so that new config is applied.

Enable customizing Authorization config

Add support to configure auth via the NatsCluster spec: https://github.com/nats-io/nats-operator/blob/master/pkg/conf/natsconf.go#L21

Add usage documentation

TLS support

Much like coreos/etcd-operator#409.

Authentication support

logs being spammed when a pod is deleted

I'm testing the resilience of an app of mine that uses NATS by removing pods from the cluster and waiting for new ones to come up. The new pods do come up and everything seems to work fine, but log lines like these are being spammed:

nats-4xwt4vd2gl nats [1] 2018/04/20 19:40:27.277348 [ERR] Error trying to connect to route: dial tcp: lookup nats-glbxsvxbb3.nats-cluster-1-mgmt.nats-io.svc on 100.64.0.10:53: no such host
nats-4xwt4vd2gl nats [1] 2018/04/20 19:40:27.360110 [ERR] Error trying to connect to route: dial tcp: lookup nats-jftp8nrf0r.nats-cluster-1-mgmt.nats-io.svc on 100.64.0.10:53: no such host
nats-xnqsk7ncm4 nats [1] 2018/04/20 19:40:27.690995 [ERR] Error trying to connect to route: dial tcp: lookup nats-jftp8nrf0r.nats-cluster-1-mgmt.nats-io.svc on 100.64.0.10:53: no such host
nats-4xwt4vd2gl nats [1] 2018/04/20 19:40:28.285597 [ERR] Error trying to connect to route: dial tcp: lookup nats-glbxsvxbb3.nats-cluster-1-mgmt.nats-io.svc on 100.64.0.10:53: no such host
nats-4xwt4vd2gl nats [1] 2018/04/20 19:40:28.364757 [ERR] Error trying to connect to route: dial tcp: lookup nats-jftp8nrf0r.nats-cluster-1-mgmt.nats-io.svc on 100.64.0.10:53: no such host
nats-xnqsk7ncm4 nats [1] 2018/04/20 19:40:28.699119 [ERR] Error trying to connect to route: dial tcp: lookup nats-jftp8nrf0r.nats-cluster-1-mgmt.nats-io.svc on 100.64.0.10:53: no such host
nats-4xwt4vd2gl nats [1] 2018/04/20 19:40:29.292426 [ERR] Error trying to connect to route: dial tcp: lookup nats-glbxsvxbb3.nats-cluster-1-mgmt.nats-io.svc on 100.64.0.10:53: no such host
nats-4xwt4vd2gl nats [1] 2018/04/20 19:40:29.376717 [ERR] Error trying to connect to route: dial tcp: lookup nats-jftp8nrf0r.nats-cluster-1-mgmt.nats-io.svc on 100.64.0.10:53: no such host

It seems that NATS will never stop trying to find the members that I've deleted.

Operator crashes constantly on Azure AKS

I am trying to deploy NATS using NATS Operator on a 3-node Kubernetes (v1.9.6) cluster on Azure, following this project's readme.md. While the NATS pods appear to be fine, the operator is in a never-ending crash/restart loop.

The k8s resources and the logs of the NATS pods are listed below. At the very end of the operator's log, an error entry and a fatal entry are the last things logged before the operator goes down.

PS C:\src\aks_test> kubectl get all -o wide
NAME                                 READY     STATUS             RESTARTS   AGE       IP           NODE
pod/example-nats-cluster-1           1/1       Running            0          24m       10.244.2.5   aks-nodepool1-32539510-1
pod/example-nats-cluster-2           1/1       Running            0          23m       10.244.1.5   aks-nodepool1-32539510-2
pod/example-nats-cluster-3           1/1       Running            0          23m       10.244.0.4   aks-nodepool1-32539510-0
pod/nats-operator-7fdf945577-jxg5s   0/1       CrashLoopBackOff   7          24m       10.244.1.4   aks-nodepool1-32539510-2

NAME                                TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE       SELECTOR
service/example-nats-cluster        ClusterIP   10.0.5.144   <none>        4222/TCP            24m       app=nats,nats_cluster=example-nats-cluster
service/example-nats-cluster-mgmt   ClusterIP   None         <none>        6222/TCP,8222/TCP   24m       app=nats,nats_cluster=example-nats-cluster,nats_version=1.1.0
service/kubernetes                  ClusterIP   10.0.0.1     <none>        443/TCP             30m       <none>

NAME                                  DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE       CONTAINERS      IMAGES                                           SELECTOR
deployment.extensions/nats-operator   1         1         1            0           24m       nats-operator   connecteverything/nats-operator:0.2.2-v1alpha2   name=nats-operator

NAME                                             DESIRED   CURRENT   READY     AGE       CONTAINERS      IMAGES                                           SELECTOR
replicaset.extensions/nats-operator-7fdf945577   1         1         0         24m       nats-operator   connecteverything/nats-operator:0.2.2-v1alpha2   name=nats-operator,pod-template-hash=3989501133

NAME                            DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE       CONTAINERS      IMAGES                                           SELECTOR
deployment.apps/nats-operator   1         1         1            0           24m       nats-operator   connecteverything/nats-operator:0.2.2-v1alpha2   name=nats-operator

NAME                                       DESIRED   CURRENT   READY     AGE       CONTAINERS      IMAGES                                           SELECTOR
replicaset.apps/nats-operator-7fdf945577   1         1         0         24m       nats-operator   connecteverything/nats-operator:0.2.2-v1alpha2   name=nats-operator,pod-template-hash=3989501133
PS C:\src\aks_test> kubectl logs example-nats-cluster-1
[1] 2018/06/24 07:57:54.986814 [INF] Starting nats-server version 1.1.0
[1] 2018/06/24 07:57:54.986911 [INF] Git commit [add6d79]
[1] 2018/06/24 07:57:54.987207 [INF] Starting http monitor on 0.0.0.0:8222
[1] 2018/06/24 07:57:54.987305 [INF] Listening for client connections on 0.0.0.0:4222
[1] 2018/06/24 07:57:54.987370 [INF] Server is ready
[1] 2018/06/24 07:57:54.987877 [INF] Listening for route connections on 0.0.0.0:6222
[1] 2018/06/24 07:58:00.347433 [INF] 10.244.1.5:52728 - rid:1 - Route connection created
[1] 2018/06/24 07:58:07.904174 [INF] 10.244.0.4:60902 - rid:2 - Route connection created
PS C:\src\aks_test> kubectl logs example-nats-cluster-2
[1] 2018/06/24 07:58:00.328032 [INF] Starting nats-server version 1.1.0
[1] 2018/06/24 07:58:00.328063 [INF] Git commit [add6d79]
[1] 2018/06/24 07:58:00.328240 [INF] Starting http monitor on 0.0.0.0:8222
[1] 2018/06/24 07:58:00.328276 [INF] Listening for client connections on 0.0.0.0:4222
[1] 2018/06/24 07:58:00.328286 [INF] Server is ready
[1] 2018/06/24 07:58:00.328526 [INF] Listening for route connections on 0.0.0.0:6222
[1] 2018/06/24 07:58:00.336456 [INF] 10.244.2.5:6222 - rid:1 - Route connection created
[1] 2018/06/24 07:58:00.373683 [ERR] Error trying to connect to route: dial tcp: lookup example-nats-cluster-2.example-nats-cluster-mgmt.default.svc on 10.0.0.10:53: no such host
[1] 2018/06/24 07:58:01.377851 [INF] 10.244.1.5:43288 - rid:2 - Route connection created
[1] 2018/06/24 07:58:01.378108 [INF] 10.244.1.5:6222 - rid:3 - Route connection created
[1] 2018/06/24 07:58:07.893824 [INF] 10.244.0.4:60642 - rid:4 - Route connection created
PS C:\src\aks_test> kubectl logs example-nats-cluster-3
[1] 2018/06/24 07:58:07.882819 [INF] Starting nats-server version 1.1.0
[1] 2018/06/24 07:58:07.882851 [INF] Git commit [add6d79]
[1] 2018/06/24 07:58:07.883013 [INF] Starting http monitor on 0.0.0.0:8222
[1] 2018/06/24 07:58:07.883045 [INF] Listening for client connections on 0.0.0.0:4222
[1] 2018/06/24 07:58:07.883055 [INF] Server is ready
[1] 2018/06/24 07:58:07.883346 [INF] Listening for route connections on 0.0.0.0:6222
[1] 2018/06/24 07:58:07.893434 [INF] 10.244.2.5:6222 - rid:1 - Route connection created
[1] 2018/06/24 07:58:07.893807 [INF] 10.244.1.5:6222 - rid:2 - Route connection created
[1] 2018/06/24 07:58:07.901949 [ERR] Error trying to connect to route: dial tcp: lookup example-nats-cluster-3.example-nats-cluster-mgmt.default.svc on 10.0.0.10:53: no such host
[1] 2018/06/24 07:58:08.906155 [INF] 10.244.0.4:6222 - rid:3 - Route connection created
[1] 2018/06/24 07:58:08.906461 [INF] 10.244.0.4:36868 - rid:4 - Route connection created
PS C:\src\aks_test> kubectl logs -f nats-operator-7fdf945577-jxg5s
time="2018-06-24T08:22:07Z" level=info msg="nats-operator Version: 0.2.2-v1alpha2+git"
time="2018-06-24T08:22:07Z" level=info msg="Git SHA: fb2847b"
time="2018-06-24T08:22:07Z" level=info msg="Go Version: go1.9"
time="2018-06-24T08:22:07Z" level=info msg="Go OS/Arch: linux/amd64"
time="2018-06-24T08:22:08Z" level=info msg="Event(v1.ObjectReference{Kind:\"Endpoints\", Namespace:\"default\", Name:\"nats-operator\", UID:\"3902fb5a-7784-11e8-b0cb-0a58ac1f0546\", APIVersion:\"v1\", ResourceVersion:\"3266\", FieldPath:\"\"}): type: 'Normal' reason: 'LeaderElection' nats-operator-7fdf945577-jxg5s became leader"
time="2018-06-24T08:22:08Z" level=info msg="finding existing clusters..." pkg=controller
time="2018-06-24T08:22:08Z" level=info msg="starts running from watch version: 3270" pkg=controller
time="2018-06-24T08:22:08Z" level=info msg="start running..." cluster-name=example-nats-cluster pkg=cluster
time="2018-06-24T08:22:08Z" level=info msg="start watching at 3270" pkg=controller
time="2018-06-24T08:23:08Z" level=error msg="received invalid event from API server: fail to decode raw event from apiserver (unexpected EOF)" pkg=controller
time="2018-06-24T08:23:08Z" level=fatal msg="controller Run() ended with failure: fail to decode raw event from apiserver (unexpected EOF)"

Full repro:

Prerequisites:

Azure Subscription (the free trial is sufficient)
Azure CLI installed (see https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest), or run the browser-based Azure Cloud Shell from portal.azure.com

$ az login
$ az provider register -n Microsoft.Network
$ az provider register -n Microsoft.Storage
$ az provider register -n Microsoft.Compute
$ az provider register -n Microsoft.ContainerService
$ az group create --name aks --location westeurope
$ az aks create --resource-group aks --name aksCluster --node-count 3 --generate-ssh-keys #Create k8s cluster with 3 nodes, this will take 10 to 15 minutes
$ az aks install-cli #Install kubectl, may not be necessary if already installed or if running Azure Cloud Shell
$ az aks get-credentials --resource-group aks --name aksCluster #Setup kubectl to connect to our Azure k8s cluster
$ kubectl apply -f https://raw.githubusercontent.com/nats-io/nats-operator/88f19bcd7da571a3004c364859ebbce3202c510e/example/deployment.yaml
$ echo '
apiVersion: "nats.io/v1alpha2"
kind: "NatsCluster"
metadata:
  name: "example-nats-cluster"
spec:
  size: 3
  version: "1.1.0"
' | kubectl apply -f -

$ az group delete --name aks --yes --no-wait #Delete everything once you don't need the cluster anymore

Move from glide to dep

See https://github.com/golang/dep

Kubernetes ServiceAccounts Integration with NATS Authorization

In the last release, support was added for authorization using a custom secret where the credentials are defined in JSON, but we might be able to simplify this by using the service accounts already present in Kubernetes.

Below is a full example of a manifest that creates 2 pods, one bound to the default service account and the other to a new service account, followed by a couple of custom ServiceRoles (a new CRD managed by the operator) that define the pub/sub permissions. A NATS client running in one of the pods would be able to use the token provided by the service account, and if the roles change, the configuration would be reloaded with the new authorization rules.

# Container using a new service account
nats-sub -s nats://demo-nats-service-account:`cat /var/run/secrets/kubernetes.io/serviceaccount/token`@demo-nats:4222 SANDBOX.hello

# Container using the default service account
nats-sub -s nats://default:`cat /var/run/secrets/kubernetes.io/serviceaccount/token`@demo-nats:4222 SANDBOX.hello

Example manifest of how this would work:

---
apiVersion: nats.io/v1alpha3
kind: NatsCluster
metadata:
  name: demo-nats
spec:
  size: 3
  version: "1.1.0"
  pod:
    enableConfigReload: true
  auth:
    enableServiceAccounts: true
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: demo-nats-service-account
---
apiVersion: nats.io/v1alpha3
kind: ServiceRole
metadata:
  name: demo-nats-role
  namespace: default

  # Specifies which NATS cluster will be mapping this account,
  # (have to create a service role with permission per cluster).
  labels:
    nats_cluster: demo-nats
spec:
  serviceAccountName: demo-nats-service-account
  permissions:
    publish: ["foo.*", "foo.bar.quux"]
    subscribe: ["foo.bar"]
---
apiVersion: nats.io/v1alpha3
kind: ServiceRole
metadata:
  name: demo-nats-default-role
  namespace: default

  # Specifies which NATS cluster will be mapping this account,
  # (have to create a service role with permission per cluster).
  labels:
    nats_cluster: demo-nats
spec:
  serviceAccountName: default
  permissions:
    publish: ["SANDBOX.>"]
    subscribe: ["SANDBOX.>"]
---
apiVersion: nats.io/v1alpha3
kind: ServiceRole
metadata:
  name: default
  namespace: default
  labels:
    nats_cluster: demo-nats
spec:
  serviceAccountName: demo-nats-service-account
  permissions:
    publish: ["foo.*"]
    subscribe: ["foo.bar"]
---
apiVersion: v1
kind: Pod
metadata:
  name: demo-nats-client-pod
spec:
  serviceAccountName: demo-nats-service-account
  restartPolicy: Never
  containers:
    - name: nats-ops
      command: ["/bin/sh"]
      image: "wallyqs/nats-ops:latest"
      tty: true
      stdin: true
      stdinOnce: true
---
apiVersion: v1
kind: Pod
metadata:
  name: demo-nats-client-pod-default-account
spec:
  # No account means using the default service token
  restartPolicy: Never
  containers:
    - name: nats-ops
      command: ["/bin/sh"]
      image: "wallyqs/nats-ops:latest"
      tty: true
      stdin: true
      stdinOnce: true

Upgrade to latest NATS 1.x.

Support for deny/allow permissions

Update to use the new allow/deny syntax built into NATS v1.3.0. We need to consider backwards compatibility with the existing array-of-permissions form as well:

authorization {
    myUserPerms = {
      publish = {
        allow = "*.*"
        deny = ["SYS.*", "bar.baz", "foo.*"]
      }
      subscribe = {
        allow = "foo.*"
        deny = "foo.baz"
      }
    }

    users = [
        {user: myUser, password: pwd, permissions: $myUserPerms}
    ]
}

https://github.com/nats-io/nats-operator/blob/master/pkg/conf/natsconf.go#L55-L60

Didn't get any resource after apply example/deployment.yaml

I'm trying the example in README.md.

I ran kubectl apply -f https://raw.githubusercontent.com/nats-io/nats-operator/master/example/deployment.yaml

But kubectl get crd returned nothing.

Here is my kubectl get all output:

NAME                   DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/nats-operator   1         1         1            1           8m

NAME                          DESIRED   CURRENT   READY     AGE
rs/nats-operator-7b89ff4879   1         1         1         8m

NAME                   DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/nats-operator   1         1         1            1           8m

NAME                          DESIRED   CURRENT   READY     AGE
rs/nats-operator-7b89ff4879   1         1         1         8m

NAME                                READY     STATUS    RESTARTS   AGE
po/nats-operator-7b89ff4879-hqm5j   1/1       Running   0          8m

NAME             TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)   AGE
svc/kubernetes   ClusterIP   10.11.240.1   <none>        443/TCP   49m

service?

How to create a service?

apiVersion: "nats.io/v1alpha2"
kind: "NatsCluster"
metadata:
  name: "cryptovalue-nats"
spec:
  size: 3
  version: "1.1.0"

Nats-Operator incompatible with istio?

When I follow the instructions in the project README to create a NATS cluster with 3 members on a GKE cluster using Istio, all three members immediately show as unhealthy and quickly go into CrashLoopBackOff. Is there something additional I need to do to get nats-operator to play nicely with a service mesh?

My Nats Cluster:

echo '
apiVersion: "nats.io/v1alpha2"
kind: "NatsCluster"
metadata:
  name: "example-nats-cluster"
spec:
  size: 3
  version: "1.3.0"
' | kubectl apply -f -

Log from one member:

[1] 2018/10/30 20:27:15.907885 [INF] Starting nats-server version 1.3.0
[1] 2018/10/30 20:27:15.907943 [INF] Git commit [eed4fbc]
[1] 2018/10/30 20:27:15.908133 [INF] Starting http monitor on 0.0.0.0:8222
[1] 2018/10/30 20:27:15.908194 [INF] Listening for client connections on 0.0.0.0:4222
[1] 2018/10/30 20:27:15.908208 [INF] Server is ready
[1] 2018/10/30 20:27:15.908541 [INF] Listening for route connections on 0.0.0.0:6222
[1] 2018/10/30 20:27:15.914868 [ERR] Error trying to connect to route: dial tcp 10.12.12.4:6222: connect: connection refused
[1] 2018/10/30 20:27:16.930604 [ERR] Error trying to connect to route: dial tcp 10.12.12.4:6222: connect: connection refused
[1] 2018/10/30 20:27:17.935214 [INF] 10.12.12.4:6222 - rid:1 - Route connection created
[1] 2018/10/30 20:27:17.940613 [INF] 127.0.0.1:41486 - rid:2 - Route connection created
[1] 2018/10/30 20:27:18.962862 [INF] 10.12.12.4:6222 - rid:3 - Route connection created

(and the Route connection messages continue 290 times before the container is shut down as unhealthy)

My Istio deployment is the default Istio app from the GCP Marketplace, with three nodes in it.
K8S version info:

Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.7", GitCommit:"0c38c362511b20a098d7cd855f1314dad92c2780", GitTreeState:"clean", BuildDate:"2018-08-20T10:09:03Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9+", GitVersion:"v1.9.7-gke.6", GitCommit:"9b635efce81582e1da13b35a7aa539c0ccb32987", GitTreeState:"clean", BuildDate:"2018-08-16T21:33:47Z", GoVersion:"go1.9.3b4", Compiler:"gc", Platform:"linux/amd64"}

istio-pilot version is 1.3

I'd be happy to add more detail if there are follow-up questions. I can also cross-post this issue to Istio if the problem appears to be on their side.

Revamp upgrade

Need to find a way to upgrade live clusters with minimum disruption, e.g. no message loss.

leaderelection.RunOrDie(context.TODO(), leaderelection.LeaderElectionConfig{
	Lock:          rl,
	LeaseDuration: 15 * time.Second,
	RenewDeadline: 10 * time.Second,
	RetryPeriod:   2 * time.Second,
	Callbacks: leaderelection.LeaderCallbacks{
		OnStartedLeading: func(ctx context.Context) {
			run(ctx, kubeCfg, kubeClient)
		},
		OnStoppedLeading: func() {
			logrus.Fatalf("leader election lost")
		},
	},
})

func Marshal(conf *ServerConfig) ([]byte, error) {
	js, err := json.MarshalIndent(conf, "", " ")
	if err != nil {
		return nil, err
	}
	if len(js) < 1 || len(js)-1 <= 1 {
		return nil, ErrInvalidConfig
	}

	// Slice the initial and final brackets from the
	// resulting JSON configuration so gnatsd config parsers
	// almost treats it as valid config.
	js = js[1:]
	js = js[:len(js)-1]

	// Replacing all commas with line breaks still keeps
	// arrays valid and makes the top level configuration
	// be able to be parsed as gnatsd config.
	result := bytes.Replace(js, []byte(","), []byte("\n"), -1)

	return result, nil
}
