
rabbitmq-peer-discovery-k8s's Introduction

RabbitMQ Peer Discovery Kubernetes

This repository has been moved to the main unified RabbitMQ "monorepo", including all open issues. You can find the source under /deps/rabbitmq_peer_discovery_k8s. All issues have been transferred.

Overview

This is an implementation of RabbitMQ peer discovery interface for Kubernetes.

This plugin only performs peer discovery, using the Kubernetes API as the source of data on running cluster pods. Please get familiar with RabbitMQ clustering fundamentals before attempting to use it.

Cluster provisioning and most of Day 2 operations such as proper monitoring are not in scope for this plugin.

For a more comprehensive open source RabbitMQ on Kubernetes deployment solution, see the RabbitMQ Cluster Operator for Kubernetes. The Operator is developed on GitHub and contains its own set of examples.

Supported RabbitMQ Versions

This plugin ships with RabbitMQ 3.7.0 or later.

Installation

This plugin ships with supported RabbitMQ versions. There is no need to install it separately.

As with any plugin, it must be enabled before it can be used. For peer discovery plugins, this means they must be enabled or preconfigured before the first node boot:

rabbitmq-plugins --offline enable rabbitmq_peer_discovery_k8s

Documentation

See RabbitMQ Cluster Formation guide for an overview of the peer discovery subsystem, general and Kubernetes-specific configurable values and troubleshooting tips.

Example deployments that use this plugin can be found in a RabbitMQ on Kubernetes examples repository. Note that they are just that, examples, and won't be optimal for every use case or cover many important production system concerns such as monitoring, persistent volume settings, access control, sizing, and so on.

Contributing

See CONTRIBUTING.md and our development process overview.

License

Licensed under the MPL, same as RabbitMQ server.

Copyright

(c) 2007-2020 VMware, Inc. or its affiliates.

rabbitmq-peer-discovery-k8s's People

Contributors

007, acogoluegnes, dcorbacho, dumbbell, ferozjilla, gerhard, gsantomaggio, kjnilsson, lukebakken, michaelklishin, rfancn, seslattery, spring-operator, st3v, wawa0210


rabbitmq-peer-discovery-k8s's Issues

Provided example uses "rabbitmqctl status" as livenessProbe and readinessProbe


The provided example uses 'rabbitmqctl status' as both livenessProbe and readinessProbe. However, a cluster node may return 0 from 'rabbitmqctl status' without having joined the cluster.

Example (the reason why this happened is beyond this issue):

$ kubectl exec -n rmq rabbitmq-1 -- rabbitmqctl status
Status of node [email protected] ...
[{pid,324},
 {running_applications,
     [{inets,"INETS  CXC 138 49","6.4.4"},
      {cowboy,"Small, fast, modern HTTP server.","2.2.2"},
      {amqp_client,"RabbitMQ AMQP Client","3.7.4"},
      {rabbit_common,
          "Modules shared by rabbitmq-server and rabbitmq-erlang-client",
          "3.7.4"},
      {ranch_proxy_protocol,"Ranch Proxy Protocol Transport","1.4.4"},
      {ranch,"Socket acceptor pool for TCP protocols.","1.4.0"},
      {ssl,"Erlang/OTP SSL application","8.2.2"},
      {public_key,"Public key infrastructure","1.5.1"},
      {asn1,"The Erlang ASN1 compiler version 5.0.3","5.0.3"},
      {jsx,"a streaming, evented json parsing toolkit","2.8.2"},
      {recon,"Diagnostic tools for production use","2.3.2"},
      {mnesia,"MNESIA  CXC 138 12","4.15.1"},
      {xmerl,"XML parser","1.3.15"},
      {cowlib,"Support library for manipulating Web protocols.","2.1.0"},
      {crypto,"CRYPTO","4.1"},
      {os_mon,"CPO  CXC 138 46","2.4.3"},
      {lager,"Erlang logging framework","3.5.1"},
      {goldrush,"Erlang event stream processor","0.1.9"},
      {compiler,"ERTS  CXC 138 10","7.1.3"},
      {syntax_tools,"Syntax tools","2.1.3"},
      {sasl,"SASL  CXC 138 11","3.1"},
      {stdlib,"ERTS  CXC 138 10","3.4.2"},
      {kernel,"ERTS  CXC 138 10","5.4"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang/OTP 20 [erts-9.1.5] [source] [64-bit] [smp:1:1] [ds:1:1:10] [async-threads:64] [hipe] [kernel-poll:true]\n"},
 {memory,
     [{connection_readers,0},
      {connection_writers,0},
      {connection_channels,0},
      {connection_other,2840},
      {queue_procs,0},
      {queue_slave_procs,0},
      {plugins,17608},
      {other_proc,21117232},
      {metrics,183144},
      {mgmt_db,0},
      {mnesia,67736},
      {other_ets,2078472},
      {binary,414056},
      {msg_index,0},
      {code,28418634},
      {atom,1123529},
      {other_system,21374669},
      {allocated_unused,20139168},
      {reserved_unallocated,0},
      {strategy,rss},
      {total,[{erlang,74797920},{rss,77967360},{allocated,94937088}]}]},
 {alarms,[]},
 {listeners,[]},
 {vm_memory_calculation_strategy,rss},
 {vm_memory_high_watermark,0.4},
 {vm_memory_limit,1554146918},
 {disk_free_limit,50000000},
 {disk_free,9454571520},
 {file_descriptors,
     [{total_limit,1048476},
      {total_used,0},
      {sockets_limit,943626},
      {sockets_used,0}]},
 {processes,[{limit,1048576},{used,150}]},
 {run_queue,0},
 {uptime,522},
 {kernel,{net_ticktime,60}}]
$ kubectl exec -n rmq rabbitmq-1 -- rabbitmqctl node_health_check
Error: this command requires the 'rabbit' app to be running on the target node. Start it with 'rabbitmqctl start_app'.
Arguments given:
        node_health_check
Usage:
rabbitmqctl [-n <node>] [-t <timeout>] [-q] node_health_check
command terminated with exit code 64

This raises the question of what the best liveness and readiness probes are.

If we consider a cluster that is already formed, one could say that both could be 'rabbitmqctl cluster_status'. But this doesn't take into account a forming cluster, where that command would prevent the initial node from starting and therefore the cluster from forming.

Between 'status', 'node_health_check' and 'cluster_status' (are there other candidates?), how should the choice be made?

For completeness here's 'node_health_check' for that failing node:

$ kubectl exec -n rmq rabbitmq-1 -- rabbitmqctl node_health_check
Error: this command requires the 'rabbit' app to be running on the target node. Start it with 'rabbitmqctl start_app'.
Arguments given:
        node_health_check
Usage:
rabbitmqctl [-n <node>] [-t <timeout>] [-q] node_health_check
command terminated with exit code 64
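
Given the probe question above, one possible middle ground, assuming a RabbitMQ version where the rabbitmq-diagnostics CLI and its ping command are available, is to keep both probes as cheap as possible and leave cluster membership out of them entirely. The commands and timings below are only an illustrative sketch, not an official recommendation:

livenessProbe:
  exec:
    # ping only checks that the node's Erlang VM responds; it does not
    # require the 'rabbit' app to be started or the node to have joined a cluster
    command: ["rabbitmq-diagnostics", "ping", "-q"]
  initialDelaySeconds: 60
  periodSeconds: 60
  timeoutSeconds: 15
readinessProbe:
  exec:
    command: ["rabbitmq-diagnostics", "ping", "-q"]
  initialDelaySeconds: 20
  periodSeconds: 60
  timeoutSeconds: 10

A stricter readiness check could be layered on once the cluster is formed, but during initial formation a strict check would keep pods from ever becoming Ready, as described above.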

Can rabbitmq-peer-discovery-k8s discover peers through the Service name and automatically create a cluster?

I noticed that the automatic creation of the cluster was done by discovering the pod IP.
rabbitmq.conf

cluster_formation.k8s.address_type = ip

StatefulSet

  • name: RABBITMQ_NODENAME
    value: "rabbit@$(MY_POD_IP)"

The IP of a pod changes after re-creation, which means the cluster node names keep changing as well.

rabbitmq-autocluster can discover services through the Service name, e.g.:

    - name: K8S_SERVICE_NAME
      value: "rabbitmq"
    - name: RABBITMQ_NODENAME
      value: "rabbit@$(MY_POD_NAME).$(K8S_SERVICE_NAME)"

pods crashed with 404 error

Hi,

I am running RabbitMQ v3.7.3 on Kubernetes 1.10, and the pods keep crashing with the following error message in the log. I am sure I applied the RBAC rule; there is no 403 error.

2018-04-22 17:37:16.745 [info] <0.195.0> Memory high watermark set to 77308 MiB (81064201420 bytes) of 193272 MiB (202660503552 bytes) total
2018-04-22 17:37:16.753 [info] <0.197.0> Enabling free disk space monitoring
2018-04-22 17:37:16.753 [info] <0.197.0> Disk free limit set to 50MB
2018-04-22 17:37:16.761 [info] <0.199.0> Limiting to approx 65436 file handles (58890 sockets)
2018-04-22 17:37:16.762 [info] <0.200.0> FHC read buffering: OFF
2018-04-22 17:37:16.762 [info] <0.200.0> FHC write buffering: ON
2018-04-22 17:37:16.765 [info] <0.187.0> Node database directory at /mnt/data/rabbitmq/wf2-rabbitmq-2/[email protected] is empty. Assuming we need to join an existing cluster or initialise from scratch...
2018-04-22 17:37:16.765 [info] <0.187.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2018-04-22 17:37:16.765 [info] <0.187.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2018-04-22 17:37:16.765 [info] <0.187.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2018-04-22 17:37:16.803 [info] <0.187.0> Failed to get nodes from k8s - 404
2018-04-22 17:37:16.803 [error] <0.186.0> CRASH REPORT Process <0.186.0> with 0 neighbours exited with reason: no case clause matching {error,"404"} in rabbit_mnesia:init_from_config/0 line 163 in application_master:init/4 line 134
2018-04-22 17:37:16.804 [info] <0.31.0> Application rabbit exited with reason: no case clause matching {error,"404"} in rabbit_mnesia:init_from_config/0 line 163
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,"404"}},[{rabbit_mnesia,init_from_config,0,[{file,"src/rabbit_mnesia.erl"},{line,163}]},{rabbit_mnesia,init_with_lock,3,[{file,"src/rabbit_mnesia.erl"},{line,143}]},{rabbit_mnesia,init,0,[{file,"src/rabbit_mnesia.erl"},{line,111}]},{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,[{file,"src/rabbit_boot_steps.erl"},{line,49}]},{rabbit_boot_steps,run_step,2,[{file,"src/rabbit_boot_steps.erl"},{line,49}]},{rabbit_boot_steps,'-run_boot_steps/1-lc$^0/1-0-',1,[{file,"src/rabbit_boot_steps.erl"},{line,26}]},{rabbit_boot_steps,run_boot_steps,1,[{file,"src/rabbit_boot_steps.erl"},{line,26}]},{rabbit,start,2,[{file,"src/rabbit.erl"},{line,792}]}]}}}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,"404"}},[{rabbit_mnesia,init_from_config,0,[{file
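
The 404 comes from the endpoints lookup: the plugin requests /api/v1/namespaces/&lt;namespace&gt;/endpoints/&lt;service_name&gt;, and a 404 means no Endpoints object with that name exists in that namespace (a permissions problem would be a 403 instead). A sketch of the settings worth double-checking; the values are placeholders:

# must match the name of the (headless) Service that selects the RabbitMQ pods
cluster_formation.k8s.service_name = rabbitmq
# the namespace is normally read from the service account mount
cluster_formation.k8s.namespace_path = /var/run/secrets/kubernetes.io/serviceaccount/namespace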

Kubernetes peer discovery using hostname/nodename

Hi,
Can you provide an example using the hostname rather than the IP address?
In a fluid environment like Kubernetes, where pods can be killed and spun up any number of times with different IP addresses, using IPs is inherently problematic.
This is especially true when you need to back up and restore from persistent volumes and end up with a bunch of folders in the mnesia directory that all carry old, stale IP addresses.
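
A sketch of the StatefulSet environment for hostname-based node names, mirroring the pattern quoted in the previous issue; the Service name, namespace and cluster domain are placeholders:

env:
  - name: MY_POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: K8S_SERVICE_NAME
    value: "rabbitmq"   # placeholder: the headless Service name
  - name: RABBITMQ_USE_LONGNAME
    value: "true"
  - name: RABBITMQ_NODENAME
    value: "rabbit@$(MY_POD_NAME).$(K8S_SERVICE_NAME).default.svc.cluster.local"

With a node name like rabbit@rabbitmq-0.rabbitmq.default.svc.cluster.local, the mnesia directory no longer depends on the pod IP, so it survives pod re-creation when backed by a persistent volume.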

RabbitMQ Pods are restarting on kubernetes

I have deployed RabbitMQ on Kubernetes 1.14 using the HA Helm chart. After a few hours, the pods restart for no apparent reason. The issue is exactly the same as the one mentioned by another user: #43

I am having the same issues as the other person in that thread.

Still need to manually stop, join_cluster and start every node?

I deployed the sample YAML files but they don't work as expected. I still need to run a batch job on the cluster to manually stop, join_cluster and start every pod. In addition, is there a way to configure vhosts, users and credentials using the rabbitmq.conf file? Otherwise I have to do it in a Kubernetes batch job after all the pods have booted up.
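
Apart from the default user and vhost, vhosts, users, permissions and policies are usually imported from a JSON definitions file by the management plugin at boot. A minimal sketch, assuming the file is mounted at /etc/rabbitmq/definitions.json:

rabbitmq.conf:

management.load_definitions = /etc/rabbitmq/definitions.json

definitions.json:

{
  "vhosts": [{"name": "/"}],
  "users": [{"name": "guest", "password": "guest", "tags": "administrator"}],
  "permissions": [
    {"user": "guest", "vhost": "/", "configure": ".*", "write": ".*", "read": ".*"}
  ]
}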

add persistence to the statefulset

Please add storage, i.e. a PVC, to the RabbitMQ StatefulSet. Without any storage specified, messages will not persist.
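
A sketch of how persistence might be added to the example StatefulSet: a volumeClaimTemplate providing one PVC per pod, mounted at the node database directory. Storage class and size are placeholders:

  volumeClaimTemplates:
    - metadata:
        name: rabbitmq-data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: standard   # placeholder
        resources:
          requests:
            storage: 10Gi

and in the container spec:

        volumeMounts:
          - name: rabbitmq-data
            mountPath: /var/lib/rabbitmq/mnesia

Note that persistence only helps if the node name, and therefore the mnesia directory name, stays stable across pod restarts.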

containers not initializing when using hostname instead of IP

I'm not sure how others are doing it, but all the examples I've seen use rabbit@$(MY_POD_IP) for RABBITMQ_NODENAME. The issue I came across is that if all the pods are rebooted at the same time, all data is lost as well, because the database dir ends up being /var/lib/rabbitmq/mnesia/[email protected] and the IP address changes every time the pod is restarted.

So I strayed from the examples and used rabbit@$(MY_POD_NAME) instead in which case the database dir ends up as /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-0 which is awesome.

I changed the configmap to:

cluster_formation.k8s.hostname_suffix = rabbitmq.default.svc.cluster.local
cluster_formation.k8s.address_type = hostname

But now the containers don't start fully, no cluster is created, and the last error in the log is:

[warning] <0.32.0> lager_error_logger_h dropped 28 messages in the last second that exceeded the limit of 100 messages/sec

Which doesn't help a whole lot.

How is anyone getting data persistence when using this peer discovery plugin?
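
One way to get more detail out of a node that stops logging before cluster formation completes, sketched with configuration keys that also appear in other examples on this page, is to raise the console log level; this gives more context around the peer discovery steps even if the rate-limited messages themselves are lost:

log.console = true
log.console.level = debug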

Kubernetes API requests in a pure IPv6 environment fail with an "nxdomain"

Hi,
I have a pure IPv6 Kubernetes cluster, and I want to install the RabbitMQ Helm chart.
I followed the instructions in https://www.rabbitmq.com/networking.html#distribution-ipv6
My parameter(in helm chart):

   environment: |-
      RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+A 128 -kernel inetrc '/etc/rabbitmq/erl_inetrc'  -proto_dist inet6_tcp"
      RABBITMQ_CTL_ERL_ARGS="-proto_dist inet6_tcp "
  erl_inetrc: |-
    {inet6, true}.

The erl_inetrc file was created under /etc/rabbitmq, and I found this error in the log:

2019-10-15 07:33:55.000 [info] <0.238.0> Peer discovery backend does not support locking, falling back to randomized delay
2019-10-15 07:33:55.000 [info] <0.238.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized start
up delay.
2019-10-15 07:33:55.000 [debug] <0.238.0> GET https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/tazou/endpoints/zt4-crmq
2019-10-15 07:33:55.015 [debug] <0.238.0> Response: {error,{failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},{inet,[inet]
,nxdomain}]}}
2019-10-15 07:33:55.015 [debug] <0.238.0> HTTP Error {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},{inet,[inet],nxdom
ain}]}
2019-10-15 07:33:55.015 [info] <0.238.0> Failed to get nodes from k8s - {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}}
,
                 {inet,[inet],nxdomain}]}
2019-10-15 07:33:55.016 [error] <0.237.0> CRASH REPORT Process <0.237.0> with 0 neighbours exited with reason: no case clause matching {error,"{fa
iled_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n                 {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from
_config/0 line 167 in application_master:init/4 line 138
2019-10-15 07:33:55.016 [info] <0.43.0> Application rabbit exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kub
ernetes.default.svc.cluster.local\",443}},\n                 {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 167

inet can resolve the name to an IPv6 address:

[root]# kubectl exec -ti zt5-crmq-0 rabbitmqctl eval 'inet:gethostbyname("kubernetes.default.svc.cluster.local", inet6).'
{ok,{hostent,"kubernetes.default.svc.cluster.local",[],inet6,16,
             [{64769,43981,0,0,0,0,0,1}]}}
[root]#  kubectl exec -ti zt5-crmq-0 rabbitmqctl eval 'inet_res:resolve("kubernetes.default.svc.cluster.local", in, aaaa).'
{ok,{dns_rec,{dns_header,1,true,query,true,false,true,true,false,0},
             [{dns_query,"kubernetes.default.svc.cluster.local",aaaa,in}],
             [{dns_rr,"kubernetes.default.svc.cluster.local",aaaa,in,0,5,
                      {64769,43981,0,0,0,0,0,1},
                      undefined,[],false}],
             [],[]}}

nslookup returns an IPv6 address when type=aaaa and an error when type=a.

I don't know why httpc:request returns nxdomain.
Is this a bug or a configuration issue?

B.R,
Tao

Dead RabbitMQ pod stays as "node not connected" and loses its persistent messages

Hello, I've created a 5-node RabbitMQ cluster, created a persistent queue, and sent it a persistent message.
I then killed a node and another pod was created (of course, with a different IP), but the old node (which is dead) is shown as not connected, and I guess I cannot recover its data.

I haven't seen in the documentation what to do in those cases. Please assist?
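
A stale node entry left behind by a dead pod can be removed from the cluster manually; a sketch, where the old node name is a placeholder for whatever name the dead pod used (for example its old IP-based name):

# run from any running cluster member
rabbitmqctl forget_cluster_node rabbit@<old-node-name>

Alternatively, setting cluster_formation.node_cleanup.only_log_warning = false lets the peer discovery cleanup remove unknown nodes automatically, with the caveats mentioned in the cluster formation guide. Messages that lived only in non-mirrored queues on the dead node are generally not recoverable this way.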

Peers discovered but filtered out as non-eligible: k8s endpoint listing returned nodes not yet ready

I'm trying to build a RabbitMQ cluster with 2 nodes using rabbitmq-peer-discovery-k8s, but both RabbitMQ nodes keep running alone.

rabbitmq-0's log

2019-09-29 09:47:22.685 [info] <0.8.0> Feature flags: list of feature flags found:
2019-09-29 09:47:22.686 [info] <0.8.0> Feature flags: feature flag states written to disk: yes
2019-09-29 09:47:22.742 [info] <0.234.0> 
 Starting RabbitMQ 3.7.18 on Erlang 22.1
 Copyright (C) 2007-2019 Pivotal Software, Inc.
 Licensed under the MPL.  See https://www.rabbitmq.com/

  ##  ##
  ##  ##      RabbitMQ 3.7.18. Copyright (C) 2007-2019 Pivotal Software, Inc.
  ##########  Licensed under the MPL.  See https://www.rabbitmq.com/
  ######  ##
  ##########  Logs: <stdout>

              Starting broker...
2019-09-29 09:47:22.743 [info] <0.234.0> 
 node           : rabbit@rabbitmq-0
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.conf
 cookie hash    : XhdCf8zpVJeJ0EHyaxszPg==
 log(s)         : <stdout>
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-0
2019-09-29 09:47:22.764 [info] <0.234.0> Running boot step pre_boot defined by app rabbit
2019-09-29 09:47:22.764 [info] <0.234.0> Running boot step rabbit_core_metrics defined by app rabbit
2019-09-29 09:47:22.764 [info] <0.234.0> Running boot step rabbit_alarm defined by app rabbit
2019-09-29 09:47:22.776 [info] <0.240.0> Memory high watermark set to 1907 MiB (2000000000 bytes) of 3790 MiB (3974164480 bytes) total
2019-09-29 09:47:22.804 [info] <0.242.0> Enabling free disk space monitoring
2019-09-29 09:47:22.804 [info] <0.242.0> Disk free limit set to 4000MB
2019-09-29 09:47:22.809 [info] <0.234.0> Running boot step code_server_cache defined by app rabbit
2019-09-29 09:47:22.809 [info] <0.234.0> Running boot step file_handle_cache defined by app rabbit
2019-09-29 09:47:22.809 [info] <0.245.0> Limiting to approx 65436 file handles (58890 sockets)
2019-09-29 09:47:22.810 [info] <0.246.0> FHC read buffering:  OFF
2019-09-29 09:47:22.810 [info] <0.246.0> FHC write buffering: ON
2019-09-29 09:47:22.812 [info] <0.234.0> Running boot step worker_pool defined by app rabbit
2019-09-29 09:47:22.812 [info] <0.235.0> Will use 2 processes for default worker pool
2019-09-29 09:47:22.812 [info] <0.235.0> Starting worker pool 'worker_pool' with 2 processes in it
2019-09-29 09:47:22.812 [info] <0.234.0> Running boot step database defined by app rabbit
2019-09-29 09:47:22.813 [info] <0.234.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-0 is empty. Assuming we need to join an existing cluster or initialise from scratch...
2019-09-29 09:47:22.813 [info] <0.234.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2019-09-29 09:47:22.813 [info] <0.234.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2019-09-29 09:47:22.813 [info] <0.234.0> Peer discovery backend does not support locking, falling back to randomized delay
2019-09-29 09:47:22.813 [info] <0.234.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2019-09-29 09:47:22.846 [info] <0.234.0> k8s endpoint listing returned nodes not yet ready: rabbitmq-0
2019-09-29 09:47:22.846 [info] <0.234.0> All discovered existing cluster peers: 
2019-09-29 09:47:22.846 [info] <0.234.0> Discovered no peer nodes to cluster with
2019-09-29 09:47:22.850 [info] <0.43.0> Application mnesia exited with reason: stopped
2019-09-29 09:47:23.063 [info] <0.234.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2019-09-29 09:47:23.104 [info] <0.234.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2019-09-29 09:47:23.154 [info] <0.234.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2019-09-29 09:47:23.154 [info] <0.234.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping registration.
2019-09-29 09:47:23.154 [info] <0.234.0> Running boot step database_sync defined by app rabbit
2019-09-29 09:47:23.155 [info] <0.234.0> Running boot step feature_flags defined by app rabbit
2019-09-29 09:47:23.155 [info] <0.234.0> Running boot step codec_correctness_check defined by app rabbit
2019-09-29 09:47:23.155 [info] <0.234.0> Running boot step external_infrastructure defined by app rabbit
2019-09-29 09:47:23.155 [info] <0.234.0> Running boot step rabbit_registry defined by app rabbit
2019-09-29 09:47:23.156 [info] <0.234.0> Running boot step rabbit_auth_mechanism_cr_demo defined by app rabbit
2019-09-29 09:47:23.156 [info] <0.234.0> Running boot step rabbit_queue_location_random defined by app rabbit
2019-09-29 09:47:23.156 [info] <0.234.0> Running boot step rabbit_event defined by app rabbit
2019-09-29 09:47:23.156 [info] <0.234.0> Running boot step rabbit_auth_mechanism_amqplain defined by app rabbit
2019-09-29 09:47:23.156 [info] <0.234.0> Running boot step rabbit_auth_mechanism_plain defined by app rabbit
2019-09-29 09:47:23.157 [info] <0.234.0> Running boot step rabbit_exchange_type_direct defined by app rabbit
2019-09-29 09:47:23.157 [info] <0.234.0> Running boot step rabbit_exchange_type_fanout defined by app rabbit
2019-09-29 09:47:23.157 [info] <0.234.0> Running boot step rabbit_exchange_type_headers defined by app rabbit
2019-09-29 09:47:23.157 [info] <0.234.0> Running boot step rabbit_exchange_type_topic defined by app rabbit
2019-09-29 09:47:23.158 [info] <0.234.0> Running boot step rabbit_mirror_queue_mode_all defined by app rabbit
2019-09-29 09:47:23.158 [info] <0.234.0> Running boot step rabbit_mirror_queue_mode_exactly defined by app rabbit
2019-09-29 09:47:23.158 [info] <0.234.0> Running boot step rabbit_mirror_queue_mode_nodes defined by app rabbit
2019-09-29 09:47:23.158 [info] <0.234.0> Running boot step rabbit_priority_queue defined by app rabbit
2019-09-29 09:47:23.158 [info] <0.234.0> Priority queues enabled, real BQ is rabbit_variable_queue
2019-09-29 09:47:23.158 [info] <0.234.0> Running boot step rabbit_queue_location_client_local defined by app rabbit
2019-09-29 09:47:23.158 [info] <0.234.0> Running boot step rabbit_queue_location_min_masters defined by app rabbit
2019-09-29 09:47:23.159 [info] <0.234.0> Running boot step kernel_ready defined by app rabbit
2019-09-29 09:47:23.159 [info] <0.234.0> Running boot step rabbit_sysmon_minder defined by app rabbit
2019-09-29 09:47:23.159 [info] <0.234.0> Running boot step rabbit_epmd_monitor defined by app rabbit
2019-09-29 09:47:23.161 [info] <0.234.0> Running boot step guid_generator defined by app rabbit
2019-09-29 09:47:23.166 [info] <0.234.0> Running boot step rabbit_node_monitor defined by app rabbit
2019-09-29 09:47:23.167 [info] <0.421.0> Starting rabbit_node_monitor
2019-09-29 09:47:23.167 [info] <0.234.0> Running boot step delegate_sup defined by app rabbit
2019-09-29 09:47:23.168 [info] <0.234.0> Running boot step rabbit_memory_monitor defined by app rabbit
2019-09-29 09:47:23.168 [info] <0.234.0> Running boot step core_initialized defined by app rabbit
2019-09-29 09:47:23.168 [info] <0.234.0> Running boot step upgrade_queues defined by app rabbit
2019-09-29 09:47:23.205 [info] <0.234.0> message_store upgrades: 1 to apply
2019-09-29 09:47:23.205 [info] <0.234.0> message_store upgrades: Applying rabbit_variable_queue:move_messages_to_vhost_store
2019-09-29 09:47:23.205 [info] <0.234.0> message_store upgrades: No durable queues found. Skipping message store migration
2019-09-29 09:47:23.205 [info] <0.234.0> message_store upgrades: Removing the old message store data
2019-09-29 09:47:23.206 [info] <0.234.0> message_store upgrades: All upgrades applied successfully
2019-09-29 09:47:23.245 [info] <0.234.0> Running boot step rabbit_connection_tracking defined by app rabbit
2019-09-29 09:47:23.245 [info] <0.234.0> Running boot step rabbit_connection_tracking_handler defined by app rabbit
2019-09-29 09:47:23.245 [info] <0.234.0> Running boot step rabbit_exchange_parameters defined by app rabbit
2019-09-29 09:47:23.245 [info] <0.234.0> Running boot step rabbit_mirror_queue_misc defined by app rabbit
2019-09-29 09:47:23.246 [info] <0.234.0> Running boot step rabbit_policies defined by app rabbit
2019-09-29 09:47:23.247 [info] <0.234.0> Running boot step rabbit_policy defined by app rabbit
2019-09-29 09:47:23.247 [info] <0.234.0> Running boot step rabbit_queue_location_validator defined by app rabbit
2019-09-29 09:47:23.247 [info] <0.234.0> Running boot step rabbit_vhost_limit defined by app rabbit
2019-09-29 09:47:23.247 [info] <0.234.0> Running boot step rabbit_mgmt_reset_handler defined by app rabbitmq_management
2019-09-29 09:47:23.247 [info] <0.234.0> Running boot step rabbit_mgmt_db_handler defined by app rabbitmq_management_agent
2019-09-29 09:47:23.247 [info] <0.234.0> Management plugin: using rates mode 'basic'
2019-09-29 09:47:23.248 [info] <0.234.0> Running boot step recovery defined by app rabbit
2019-09-29 09:47:23.249 [info] <0.234.0> Running boot step load_definitions defined by app rabbitmq_management
2019-09-29 09:47:23.249 [info] <0.234.0> Running boot step empty_db_check defined by app rabbit
2019-09-29 09:47:23.249 [info] <0.234.0> Adding vhost '/'
2019-09-29 09:47:23.294 [info] <0.462.0> Making sure data directory '/var/lib/rabbitmq/mnesia/rabbit@rabbitmq-0/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L' for vhost '/' exists
2019-09-29 09:47:23.301 [info] <0.462.0> Starting message stores for vhost '/'
2019-09-29 09:47:23.302 [info] <0.466.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_transient": using rabbit_msg_store_ets_index to provide index
2019-09-29 09:47:23.304 [info] <0.462.0> Started message store of type transient for vhost '/'
2019-09-29 09:47:23.304 [info] <0.469.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": using rabbit_msg_store_ets_index to provide index
2019-09-29 09:47:23.305 [warning] <0.469.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": rebuilding indices from scratch
2019-09-29 09:47:23.306 [info] <0.462.0> Started message store of type persistent for vhost '/'
2019-09-29 09:47:23.308 [info] <0.234.0> Creating user 'guest'
2019-09-29 09:47:23.313 [info] <0.234.0> Setting user tags for user 'guest' to [administrator]
2019-09-29 09:47:23.317 [info] <0.234.0> Setting permissions for 'guest' in '/' to '.*', '.*', '.*'
2019-09-29 09:47:23.322 [info] <0.234.0> Running boot step rabbit_looking_glass defined by app rabbit
2019-09-29 09:47:23.322 [info] <0.234.0> Running boot step rabbit_core_metrics_gc defined by app rabbit
2019-09-29 09:47:23.322 [info] <0.234.0> Running boot step background_gc defined by app rabbit
2019-09-29 09:47:23.323 [info] <0.234.0> Running boot step connection_tracking defined by app rabbit
2019-09-29 09:47:23.331 [info] <0.234.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@rabbitmq-0'
2019-09-29 09:47:23.338 [info] <0.234.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@rabbitmq-0'
2019-09-29 09:47:23.338 [info] <0.234.0> Running boot step routing_ready defined by app rabbit
2019-09-29 09:47:23.338 [info] <0.234.0> Running boot step pre_flight defined by app rabbit
2019-09-29 09:47:23.338 [info] <0.234.0> Running boot step notify_cluster defined by app rabbit
2019-09-29 09:47:23.338 [info] <0.234.0> Running boot step networking defined by app rabbit
2019-09-29 09:47:23.341 [info] <0.515.0> started TCP listener on [::]:5672
2019-09-29 09:47:23.342 [info] <0.234.0> Running boot step direct_client defined by app rabbit
2019-09-29 09:47:23.342 [info] <0.521.0> Peer discovery: enabling node cleanup (will only log warnings). Check interval: 30 seconds.
2019-09-29 09:47:23.386 [info] <0.575.0> Management plugin: HTTP (non-TLS) listener started on port 15672
2019-09-29 09:47:23.386 [info] <0.681.0> Statistics database started.
2019-09-29 09:47:23.386 [info] <0.680.0> Starting worker pool 'management_worker_pool' with 3 processes in it
2019-09-29 09:47:23.602 [info] <0.8.0> Server startup complete; 5 plugins started.
 * rabbitmq_management
 * rabbitmq_management_agent
 * rabbitmq_web_dispatch
 * rabbitmq_peer_discovery_k8s
 * rabbitmq_peer_discovery_common
 completed with 5 plugins.

rabbitmq-1's log

2019-09-29 09:48:26.925 [info] <0.8.0> Feature flags: list of feature flags found:
2019-09-29 09:48:26.925 [info] <0.8.0> Feature flags: feature flag states written to disk: yes
2019-09-29 09:48:26.974 [info] <0.234.0> 
 Starting RabbitMQ 3.7.18 on Erlang 22.1
 Copyright (C) 2007-2019 Pivotal Software, Inc.
 Licensed under the MPL.  See https://www.rabbitmq.com/

  ##  ##
  ##  ##      RabbitMQ 3.7.18. Copyright (C) 2007-2019 Pivotal Software, Inc.
  ##########  Licensed under the MPL.  See https://www.rabbitmq.com/
  ######  ##
  ##########  Logs: <stdout>

              Starting broker...
2019-09-29 09:48:26.975 [info] <0.234.0> 
 node           : rabbit@rabbitmq-1
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.conf
 cookie hash    : XhdCf8zpVJeJ0EHyaxszPg==
 log(s)         : <stdout>
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-1
2019-09-29 09:48:27.000 [info] <0.234.0> Running boot step pre_boot defined by app rabbit
2019-09-29 09:48:27.000 [info] <0.234.0> Running boot step rabbit_core_metrics defined by app rabbit
2019-09-29 09:48:27.001 [info] <0.234.0> Running boot step rabbit_alarm defined by app rabbit
2019-09-29 09:48:27.008 [info] <0.240.0> Memory high watermark set to 1907 MiB (2000000000 bytes) of 3790 MiB (3974975488 bytes) total
2019-09-29 09:48:27.015 [info] <0.242.0> Enabling free disk space monitoring
2019-09-29 09:48:27.015 [info] <0.242.0> Disk free limit set to 4000MB
2019-09-29 09:48:27.019 [info] <0.234.0> Running boot step code_server_cache defined by app rabbit
2019-09-29 09:48:27.019 [info] <0.234.0> Running boot step file_handle_cache defined by app rabbit
2019-09-29 09:48:27.020 [info] <0.245.0> Limiting to approx 65436 file handles (58890 sockets)
2019-09-29 09:48:27.020 [info] <0.246.0> FHC read buffering:  OFF
2019-09-29 09:48:27.020 [info] <0.246.0> FHC write buffering: ON
2019-09-29 09:48:27.020 [info] <0.234.0> Running boot step worker_pool defined by app rabbit
2019-09-29 09:48:27.021 [info] <0.235.0> Will use 2 processes for default worker pool
2019-09-29 09:48:27.021 [info] <0.235.0> Starting worker pool 'worker_pool' with 2 processes in it
2019-09-29 09:48:27.021 [info] <0.234.0> Running boot step database defined by app rabbit
2019-09-29 09:48:27.021 [info] <0.234.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-1 is empty. Assuming we need to join an existing cluster or initialise from scratch...
2019-09-29 09:48:27.021 [info] <0.234.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2019-09-29 09:48:27.022 [info] <0.234.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2019-09-29 09:48:27.022 [info] <0.234.0> Peer discovery backend does not support locking, falling back to randomized delay
2019-09-29 09:48:27.022 [info] <0.234.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2019-09-29 09:48:27.051 [info] <0.234.0> k8s endpoint listing returned nodes not yet ready: rabbitmq-1
2019-09-29 09:48:27.052 [info] <0.234.0> All discovered existing cluster peers: rabbit@rabbitmq-0
2019-09-29 09:48:27.052 [info] <0.234.0> Peer nodes we can cluster with: rabbit@rabbitmq-0
2019-09-29 09:48:33.069 [warning] <0.234.0> Could not auto-cluster with node rabbit@rabbitmq-0: {badrpc,nodedown}
2019-09-29 09:48:33.069 [warning] <0.234.0> Could not successfully contact any node of: rabbit@rabbitmq-0 (as in Erlang distribution). Starting as a blank standalone node...
2019-09-29 09:48:33.077 [info] <0.43.0> Application mnesia exited with reason: stopped
2019-09-29 09:48:33.206 [info] <0.234.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2019-09-29 09:48:33.255 [info] <0.234.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2019-09-29 09:48:33.303 [info] <0.234.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2019-09-29 09:48:33.304 [info] <0.234.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping registration.
2019-09-29 09:48:33.304 [info] <0.234.0> Running boot step database_sync defined by app rabbit
2019-09-29 09:48:33.304 [info] <0.234.0> Running boot step feature_flags defined by app rabbit
2019-09-29 09:48:33.304 [info] <0.234.0> Running boot step codec_correctness_check defined by app rabbit
2019-09-29 09:48:33.304 [info] <0.234.0> Running boot step external_infrastructure defined by app rabbit
2019-09-29 09:48:33.304 [info] <0.234.0> Running boot step rabbit_registry defined by app rabbit
2019-09-29 09:48:33.305 [info] <0.234.0> Running boot step rabbit_auth_mechanism_cr_demo defined by app rabbit
2019-09-29 09:48:33.305 [info] <0.234.0> Running boot step rabbit_queue_location_random defined by app rabbit
2019-09-29 09:48:33.305 [info] <0.234.0> Running boot step rabbit_event defined by app rabbit
2019-09-29 09:48:33.305 [info] <0.234.0> Running boot step rabbit_auth_mechanism_amqplain defined by app rabbit
2019-09-29 09:48:33.305 [info] <0.234.0> Running boot step rabbit_auth_mechanism_plain defined by app rabbit
2019-09-29 09:48:33.305 [info] <0.234.0> Running boot step rabbit_exchange_type_direct defined by app rabbit
2019-09-29 09:48:33.305 [info] <0.234.0> Running boot step rabbit_exchange_type_fanout defined by app rabbit
2019-09-29 09:48:33.306 [info] <0.234.0> Running boot step rabbit_exchange_type_headers defined by app rabbit
2019-09-29 09:48:33.306 [info] <0.234.0> Running boot step rabbit_exchange_type_topic defined by app rabbit
2019-09-29 09:48:33.306 [info] <0.234.0> Running boot step rabbit_mirror_queue_mode_all defined by app rabbit
2019-09-29 09:48:33.306 [info] <0.234.0> Running boot step rabbit_mirror_queue_mode_exactly defined by app rabbit
2019-09-29 09:48:33.306 [info] <0.234.0> Running boot step rabbit_mirror_queue_mode_nodes defined by app rabbit
2019-09-29 09:48:33.306 [info] <0.234.0> Running boot step rabbit_priority_queue defined by app rabbit
2019-09-29 09:48:33.307 [info] <0.234.0> Priority queues enabled, real BQ is rabbit_variable_queue
2019-09-29 09:48:33.307 [info] <0.234.0> Running boot step rabbit_queue_location_client_local defined by app rabbit
2019-09-29 09:48:33.307 [info] <0.234.0> Running boot step rabbit_queue_location_min_masters defined by app rabbit
2019-09-29 09:48:33.307 [info] <0.234.0> Running boot step kernel_ready defined by app rabbit
2019-09-29 09:48:33.307 [info] <0.234.0> Running boot step rabbit_sysmon_minder defined by app rabbit
2019-09-29 09:48:33.307 [info] <0.234.0> Running boot step rabbit_epmd_monitor defined by app rabbit
2019-09-29 09:48:33.308 [info] <0.234.0> Running boot step guid_generator defined by app rabbit
2019-09-29 09:48:33.313 [info] <0.234.0> Running boot step rabbit_node_monitor defined by app rabbit
2019-09-29 09:48:33.313 [info] <0.421.0> Starting rabbit_node_monitor
2019-09-29 09:48:33.313 [info] <0.234.0> Running boot step delegate_sup defined by app rabbit
2019-09-29 09:48:33.314 [info] <0.234.0> Running boot step rabbit_memory_monitor defined by app rabbit
2019-09-29 09:48:33.314 [info] <0.234.0> Running boot step core_initialized defined by app rabbit
2019-09-29 09:48:33.314 [info] <0.234.0> Running boot step upgrade_queues defined by app rabbit
2019-09-29 09:48:33.355 [info] <0.234.0> message_store upgrades: 1 to apply
2019-09-29 09:48:33.355 [info] <0.234.0> message_store upgrades: Applying rabbit_variable_queue:move_messages_to_vhost_store
2019-09-29 09:48:33.356 [info] <0.234.0> message_store upgrades: No durable queues found. Skipping message store migration
2019-09-29 09:48:33.356 [info] <0.234.0> message_store upgrades: Removing the old message store data
2019-09-29 09:48:33.356 [info] <0.234.0> message_store upgrades: All upgrades applied successfully
2019-09-29 09:48:33.402 [info] <0.234.0> Running boot step rabbit_connection_tracking defined by app rabbit
2019-09-29 09:48:33.402 [info] <0.234.0> Running boot step rabbit_connection_tracking_handler defined by app rabbit
2019-09-29 09:48:33.402 [info] <0.234.0> Running boot step rabbit_exchange_parameters defined by app rabbit
2019-09-29 09:48:33.403 [info] <0.234.0> Running boot step rabbit_mirror_queue_misc defined by app rabbit
2019-09-29 09:48:33.403 [info] <0.234.0> Running boot step rabbit_policies defined by app rabbit
2019-09-29 09:48:33.404 [info] <0.234.0> Running boot step rabbit_policy defined by app rabbit
2019-09-29 09:48:33.405 [info] <0.234.0> Running boot step rabbit_queue_location_validator defined by app rabbit
2019-09-29 09:48:33.405 [info] <0.234.0> Running boot step rabbit_vhost_limit defined by app rabbit
2019-09-29 09:48:33.405 [info] <0.234.0> Running boot step rabbit_mgmt_reset_handler defined by app rabbitmq_management
2019-09-29 09:48:33.405 [info] <0.234.0> Running boot step rabbit_mgmt_db_handler defined by app rabbitmq_management_agent
2019-09-29 09:48:33.405 [info] <0.234.0> Management plugin: using rates mode 'basic'
2019-09-29 09:48:33.405 [info] <0.234.0> Running boot step recovery defined by app rabbit
2019-09-29 09:48:33.407 [info] <0.234.0> Running boot step load_definitions defined by app rabbitmq_management
2019-09-29 09:48:33.407 [info] <0.234.0> Running boot step empty_db_check defined by app rabbit
2019-09-29 09:48:33.407 [info] <0.234.0> Adding vhost '/'
2019-09-29 09:48:33.433 [info] <0.462.0> Making sure data directory '/var/lib/rabbitmq/mnesia/rabbit@rabbitmq-1/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L' for vhost '/' exists
2019-09-29 09:48:33.439 [info] <0.462.0> Starting message stores for vhost '/'
2019-09-29 09:48:33.440 [info] <0.466.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_transient": using rabbit_msg_store_ets_index to provide index
2019-09-29 09:48:33.441 [info] <0.462.0> Started message store of type transient for vhost '/'
2019-09-29 09:48:33.441 [info] <0.469.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": using rabbit_msg_store_ets_index to provide index
2019-09-29 09:48:33.442 [warning] <0.469.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": rebuilding indices from scratch
2019-09-29 09:48:33.443 [info] <0.462.0> Started message store of type persistent for vhost '/'
2019-09-29 09:48:33.445 [info] <0.234.0> Creating user 'guest'
2019-09-29 09:48:33.448 [info] <0.234.0> Setting user tags for user 'guest' to [administrator]
2019-09-29 09:48:33.452 [info] <0.234.0> Setting permissions for 'guest' in '/' to '.*', '.*', '.*'
2019-09-29 09:48:33.455 [info] <0.234.0> Running boot step rabbit_looking_glass defined by app rabbit
2019-09-29 09:48:33.455 [info] <0.234.0> Running boot step rabbit_core_metrics_gc defined by app rabbit
2019-09-29 09:48:33.456 [info] <0.234.0> Running boot step background_gc defined by app rabbit
2019-09-29 09:48:33.456 [info] <0.234.0> Running boot step connection_tracking defined by app rabbit
2019-09-29 09:48:33.461 [info] <0.234.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@rabbitmq-1'
2019-09-29 09:48:33.465 [info] <0.234.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@rabbitmq-1'
2019-09-29 09:48:33.465 [info] <0.234.0> Running boot step routing_ready defined by app rabbit
2019-09-29 09:48:33.466 [info] <0.234.0> Running boot step pre_flight defined by app rabbit
2019-09-29 09:48:33.466 [info] <0.234.0> Running boot step notify_cluster defined by app rabbit
2019-09-29 09:48:33.466 [info] <0.234.0> Running boot step networking defined by app rabbit
2019-09-29 09:48:33.468 [info] <0.515.0> started TCP listener on [::]:5672
2019-09-29 09:48:33.468 [info] <0.234.0> Running boot step direct_client defined by app rabbit
2019-09-29 09:48:33.469 [info] <0.521.0> Peer discovery: enabling node cleanup (will only log warnings). Check interval: 30 seconds.
2019-09-29 09:48:33.520 [info] <0.575.0> Management plugin: HTTP (non-TLS) listener started on port 15672
2019-09-29 09:48:33.521 [info] <0.681.0> Statistics database started.
2019-09-29 09:48:33.521 [info] <0.680.0> Starting worker pool 'management_worker_pool' with 3 processes in it
 completed with 5 plugins.
2019-09-29 09:48:33.791 [info] <0.8.0> Server startup complete; 5 plugins started.
 * rabbitmq_management
 * rabbitmq_management_agent
 * rabbitmq_web_dispatch
 * rabbitmq_peer_discovery_k8s
 * rabbitmq_peer_discovery_common

rabbitmq cluster_status

rabbitmq-0
root@rabbitmq-0:/# rabbitmqctl cluster_status
Cluster status of node rabbit@rabbitmq-0 ...
[{nodes,[{disc,['rabbit@rabbitmq-0']}]},
 {running_nodes,['rabbit@rabbitmq-0']},
 {cluster_name,<<"rabbit@rabbitmq-0.rabbitmq-headless-srv.default.svc.cluster.local.">>},
 {partitions,[]},
 {alarms,[{'rabbit@rabbitmq-0',[]}]}]

rabbitmq-1
root@rabbitmq-1:/# rabbitmqctl cluster_status
Cluster status of node rabbit@rabbitmq-1 ...
[{nodes,[{disc,['rabbit@rabbitmq-1']}]},
 {running_nodes,['rabbit@rabbitmq-1']},
 {cluster_name,<<"rabbit@rabbitmq-1.rabbitmq-headless-srv.default.svc.cluster.local.">>},
 {partitions,[]},
 {alarms,[{'rabbit@rabbitmq-1',[]}]}]

rabbitmq_configmap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: rabbitmq-config
  namespace: default
data:
  enabled_plugins: |
      [rabbitmq_management,rabbitmq_peer_discovery_k8s].

  rabbitmq.conf: |
      ## Cluster formation. See https://www.rabbitmq.com/cluster-formation.html to learn more.
      cluster_formation.peer_discovery_backend  = rabbit_peer_discovery_k8s
      cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
      #cluster_formation.k8s.host = 10.254.0.1
      cluster_formation.k8s.port = 443
      cluster_formation.k8s.scheme = https
      cluster_formation.k8s.cert_path = /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      cluster_formation.k8s.token_path = /var/run/secrets/kubernetes.io/serviceaccount/token
      cluster_formation.k8s.namespace_path = /var/run/secrets/kubernetes.io/serviceaccount/namespace
      cluster_formation.randomized_startup_delay_range.min = 0
      cluster_formation.randomized_startup_delay_range.max = 2
      # service_name must be set, otherwise the pod will not start properly; once set here, the K8S_SERVICE_NAME env variable in the StatefulSet can be omitted
      cluster_formation.k8s.service_name = rabbitmq-headless-srv
      # hostname_suffix must be set, otherwise the nodes cannot form a cluster
      cluster_formation.k8s.hostname_suffix = .rabbitmq-headless-srv.default.svc.cluster.local
      ## Should RabbitMQ node name be computed from the pod's hostname or IP address?
      ## IP addresses are not stable, so using [stable] hostnames is recommended when possible.
      ## Set to "hostname" to use pod hostnames.
      ## When this value is changed, so should the variable used to set the RABBITMQ_NODENAME
      ## environment variable.
      cluster_formation.k8s.address_type = hostname
      ## How often should node cleanup checks run?
      cluster_formation.node_cleanup.interval = 30
      ## Set to false if automatic removal of unknown/absent nodes
      ## is desired. This can be dangerous, see
      ##  * https://www.rabbitmq.com/cluster-formation.html#node-health-checks-and-cleanup
      ##  * https://groups.google.com/forum/#!msg/rabbitmq-users/wuOfzEywHXo/k8z_HWIkBgAJ
      cluster_formation.node_cleanup.only_log_warning = true
      cluster_partition_handling = autoheal
      ## See https://www.rabbitmq.com/ha.html#master-migration-data-locality
      queue_master_locator=min-masters
      ## See https://www.rabbitmq.com/access-control.html#loopback-users
      loopback_users.guest = false
      #the memory limit
      vm_memory_high_watermark.absolute = 2GB
      #the disk limit
      disk_free_limit.absolute = 4GB

rabbitmq_statefulsets.yaml

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq
  namespace: default
spec:
  selector:
    matchLabels:
      app: rabbitmq
  serviceName: rabbitmq-headless-srv
  replicas: 2
  template:
    metadata:
      labels:
        app: rabbitmq
    spec:
      serviceAccountName: rabbitmq
      terminationGracePeriodSeconds: 10
      containers:
      - name: rabbitmq
        image: rabbitmq:k8s-318
        resources:
          limits:
            cpu: 1
            memory: 2Gi
          requests:
            cpu: 0.5
            memory: 1Gi
        volumeMounts:
          - name: config-volume
            mountPath: /etc/rabbitmq
          - name: rabbitmq-pvc
            mountPath: /var/lib/rabbitmq/mnesia
        ports:
          - name: http
            protocol: TCP
            containerPort: 15672
          - name: amqp
            protocol: TCP
            containerPort: 5672
        livenessProbe:
          exec:
            command: ["rabbitmqctl", "status"]
          initialDelaySeconds: 60
          # See https://www.rabbitmq.com/monitoring.html for monitoring frequency recommendations.
          periodSeconds: 60
          timeoutSeconds: 5
        readinessProbe:
          exec:
            command: ["rabbitmqctl", "status"]
          initialDelaySeconds: 20
          periodSeconds: 60
          timeoutSeconds: 10
        imagePullPolicy: IfNotPresent
        env:
          - name: MY_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: MY_POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
          - name: RABBITMQ_USE_LONGNAME
            value: "false"
          - name: K8S_SERVICE_NAME
            value: "rabbitmq-headless-srv"
          - name: RABBITMQ_NODENAME
            #value: rabbit@$(MY_POD_NAME).$(K8S_SERVICE_NAME).$(MY_POD_NAMESPACE).svc.cluster.local
            value: rabbit@$(MY_POD_NAME)
          - name: K8S_HOSTNAME_SUFFIX
            value: ".$(K8S_SERVICE_NAME).$(MY_POD_NAMESPACE).svc.cluster.local"
          - name: RABBITMQ_ERLANG_COOKIE
            value: "mycookie"
      volumes:
      - name: config-volume
        configMap:
          name: rabbitmq-config
          items:
          - key: rabbitmq.conf
            path: rabbitmq.conf
          - key: enabled_plugins
            path: enabled_plugins
      - name: rabbitmq-pvc
        hostPath: 
          path: /pacloud/k8s/rabbitmq

Support nonstandard node name (hostname) in endpoint payload

Currently hostname lookup fails with OpenShift, because the OpenShift API doesn't include hostname in Endpoints addresses:
https://docs.openshift.com/container-platform/3.10/rest_api/api/v1.Endpoints.html

api call result:

2018-08-13 11:00:40.513 [debug] <0.226.0> Response: {ok,{{"HTTP/1.1",200,"OK"},[{"cache-control","no-store"},{"date","Mon, 13 Aug 2018 11:00:40 GMT"},{"content-length","1194"},{"content-type","application/json"}],"{\"kind\":\"Endpoints\",\"apiVersion\":\"v1\",\"metadata\":{\"name\":\"rabbitmq\",\"namespace\":\"dcrpi-omsf-dev0\",\"selfLink\":\"/api/v1/namespaces/dcrpi-omsf-dev0/endpoints/rabbitmq\",\"uid\":\"c8b25808-9a34-11e8-add2-02bdc501845b\",\"resourceVersion\":\"239344116\",\"creationTimestamp\":\"2018-08-07T11:26:53Z\",\"labels\":{\"app\":\"rabbitmq\"}},\"subsets\":[{\"addresses\":[{\"ip\":\"1.240.1.247\",\"nodeName\":\"ip-10-26-202-167.eu-central-1.compute.internal\",\"targetRef\":{\"kind\":\"Pod\",\"namespace\":\"dcrpi-omsf-dev0\",\"name\":\"rabbitmq-53-b6mr2\",\"uid\":\"18df67e0-9ed2-11e8-add2-02bdc501845b\",\"resourceVersion\":\"239250788\"}},{\"ip\":\"1.240.28.238\",\"nodeName\":\"ip-10-26-200-94.eu-central-1.compute.internal\",\"targetRef\":{\"kind\":\"Pod\",\"namespace\":\"dcrpi-omsf-dev0\",\"name\":\"rabbitmq-53-sgpzt\",\"uid\":\"23ef292f-9ed2-11e8-add2-02bdc501845b\",\"resourceVersion\":\"239251119\"}}],\"notReadyAddresses\":[{\"ip\":\"1.240.65.187\",\"nodeName\":\"ip-10-26-204-161.eu-central-1.compute.internal\",\"targetRef\":{\"kind\":\"Pod\",\"namespace\":\"dcrpi-omsf-dev0\",\"name\":\"rabbitmq-59-hcjm4\",\"uid\":\"12cb12fe-9ee8-11e8-add2-02bdc501845b\",\"resourceVersion\":\"239344115\"}}],\"ports\":[{\"name\":\"15671-tcp\",\"port\":15671,\"protocol\":\"TCP\"},{\"name\":\"5671-tcp\",\"port\":5671,\"protocol\":\"TCP\"}]}]}\n"}}

error message:

2018-08-13 11:00:40.514 [error] <0.225.0> CRASH REPORT Process <0.225.0> with 0 neighbours exited with reason: {{badkey,<<"hostname">>},[{maps,get,[<<"hostname">>,#{<<"ip">> => <<"1.240.65.187">>,<<"nodeName">> => <<"ip-10-26-204-161.eu-central-1.compute.internal">>,<<"targetRef">> => #{<<"kind">> => <<"Pod">>,<<"name">> => <<"rabbitmq-59-hcjm4">>,<<"namespace">> => <<"dcrpi-omsf-dev0">>,<<"resourceVersion">> => <<"239344115">>,<<"uid">> => <<"12cb12fe-9ee8-11e8-add2-02bdc501845b">>}}],[]},{rabbit_peer_discovery_k8s,get_address,1,[{file,"src/rabbit_peer_discovery_k8s.erl"},{line,172}]},{rabbit_peer_discovery_k8s,...},...]} in application_master:init/4 line 138

Please add a new cluster_formation.k8s.address_type option, "pod", that uses the pod name instead of the hostname. cluster_formation.k8s.hostname_suffix should also be taken into account.
Address type "ip" is not an option if you are using TLS for inter-node (clustering) traffic.

Why can't my RabbitMQ nodes discover each other?

Conf

OpenShift 3.9 environment
rbac.yml

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rabbitmq
  namespace: {{ namespace }}
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: endpoint-reader
  namespace: {{ namespace }}
rules:
- apiGroups: [""]
  resources: ["endpoints"]
  verbs: ["get"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: endpoint-reader
  namespace: {{ namespace }}
subjects:
- kind: ServiceAccount
  name: rabbitmq
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: endpoint-reader

statefulset.yml

---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: rabbitmq
  namespace: {{ namespace }}
spec:
  serviceName: rabbitmq
  replicas: 3
  volumeClaimTemplates:
    - metadata:
        name: rabbitmq-data
      spec:
        storageClassName: {{ sc_name }}
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 100Gi
  template:
    metadata:
      labels:
        app: rabbitmq
    spec:
      serviceAccountName: rabbitmq
      terminationGracePeriodSeconds: 10
      containers:
      - name: rabbitmq-k8s
        image: rabbitmq:3.7
        volumeMounts:
          - name: config-volume
            mountPath: /etc/rabbitmq
          - name: rabbitmq-data
            mountPath: /var/lib/rabbitmq
          - name: rabbitmq-logs
            mountPath: /var/log/rabbitmq
        ports:
          - name: http
            protocol: TCP
            containerPort: 15672
          - name: amqp
            protocol: TCP
            containerPort: 5672
        livenessProbe:
          exec:
            command: ["rabbitmqctl", "status"]
          initialDelaySeconds: 20
          periodSeconds: 30
          timeoutSeconds: 10
        readinessProbe:
          exec:
            command: ["rabbitmqctl", "status"]
          initialDelaySeconds: 20
          periodSeconds: 30
          timeoutSeconds: 10
        imagePullPolicy: IfNotPresent
        subdomain: mq
        env:
          - name: MY_POD_HOSTNAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
                #fieldPath: status.podIP
          - name: RABBITMQ_USE_LONGNAME
            value: "true"
          - name: K8S_SERVICE_NAME
            value: "rabbitmq"
          # See a note on cluster_formation.k8s.address_type in the config file section
          - name: RABBITMQ_NODENAME
            value: "rabbit@$(MY_POD_HOSTNAME).rabbitmq.{{ namespace }}.svc.cluster.local"
          - name: RABBITMQ_ERLANG_COOKIE
            value: "mycookie"
          - name: K8S_HOSTNAME_SUFFIX
            value: ".rabbitmq.{{ namespace }}.svc.cluster.local"
      volumes:
        - name: rabbitmq-logs
          emptyDir: {}
        - name: config-volume
          configMap:
            name: rabbitmq-configmap
            items:
            - key: rabbitmq_conf
              path: rabbitmq.conf
            - key: enabled_plugins
              path: enabled_plugins
            - key: definitions
              path: definitions.json

configmaps with file

---------- enabled_plugins
[rabbitmq_management,rabbitmq_peer_discovery_k8s].

--------- rabbitmq.conf
log.console = true
log.console.level = debug
cluster_formation.peer_discovery_backend  = rabbit_peer_discovery_k8s
cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
cluster_formation.k8s.address_type = hostname
cluster_formation.node_cleanup.interval = 30
cluster_formation.node_cleanup.only_log_warning = true
cluster_partition_handling = autoheal
queue_master_locator=min-masters
loopback_users.guest = false
management.load_definitions = /etc/rabbitmq/definitions.json

--------- definitions.json
{
  "vhosts": [
    {"name": "/"}
  ],
  "users": [
    {"name": "guest", "password": "guest", "tags": "administrator"}
  ],
  "permissions": [
    {"user": "guest", "vhost": "/", "configure": ".*", "write": ".*", "read": ".*"},
  ],
  "policies":[
    {"vhost": "/", "name": "ha-all", "pattern": ".*", "apply-to": "all", "definition": {"ha-mode":"all","ha-sync-mode":"automatic"}, "priority":0},
  ]
}

Question

I start 3 replicas and the PVCs bind normally. Every node starts fine, but they cannot discover each other.

---------- rabbitmq-0
$ rabbitmqctl cluster_status
Cluster status of node [email protected] ...
[{nodes,[{disc,['[email protected]']}]},
 {running_nodes,['[email protected]']},
 {cluster_name,<<"[email protected]">>},
 {partitions,[]},
 {alarms,[{'[email protected]',[]}]}]

-------- rabbitmq-1
Cluster status of node [email protected] ...
[{nodes,[{disc,['[email protected]']}]},
 {running_nodes,['[email protected]']},
 {cluster_name,<<"rabbit@lg-dev-web">>},
 {partitions,[]},
 {alarms,[{'[email protected]',[]}]}]

-------- rabbitmq-2
Cluster status of node [email protected] ...
[{nodes,[{disc,['[email protected]']}]},
 {running_nodes,['[email protected]']},
 {cluster_name,<<"[email protected]">>},
 {partitions,[]},
 {alarms,[{'[email protected]',[]}]}]

Documentation inconsistency regarding randomized startup delay

According to the documentation this plugin uses a randomized startup delay:

This backend relies on randomized startup delay to reduce the probability of a race condition during initial cluster formation (see below).

The logs of node startup don't support this statement:

2018-04-10 16:53:01.887 [info] <0.189.0> Peer discovery backend does not support locking, falling back to randomized delay
2018-04-10 16:53:01.888 [info] <0.189.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.

And if the only supported method of running a RabbitMQ cluster in Kubernetes is a StatefulSet, a randomized startup delay isn't necessary because Kubernetes will start Pods one after another (taken from the documentation):

The StatefulSet controller starts Pods one at a time, in order by their ordinal index. It waits until each Pod reports being Ready before starting the next one.

Read-only file system

Hi!

When I try to create my RabbitMQ deployment in my cluster, I get the following ConfigMap-related error.

sed: can't create temp file '/etc/rabbitmq/rabbitmq.confXXXXXX': Read-only file system

What can I do to solve the problem?
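
One common workaround, sketched below with hypothetical names, is to mount the ConfigMap read-only at a temporary path and copy it into a writable emptyDir that is mounted at /etc/rabbitmq, so the image's startup scripts can edit the file. A fragment of the pod spec:

      initContainers:
        - name: copy-rabbitmq-config
          image: busybox              # hypothetical helper image
          command: ["sh", "-c", "cp /tmp/rabbitmq/* /etc/rabbitmq/"]
          volumeMounts:
            - name: config-volume     # the ConfigMap, read-only
              mountPath: /tmp/rabbitmq
            - name: config-rw         # writable copy used by the broker
              mountPath: /etc/rabbitmq
      containers:
        - name: rabbitmq
          volumeMounts:
            - name: config-rw
              mountPath: /etc/rabbitmq
      volumes:
        - name: config-volume
          configMap:
            name: rabbitmq-config     # hypothetical ConfigMap name
        - name: config-rw
          emptyDir: {}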

RabbitMQ Pod fails with `nxdomain`


2018-01-29 05:48:55.120 [info] <0.33.0> Application lager started on node '[email protected]'
2018-01-29 05:48:55.621 [info] <0.33.0> Application inets started on node '[email protected]'
2018-01-29 05:48:55.622 [info] <0.33.0> Application crypto started on node '[email protected]'
2018-01-29 05:48:55.628 [info] <0.33.0> Application os_mon started on node '[email protected]'
2018-01-29 05:48:55.629 [info] <0.33.0> Application recon started on node '[email protected]'
2018-01-29 05:48:55.787 [info] <0.33.0> Application mnesia started on node '[email protected]'
2018-01-29 05:48:55.788 [info] <0.33.0> Application cowlib started on node '[email protected]'
2018-01-29 05:48:55.788 [info] <0.33.0> Application jsx started on node '[email protected]'
2018-01-29 05:48:55.788 [info] <0.33.0> Application xmerl started on node '[email protected]'
2018-01-29 05:48:55.788 [info] <0.33.0> Application asn1 started on node '[email protected]'
2018-01-29 05:48:55.788 [info] <0.33.0> Application public_key started on node '[email protected]'
2018-01-29 05:48:55.851 [info] <0.33.0> Application ssl started on node '[email protected]'
2018-01-29 05:48:55.856 [info] <0.33.0> Application ranch started on node '[email protected]'
2018-01-29 05:48:55.856 [info] <0.33.0> Application ranch_proxy_protocol started on node '[email protected]'
2018-01-29 05:48:55.858 [info] <0.33.0> Application cowboy started on node '[email protected]'
2018-01-29 05:48:55.858 [info] <0.33.0> Application rabbit_common started on node '[email protected]'
2018-01-29 05:48:55.866 [info] <0.33.0> Application amqp_client started on node '[email protected]'
2018-01-29 05:48:55.876 [info] <0.193.0>
Starting RabbitMQ 3.7.2 on Erlang 20.1.7
Copyright (C) 2007-2017 Pivotal Software, Inc.
Licensed under the MPL. See http://www.rabbitmq.com/

## RabbitMQ 3.7.2. Copyright (C) 2007-2017 Pivotal Software, Inc.

########## Licensed under the MPL. See http://www.rabbitmq.com/

########## Logs:

          Starting broker...

2018-01-29 05:48:55.893 [info] <0.193.0>
node : [email protected]
home dir : /var/lib/rabbitmq
config file(s) : /etc/rabbitmq/rabbitmq.conf
cookie hash : XhdCf8zpVJeJ0EHyaxszPg==
log(s) :
database dir : /var/lib/rabbitmq/mnesia/[email protected]
2018-01-29 05:48:58.148 [info] <0.201.0> Memory high watermark set to 1580 MiB (1657449676 bytes) of 3951 MiB (4143624192 bytes) total
2018-01-29 05:48:58.153 [info] <0.203.0> Enabling free disk space monitoring
2018-01-29 05:48:58.153 [info] <0.203.0> Disk free limit set to 50MB
2018-01-29 05:48:58.156 [info] <0.205.0> Limiting to approx 1048476 file handles (943626 sockets)
2018-01-29 05:48:58.156 [info] <0.206.0> FHC read buffering: OFF
2018-01-29 05:48:58.156 [info] <0.206.0> FHC write buffering: ON
2018-01-29 05:48:58.157 [info] <0.193.0> Node database directory at /var/lib/rabbitmq/mnesia/[email protected] is empty. Assuming we need to join an existing cluster or initialise from scratch...
2018-01-29 05:48:58.157 [info] <0.193.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2018-01-29 05:48:58.157 [info] <0.193.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2018-01-29 05:48:58.157 [info] <0.193.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2018-01-29 05:49:06.159 [info] <0.193.0> Failed to get nodes from k8s - {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},
{inet,[inet],nxdomain}]}
2018-01-29 05:49:06.159 [error] <0.192.0> CRASH REPORT Process <0.192.0> with 0 neighbours exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},\n {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 163 in application_master:init/4 line 134
2018-01-29 05:49:06.160 [info] <0.33.0> Application rabbit exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},\n {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 163
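
nxdomain for kubernetes.default.svc.cluster.local means the pod could not resolve the API server's in-cluster DNS name at all, which points at cluster DNS (kube-dns/CoreDNS) or pod dnsPolicy problems rather than at this plugin. A quick way to verify DNS from inside the cluster (the pod name and image are arbitrary):

kubectl run -it --rm dns-check --image=busybox:1.28 --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local

If that lookup fails too, the cluster DNS add-on (and the pod's dnsPolicy) needs fixing before peer discovery can work.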

k8s_statefulsets livenessProbe and readinessProbe incorrectly use command: ["rabbitmqctl", "status"]

https://github.com/rabbitmq/rabbitmq-peer-discovery-k8s/blob/master/examples/k8s_statefulsets/rabbitmq.yaml

When reporting readiness and liveness to K8s in a StatefulSet, a node should not report healthy until it has joined the other RabbitMQ nodes in the cluster. rabbitmqctl status will report healthy even when the Erlang cookie is wrong and the node was unable to join the cluster. This is a bug and should be addressed by using a different command, one that reports health based on cluster membership status as well.

I am not suggesting that RabbitMQ should have any knowledge of K8s at all. I'm suggesting that RabbitMQ should be aware of its own cluster state no matter where it runs, and that an individual node should be able to report on its own cluster membership state even when running outside of any orchestrator.

Ideally, there would be a rabbitmqctl command that returns a non-zero exit code when the node has failed to join any other members of the cluster, completely unrelated to K8s or any other orchestrator. That command could then be used for the readinessProbe in K8s.
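
Until such a command exists, one possible stopgap is a readinessProbe that asks the node how many running cluster members it currently sees via rabbitmqctl eval; if the guard expression does not match, eval raises and should exit non-zero. This is only a sketch (the expected member count and probe timings are illustrative, and the exit-code behaviour of eval on a failed match should be verified for the RabbitMQ version in use):

        readinessProbe:
          exec:
            command:
            - sh
            - -c
            # fail unless this node sees at least one other running cluster member besides itself (>= 2 including itself)
            - "rabbitmqctl eval 'true = length(rabbit_mnesia:cluster_nodes(running)) >= 2.'"
          initialDelaySeconds: 20
          periodSeconds: 30
          timeoutSeconds: 10

Note that a membership-based readiness check interacts badly with a fresh OrderedReady StatefulSet (the first pod can never become ready on its own), so it needs either podManagementPolicy: Parallel or a lower threshold during initial cluster formation.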

I'm running on K8s v1.8.2 and using the rabbitmq:3.7-alpine Docker image.

I started a 3-node cluster with a randomly generated Erlang cookie secret, then scaled that cluster up to 5, but the two new nodes got a different randomly generated secret.

You can see the new nodes fail to join in the logs from the original 3-node cluster.

2017-12-19 15:26:44.957 [error] <0.8251.0> ** Connection attempt from disallowed node '[email protected]' **
2017-12-19 15:27:08.206 [error] <0.8324.0> ** Connection attempt from disallowed node '[email protected]' **

rabbitmqctl status

bash-4.4# rabbitmqctl status
Status of node [email protected] ...
[{pid,344},
 {running_applications,
     [{rabbitmq_federation_management,"RabbitMQ Federation Management",
          "3.7.0"},
      {rabbitmq_federation,"RabbitMQ Federation","3.7.0"},
      {rabbitmq_consistent_hash_exchange,"Consistent Hash Exchange Type",
          "3.7.0"},
      {rabbitmq_shovel_management,
          "Management extension for the Shovel plugin","3.7.0"},
      {rabbitmq_amqp1_0,"AMQP 1.0 support for RabbitMQ","3.7.0"},
      {rabbitmq_management,"RabbitMQ Management Console","3.7.0"},
      {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.7.0"},
      {rabbitmq_mqtt,"RabbitMQ MQTT Adapter","3.7.0"},
      {rabbitmq_management_agent,"RabbitMQ Management Agent","3.7.0"},
      {rabbitmq_web_stomp,"Rabbit WEB-STOMP - WebSockets to Stomp adapter",
          "3.7.0"},
      {rabbitmq_peer_discovery_k8s,
          "Kubernetes-based RabbitMQ peer discovery backend","3.7.0"},
      {rabbitmq_peer_discovery_common,
          "Modules shared by various peer discovery backends","3.7.0"},
      {rabbitmq_stomp,"RabbitMQ STOMP plugin","3.7.0"},
      {rabbitmq_shovel,"Data Shovel for RabbitMQ","3.7.0"},
      {rabbit,"RabbitMQ","3.7.0"},
      {amqp_client,"RabbitMQ AMQP Client","3.7.0"},
      {rabbit_common,
          "Modules shared by rabbitmq-server and rabbitmq-erlang-client",
          "3.7.0"},
      {recon,"Diagnostic tools for production use","2.3.2"},
      {ranch_proxy_protocol,"Ranch Proxy Protocol Transport","1.4.2"},
      {cowboy,"Small, fast, modern HTTP server.","2.0.0"},
      {ranch,"Socket acceptor pool for TCP protocols.","1.4.0"},
      {amqp10_client,"AMQP 1.0 client from the RabbitMQ Project","3.7.0"},
      {ssl,"Erlang/OTP SSL application","8.2.2"},
      {public_key,"Public key infrastructure","1.5.1"},
      {asn1,"The Erlang ASN1 compiler version 5.0.3","5.0.3"},
      {cowlib,"Support library for manipulating Web protocols.","2.0.0"},
      {mnesia,"MNESIA  CXC 138 12","4.15.1"},
      {amqp10_common,
          "Modules shared by rabbitmq-amqp1.0 and rabbitmq-amqp1.0-client",
          "3.7.0"},
      {jsx,"a streaming, evented json parsing toolkit","2.8.2"},
      {os_mon,"CPO  CXC 138 46","2.4.3"},
      {crypto,"CRYPTO","4.1"},
      {xmerl,"XML parser","1.3.15"},
      {inets,"INETS  CXC 138 49","6.4.4"},
      {lager,"Erlang logging framework","3.5.1"},
      {goldrush,"Erlang event stream processor","0.1.9"},
      {compiler,"ERTS  CXC 138 10","7.1.3"},
      {syntax_tools,"Syntax tools","2.1.3"},
      {sasl,"SASL  CXC 138 11","3.1"},
      {stdlib,"ERTS  CXC 138 10","3.4.2"},
      {kernel,"ERTS  CXC 138 10","5.4"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang/OTP 20 [erts-9.1.5] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:64] [hipe] [kernel-poll:true]\n"},
 {memory,
     [{connection_readers,0},
      {connection_writers,0},
      {connection_channels,0},
      {connection_other,2840},
      {queue_procs,0},
      {queue_slave_procs,0},
      {plugins,1032392},
      {other_proc,21626568},
      {metrics,199000},
      {mgmt_db,152952},
      {mnesia,94032},
      {other_ets,2391592},
      {binary,517704},
      {msg_index,29104},
      {code,33961625},
      {atom,1476769},
      {other_system,30720734},
      {allocated_unused,37760768},
      {reserved_unallocated,1581056},
      {strategy,rss},
      {total,[{erlang,92205312},{rss,131547136},{allocated,129966080}]}]},
 {alarms,[]},
 {listeners,
     [{clustering,25672,"::"},
      {amqp,5672,"::"},
      {stomp,61613,"::"},
      {'http/web-stomp',15674,"::"},
      {mqtt,1883,"::"},
      {http,15672,"::"}]},
 {vm_memory_calculation_strategy,rss},
 {vm_memory_high_watermark,{absolute,"256MB"}},
 {vm_memory_limit,256000000},
 {disk_free_limit,50000000},
 {disk_free,7848427520},
 {file_descriptors,
     [{total_limit,1048476},
      {total_used,2},
      {sockets_limit,943626},
      {sockets_used,0}]},
 {processes,[{limit,1048576},{used,447}]},
 {run_queue,0},
 {uptime,229},
 {kernel,{net_ticktime,60}}]

rabbitmqctl environment

bash-4.4# rabbitmqctl environment
Application environment of node [email protected] ...
[{amqp10_client,[]},
 {amqp10_common,[]},
 {amqp_client,[{prefer_ipv6,false},{ssl_options,[]}]},
 {asn1,[]},
 {compiler,[]},
 {cowboy,[]},
 {cowlib,[]},
 {crypto,[{fips_mode,false}]},
 {goldrush,[]},
 {inets,[]},
 {jsx,[]},
 {kernel,
     [{error_logger,tty},
      {inet_default_connect_options,[{nodelay,true}]},
      {inet_dist_listen_max,25672},
      {inet_dist_listen_min,25672}]},
 {lager,
     [{async_threshold,20},
      {async_threshold_window,5},
      {colored,false},
      {colors,
          [{debug,"\e[0;38m"},
           {info,"\e[1;37m"},
           {notice,"\e[1;36m"},
           {warning,"\e[1;33m"},
           {error,"\e[1;31m"},
           {critical,"\e[1;35m"},
           {alert,"\e[1;44m"},
           {emergency,"\e[1;41m"}]},
      {crash_log,"log/crash.log"},
      {crash_log_count,5},
      {crash_log_date,"$D0"},
      {crash_log_msg_size,65536},
      {crash_log_size,10485760},
      {error_logger_format_raw,true},
      {error_logger_hwm,100},
      {error_logger_redirect,true},
      {extra_sinks,
          [{error_logger_lager_event,
               [{handlers,[{lager_forwarder_backend,[lager_event,inherit]}]},
                {rabbit_handlers,
                    [{lager_forwarder_backend,[lager_event,inherit]}]}]},
           {rabbit_log_lager_event,
               [{handlers,[{lager_forwarder_backend,[lager_event,inherit]}]},
                {rabbit_handlers,
                    [{lager_forwarder_backend,[lager_event,inherit]}]}]},
           {rabbit_log_channel_lager_event,
               [{handlers,[{lager_forwarder_backend,[lager_event,inherit]}]},
                {rabbit_handlers,
                    [{lager_forwarder_backend,[lager_event,inherit]}]}]},
           {rabbit_log_connection_lager_event,
               [{handlers,[{lager_forwarder_backend,[lager_event,inherit]}]},
                {rabbit_handlers,
                    [{lager_forwarder_backend,[lager_event,inherit]}]}]},
           {rabbit_log_mirroring_lager_event,
               [{handlers,[{lager_forwarder_backend,[lager_event,inherit]}]},
                {rabbit_handlers,
                    [{lager_forwarder_backend,[lager_event,inherit]}]}]},
           {rabbit_log_queue_lager_event,
               [{handlers,[{lager_forwarder_backend,[lager_event,inherit]}]},
                {rabbit_handlers,
                    [{lager_forwarder_backend,[lager_event,inherit]}]}]},
           {rabbit_log_federation_lager_event,
               [{handlers,[{lager_forwarder_backend,[lager_event,inherit]}]},
                {rabbit_handlers,
                    [{lager_forwarder_backend,[lager_event,inherit]}]}]},
           {rabbit_log_upgrade_lager_event,
               [{handlers,[{lager_forwarder_backend,[lager_event,inherit]}]},
                {rabbit_handlers,
                    [{lager_forwarder_backend,[lager_event,inherit]}]}]}]},
      {handlers,
          [{lager_console_backend,
               [{formatter_config,
                    [date," ",time," ",color,"[",severity,"] ",
                     {pid,[]},
                     " ",message,"\n"]},
                {level,info}]}]},
      {log_root,"/var/log/rabbitmq"},
      {rabbit_handlers,
          [{lager_console_backend,
               [{formatter_config,
                    [date," ",time," ",color,"[",severity,"] ",
                     {pid,[]},
                     " ",message,"\n"]},
                {level,info}]}]}]},
 {mnesia,[{dir,"/var/lib/rabbitmq/mnesia/[email protected]"}]},
 {os_mon,
     [{start_cpu_sup,false},
      {start_disksup,false},
      {start_memsup,false},
      {start_os_sup,false}]},
 {public_key,[]},
 {rabbit,
     [{auth_backends,[rabbit_auth_backend_internal]},
      {auth_mechanisms,['PLAIN','AMQPLAIN']},
      {autocluster,
          [{peer_discovery_backend,rabbit_peer_discovery_classic_config}]},
      {background_gc_enabled,false},
      {background_gc_target_interval,60000},
      {backing_queue_module,rabbit_priority_queue},
      {channel_max,0},
      {channel_operation_timeout,15000},
      {cluster_formation,
          [{peer_discovery_backend,rabbit_peer_discovery_k8s},
           {node_cleanup,
               [{cleanup_interval,10},{cleanup_only_log_warning,false}]},
           {peer_discovery_k8s,
               [{k8s_host,"kubernetes.default.svc.cluster.local"},
                {k8s_address_type,ip}]}]},
      {cluster_keepalive_interval,10000},
      {cluster_nodes,{[],disc}},
      {cluster_partition_handling,autoheal},
      {collect_statistics,fine},
      {collect_statistics_interval,5000},
      {config_entry_decoder,
          [{cipher,aes_cbc256},
           {hash,sha512},
           {iterations,1000},
           {passphrase,undefined}]},
      {connection_max,infinity},
      {credit_flow_default_credit,{400,200}},
      {default_consumer_prefetch,{false,0}},
      {default_permissions,[<<".*">>,<<".*">>,<<".*">>]},
      {default_user,<<"admin">>},
      {default_user_tags,[administrator]},
      {default_vhost,<<"/">>},
      {delegate_count,16},
      {disk_free_limit,50000000},
      {disk_monitor_failure_retries,10},
      {disk_monitor_failure_retry_interval,120000},
      {enabled_plugins_file,"/etc/rabbitmq/enabled_plugins"},
      {fhc_read_buffering,false},
      {fhc_write_buffering,true},
      {frame_max,131072},
      {halt_on_upgrade_failure,true},
      {handshake_timeout,10000},
      {heartbeat,60},
      {hipe_compile,true},
      {hipe_modules,
          [rabbit_reader,rabbit_channel,gen_server2,rabbit_exchange,
           rabbit_command_assembler,rabbit_framing_amqp_0_9_1,rabbit_basic,
           rabbit_event,lists,queue,priority_queue,rabbit_router,rabbit_trace,
           rabbit_misc,rabbit_binary_parser,rabbit_exchange_type_direct,
           rabbit_guid,rabbit_net,rabbit_amqqueue_process,
           rabbit_variable_queue,rabbit_binary_generator,rabbit_writer,
           delegate,gb_sets,lqueue,sets,orddict,rabbit_amqqueue,
           rabbit_limiter,gb_trees,rabbit_queue_index,
           rabbit_exchange_decorator,gen,dict,ordsets,file_handle_cache,
           rabbit_msg_store,array,rabbit_msg_store_ets_index,rabbit_msg_file,
           rabbit_exchange_type_fanout,rabbit_exchange_type_topic,mnesia,
           mnesia_lib,rpc,mnesia_tm,qlc,sofs,proplists,credit_flow,pmon,
           ssl_connection,tls_connection,ssl_record,tls_record,gen_fsm,ssl]},
      {lager_default_file,tty},
      {lager_extra_sinks,
          [rabbit_log_lager_event,rabbit_log_channel_lager_event,
           rabbit_log_connection_lager_event,rabbit_log_mirroring_lager_event,
           rabbit_log_queue_lager_event,rabbit_log_federation_lager_event,
           rabbit_log_upgrade_lager_event]},
      {lager_log_root,"/var/log/rabbitmq"},
      {lager_upgrade_file,tty},
      {lazy_queue_explicit_gc_run_operation_threshold,1000},
      {log,[{console,[{enabled,true}]}]},
      {loopback_users,[]},
      {memory_monitor_interval,2500},
      {mirroring_flow_control,true},
      {mirroring_sync_batch_size,4096},
      {mnesia_table_loading_retry_limit,10},
      {mnesia_table_loading_retry_timeout,30000},
      {msg_store_credit_disc_bound,{4000,800}},
      {msg_store_file_size_limit,16777216},
      {msg_store_index_module,rabbit_msg_store_ets_index},
      {msg_store_io_batch_size,4096},
      {num_ssl_acceptors,10},
      {num_tcp_acceptors,10},
      {password_hashing_module,rabbit_password_hashing_sha256},
      {plugins_dir,"/opt/rabbitmq/plugins"},
      {plugins_expand_dir,
          "/var/lib/rabbitmq/mnesia/[email protected]"},
      {proxy_protocol,false},
      {queue_explicit_gc_run_operation_threshold,1000},
      {queue_index_embed_msgs_below,4096},
      {queue_index_max_journal_entries,32768},
      {reverse_dns_lookups,false},
      {server_properties,[]},
      {ssl_allow_poodle_attack,false},
      {ssl_apps,[asn1,crypto,public_key,ssl]},
      {ssl_cert_login_from,distinguished_name},
      {ssl_handshake_timeout,5000},
      {ssl_listeners,[]},
      {ssl_options,[]},
      {tcp_listen_options,
          [{backlog,128},
           {nodelay,true},
           {linger,{true,0}},
           {exit_on_close,false}]},
      {tcp_listeners,[5672]},
      {trace_vhosts,[]},
      {vhost_restart_strategy,continue},
      {vm_memory_calculation_strategy,rss},
      {vm_memory_high_watermark,{absolute,"256MB"}},
      {vm_memory_high_watermark_paging_ratio,0.5}]},
 {rabbit_common,[]},
 {rabbitmq_amqp1_0,
     [{default_user,"guest"},
      {default_vhost,<<"/">>},
      {protocol_strict_mode,false}]},
 {rabbitmq_consistent_hash_exchange,[]},
 {rabbitmq_federation,
     [{internal_exchange_check_interval,30000},
      {pgroup_name_cluster_id,false}]},
 {rabbitmq_federation_management,[]},
 {rabbitmq_management,
     [{cors_allow_origins,[]},
      {cors_max_age,1800},
      {http_log_dir,none},
      {listener,[{ssl,false},{port,15672}]},
      {load_definitions,none},
      {management_db_cache_multiplier,5},
      {process_stats_gc_timeout,300000},
      {stats_event_max_backlog,250}]},
 {rabbitmq_management_agent,
     [{rates_mode,basic},
      {sample_retention_policies,
          [{global,[{605,5},{3660,60},{29400,600},{86400,1800}]},
           {basic,[{605,5},{3600,60}]},
           {detailed,[{605,5}]}]}]},
 {rabbitmq_mqtt,
     [{allow_anonymous,true},
      {default_user,<<"guest">>},
      {exchange,<<"amq.topic">>},
      {num_ssl_acceptors,1},
      {num_tcp_acceptors,10},
      {prefetch,10},
      {proxy_protocol,false},
      {retained_message_store,rabbit_mqtt_retained_msg_store_dets},
      {retained_message_store_dets_sync_interval,2000},
      {ssl_cert_login,false},
      {ssl_listeners,[]},
      {subscription_ttl,86400000},
      {tcp_listen_options,[{backlog,128},{nodelay,true}]},
      {tcp_listeners,[1883]},
      {vhost,<<"/">>}]},
 {rabbitmq_peer_discovery_common,[]},
 {rabbitmq_peer_discovery_k8s,[]},
 {rabbitmq_shovel,
     [{defaults,
          [{prefetch_count,1000},
           {ack_mode,on_confirm},
           {publish_fields,[]},
           {publish_properties,[]},
           {reconnect_delay,5}]}]},
 {rabbitmq_shovel_management,[]},
 {rabbitmq_stomp,
     [{default_user,[{login,<<"guest">>},{passcode,<<"guest">>}]},
      {default_vhost,<<"/">>},
      {hide_server_info,false},
      {implicit_connect,false},
      {num_ssl_acceptors,1},
      {num_tcp_acceptors,10},
      {proxy_protocol,false},
      {ssl_cert_login,false},
      {ssl_listeners,[]},
      {tcp_listen_options,[{backlog,128},{nodelay,true}]},
      {tcp_listeners,[61613]},
      {trailing_lf,true}]},
 {rabbitmq_web_dispatch,[]},
 {rabbitmq_web_stomp,
     [{cowboy_opts,[]},
      {num_ssl_acceptors,1},
      {num_tcp_acceptors,10},
      {port,15674},
      {ssl_config,[]},
      {tcp_config,[]},
      {use_http_auth,false},
      {ws_frame,text}]},
 {ranch,[]},
 {ranch_proxy_protocol,[{proxy_protocol_timeout,55000},{ssl_accept_opts,[]}]},
 {recon,[]},
 {sasl,[{errlog_type,error},{sasl_error_logger,tty}]},
 {ssl,[{protocol_version,['tlsv1.2','tlsv1.1',tlsv1]}]},
 {stdlib,[]},
 {syntax_tools,[]},
 {xmerl,[]}]

RabbitMQ Kubernetes pod getting restarted

Hi Team,

System Information:

Helios 7.6 SP 1,
Kubernetes Version v1.14.1
docker version 18.09.5
helm version v2.13.1

Issue Details:

I am deploying RabbitMQ in a Kubernetes pod with a Helm chart and the pod is getting restarted continuously. When I checked the logs of rabbitmq-ha, I found the error message below. I am not sure how to fix this error; please help in resolving the issue.

Error in the log, obtained with kubectl logs rabbitmq-rabbitmq-ha-0 rabbitmq-ha:

2019-04-25 08:58:41.319 [error] <0.281.0> CRASH REPORT Process <0.281.0> with 0 neighbours exited with reason: {error,not_json} in application_master:init/4 line 138
2019-04-25 08:58:41.320 [info] <0.43.0> Application rabbit exited with reason: {error,not_json}
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{error,not_json}}}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{error,not_json}}}}})

Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done
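
{error,not_json} during boot usually means the file pointed at by load_definitions (management.load_definitions in 3.7) is not valid JSON, for example because of trailing commas or a templating mistake in the chart values. One way to check, assuming the definitions file lives at /etc/rabbitmq/definitions.json inside the pod (the path is an assumption and depends on the chart):

kubectl cp rabbitmq-rabbitmq-ha-0:/etc/rabbitmq/definitions.json ./definitions.json
python -m json.tool definitions.json > /dev/null && echo "valid JSON"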

If 3 pods start at the same time, the cluster sometimes becomes partitioned

This is my k8s Deployment:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: api-celery-rabbit
  name: api-celery-rabbit
spec:
  replicas: 3
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: api-celery-rabbit
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: api-celery-rabbit
    spec:
      containers:
      - command:
        - sh
        - -c
        - |
          set -e

          cat <<EOF > /etc/rabbitmq/rabbitmq.conf
          ## Clustering
          cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
          cluster_formation.k8s.service_name =  api-celery-rabbit
          cluster_formation.k8s.address_type = ip
          cluster_formation.k8s.host = kubernetes.default
          cluster_formation.node_cleanup.interval = 10
          cluster_formation.node_cleanup.only_log_warning = false
          cluster_partition_handling = autoheal
          ## queue master locator
          queue_master_locator=min-masters
          ## enable guest user
          loopback_users.guest = false
          EOF

          echo "[rabbitmq_management,rabbitmq_peer_discovery_k8s]." > /etc/rabbitmq/enabled_plugins

          sleep $(awk 'BEGIN {srand(); printf "%d\n", rand()*30}')
          exec docker-entrypoint.sh rabbitmq-server
        env:
        - name: RABBITMQ_VM_MEMORY_HIGH_WATERMARK
          value: "0.50"
        - name: MY_POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: RABBITMQ_ERLANG_COOKIE
          value: secretcookiehere
        - name: RABBITMQ_NODENAME
          value: rabbit@$(MY_POD_IP)
        - name: RABBITMQ_USE_LONGNAME
          value: "true"
        image: docker.gambit/rabbitmq:3.7.3
        imagePullPolicy: IfNotPresent
        lifecycle:
          postStart:
            exec:
              command:
              - sh
              - -c
              - |
                sleep 30
                rabbitmqctl set_policy ha-all "^celery" '{"ha-mode":"all"}'
        livenessProbe:
          exec:
            command:
            - rabbitmqctl
            - status
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        name: rabbitmq
        ports:
        - containerPort: 5672
          name: amqp
          protocol: TCP
        - containerPort: 15672
          name: http
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - rabbitmqctl
            - status
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
          requests:
            cpu: 20m
            memory: 512Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/rabbitmq
          name: data
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: api-celery-rabbit
      serviceAccountName: api-celery-rabbit
      terminationGracePeriodSeconds: 10
      volumes:
      - emptyDir: {}
        name: data

It often needs manual nursing to restart pods slowly one by one until they're all part of the same cluster.
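
A Deployment starts all replicas at the same time and gives them pod IPs that change on every restart, which is exactly the kind of race and churn the clustering guide warns about. The supported pattern is a StatefulSet behind a headless Service: the controller starts pods one at a time and gives each one a stable network identity. A fragment showing the relevant differences (names are illustrative; the pod template itself stays largely the same):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: api-celery-rabbit
spec:
  serviceName: api-celery-rabbit     # headless Service providing stable per-pod DNS names
  replicas: 3
  podManagementPolicy: OrderedReady  # the default: start pods one at a time
  selector:
    matchLabels:
      app: api-celery-rabbit
  template:
    # same pod template as above, but RABBITMQ_NODENAME should be derived from the
    # stable pod hostname (with cluster_formation.k8s.address_type = hostname)
    # rather than from the pod IP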

Error during cleanup

By configuring node_cleanup:

cluster_formation.peer_discovery_backend  = rabbit_peer_discovery_k8s
cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
cluster_formation.node_cleanup.interval=10
cluster_formation.node_cleanup.only_log_warning = false

it raises an error:

12:22:57.838 [error] ** Generic server rabbit_peer_discovery_cleanup terminating 
** Last message in was check_cluster
** When Server state == {state,10,false,{interval,#Ref<0.0.2.46887>}}
** Reason for termination == 
** {{case_clause,{ok,['[email protected]','[email protected]','[email protected]']}},[{rabbit_peer_discovery_cleanup,service_discovery_nodes,0,[{file,"src/rabbit_peer_discovery_cleanup.erl"},{line,297}]},{rabbit_peer_discovery_cleanup,maybe_cleanup,2,[{file,"src/rabbit_peer_discovery_cleanup.erl"},{line,241}]},{rabbit_peer_discovery_cleanup,handle_call,3,[{file,"src/rabbit_peer_discovery_cleanup.erl"},{line,130}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,615}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,647}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}

This is because list_nodes() on K8s returns:

➜  k8s git:(rabbitmq-peer-discovery-k8s_5) ✗ kubectl exec --namespace=test-rabbitmq $FIRST_POD rabbitmqctl eval 'rabbit_peer_discovery_k8s:list_nodes().'
{ok,['[email protected]','[email protected]','[email protected]']}

which does not match what Module:list_nodes() is expected to return here

Nodes fail to cluster because their Erlang cookies are out of sync

When using the example k8s resources in this repo, the RabbitMQ nodes fail to cluster because their Erlang cookie values are out of sync.

Would it make sense to automate that at some level as a part of the new peer discovery system?

At the very least, documenting a good way to set the cookie cluster-wide and/or adding it to the example would be good.
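
One way to keep the cookie in sync is to store it in a Secret and expose it to every pod, either by mounting it at /var/lib/rabbitmq/.erlang.cookie or via the RABBITMQ_ERLANG_COOKIE environment variable understood by the official Docker image. A hedged fragment (the Secret name and key are illustrative):

        env:
        - name: RABBITMQ_ERLANG_COOKIE
          valueFrom:
            secretKeyRef:
              # e.g. kubectl create secret generic rabbitmq-erlang-cookie --from-literal=cookie=<random string>
              name: rabbitmq-erlang-cookie
              key: cookie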

Peer discovery should not cross namespace boundary

I have multiple rabbitmq clusters deployed in a single k8s cluster using the k8s discovery backend, and they all end up trying to peer with each other.

Since each cluster has a separate Erlang cookie, the nodes never actually manage to cluster with each other, but the logs are littered with messages like:

2018-08-20 22:02:38.901 [error] <0.6449.0> ** Connection attempt from disallowed node '[email protected]' ** 

The k8s peer discovery backend should limit peering to its own namespace ($self.namespace) by default and allow an override, or at least offer an option to restrict peering to its own namespace.

To repro, build a cluster with the reference StatefulSet config, then change the namespace and Erlang cookie and deploy a second copy. Both clusters will discover each other's peers and will continuously log "disallowed node" messages.

k8s_statefulsets health check does not work correctly

If a node comes up and cannot join the cluster, it still reports healthy in K8s. It should report as unhealthy.

command: ["rabbitmqctl", "status"]

does not seem to be the best way to detect node health in a cluster.

About auto AUTOCLUSTER_CLEANUP

Hello, my idol @michaelklishin
I have used autocluster 0.8.0 with 3.6.x and peer-discovery-k8s with 3.7.x on K8s.

Usually we run RabbitMQ with a persistent volume to store the Mnesia database (messages/users/virtual hosts). When all nodes go down, the volume still has the data, so we can recover from it.

But when one node is down for a long time, the other nodes will remove it from the cluster. When it comes back up, the data in its volume says it should rejoin the cluster, but the cluster disagrees... so the node cannot start.

I have found one way to solve the problem: delete the volume. But that is not automatic.
I tried setting AUTOCLUSTER_CLEANUP=false, but then the cluster sometimes partitions... That is dangerous, so I gave up on it.

Could you give me some suggestions? I have been puzzled by this problem for a long time.
Thanks.
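
When a node has been removed from the cluster while it was down, its local database has to be reset before it can rejoin; deleting the volume achieves that implicitly, but the same can be done in place. A sketch of the manual sequence, run on the returning node (the peer node name is a placeholder; confirm the exact procedure against the clustering docs for your version):

rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl join_cluster rabbit@<a-running-peer>
rabbitmqctl start_app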

Support registration in order to support randomised startup delay (RSD)

Same as rabbitmq/rabbitmq-peer-discovery-aws#17 but in this plugin.

Until/if we split randomized startup delay and registration, this plugin should indicate that it supports registration even though it actually happens out of band, much like with its AWS counterpart.

Also worth looking into reducing the delay range if we can do it without ugly hacks (the server may call plugin callbacks later than we need in this case). As explained in #22, Kubernetes will initialise StatefulSet pods one by one, so technically RSD isn't necessary here.

References #22.

Random crash on startup

Hello, I'm using the StatefulSet example with the peer discovery plugin on a Kubernetes cluster (version 1.9.5). I have successfully set up the RBAC, the Service with ConfigMaps, and the StatefulSet with 3 replicas, and everything works fine. However, if I delete a pod (in order to exercise the clustering feature and the node cleanup process), the deleted pod randomly crashes on restart. If I delete the pod manually two or three times, it starts up without error. The same thing happens if I scale the StatefulSet up or down: it sometimes starts within seconds and sometimes crashes. I can't reliably reproduce the error since the behavior is random. Could you help me identify the problem? I'm pasting the log from a crashed pod below. Thanks

2018-06-20 10:41:52.503 [info] <0.33.0> Application lager started on node '[email protected]'
2018-06-20 10:41:52.953 [info] <0.33.0> Application crypto started on node '[email protected]'
2018-06-20 10:41:52.953 [info] <0.33.0> Application xmerl started on node '[email protected]'
2018-06-20 10:41:53.079 [info] <0.33.0> Application mnesia started on node '[email protected]'
2018-06-20 10:41:53.086 [info] <0.33.0> Application os_mon started on node '[email protected]'
2018-06-20 10:41:53.086 [info] <0.33.0> Application recon started on node '[email protected]'
2018-06-20 10:41:53.086 [info] <0.33.0> Application cowlib started on node '[email protected]'
2018-06-20 10:41:53.086 [info] <0.33.0> Application jsx started on node '[email protected]'
2018-06-20 10:41:53.153 [info] <0.33.0> Application inets started on node '[email protected]'
2018-06-20 10:41:53.153 [info] <0.33.0> Application asn1 started on node '[email protected]'
2018-06-20 10:41:53.154 [info] <0.33.0> Application public_key started on node '[email protected]'
2018-06-20 10:41:53.211 [info] <0.33.0> Application ssl started on node '[email protected]'
2018-06-20 10:41:53.216 [info] <0.33.0> Application ranch started on node '[email protected]'
2018-06-20 10:41:53.216 [info] <0.33.0> Application ranch_proxy_protocol started on node '[email protected]'
2018-06-20 10:41:53.218 [info] <0.33.0> Application cowboy started on node '[email protected]'
2018-06-20 10:41:53.218 [info] <0.33.0> Application rabbit_common started on node '[email protected]'
2018-06-20 10:41:53.225 [info] <0.33.0> Application amqp_client started on node '[email protected]'
2018-06-20 10:41:53.233 [info] <0.201.0>
Starting RabbitMQ 3.7.6 on Erlang 20.3.8
Copyright (C) 2007-2018 Pivotal Software, Inc.
Licensed under the MPL. See http://www.rabbitmq.com/

## RabbitMQ 3.7.6. Copyright (C) 2007-2018 Pivotal Software, Inc.

########## Licensed under the MPL. See http://www.rabbitmq.com/

########## Logs:

          Starting broker...

2018-06-20 10:41:53.244 [info] <0.201.0>
node : [email protected]
home dir : /var/lib/rabbitmq
config file(s) : /etc/rabbitmq/rabbitmq.conf
cookie hash : 1f/OCGDEIYGjxECOyabp7w==
log(s) :
database dir : /var/lib/rabbitmq/mnesia/[email protected]
2018-06-20 10:41:55.228 [info] <0.209.0> Memory high watermark set to 3128 MiB (3280984473 bytes) of 7822 MiB (8202461184 bytes) total
2018-06-20 10:41:55.232 [info] <0.211.0> Enabling free disk space monitoring
2018-06-20 10:41:55.232 [info] <0.211.0> Disk free limit set to 50MB
2018-06-20 10:41:55.234 [info] <0.213.0> Limiting to approx 1048476 file handles (943626 sockets)
2018-06-20 10:41:55.234 [info] <0.214.0> FHC read buffering: OFF
2018-06-20 10:41:55.234 [info] <0.214.0> FHC write buffering: ON
2018-06-20 10:41:55.235 [info] <0.201.0> Node database directory at /var/lib/rabbitmq/mnesia/[email protected] is empty. Assuming we need to join an existing cluster or initialise from scratch...
2018-06-20 10:41:55.235 [info] <0.201.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2018-06-20 10:41:55.236 [info] <0.201.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2018-06-20 10:41:55.236 [info] <0.201.0> Peer discovery backend does not support locking, falling back to randomized delay
2018-06-20 10:41:55.236 [info] <0.201.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2018-06-20 10:41:55.250 [info] <0.201.0> Failed to get nodes from k8s - 403
2018-06-20 10:41:55.251 [error] <0.200.0> CRASH REPORT Process <0.200.0> with 0 neighbours exited with reason: no case clause matching {error,"403"} in rabbit_mnesia:init_from_config/0 line 164 in application_master:init/4 line 134
2018-06-20 10:41:55.251 [info] <0.33.0> Application rabbit exited with reason: no case clause matching {error,"403"} in rabbit_mnesia:init_from_config/0 line 164
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,"403"}},[{rabbit_mnesia,init_from_config,0,[{file,"src/rabbit_mnesia.erl"},{line,164}]},{rabbit_mnesia,init_with_lock,3,[{file,"src/rabbit_mnesia.erl"},{line,144}]},{rabbit_mnesia,init,0,[{file,"src/rabbit_mnesia.erl"},{line,111}]},{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,[{file,"src/rabbit_boot_steps.erl"},{line,49}]},{rabbit_boot_steps,run_step,2,[{file,"src/rabbit_boot_steps.erl"},{line,49}]},{rabbit_boot_steps,'-run_boot_steps/1-lc$^0/1-0-',1,[{file,"src/rabbit_boot_steps.erl"},{line,26}]},{rabbit_boot_steps,run_boot_steps,1,[{file,"src/rabbit_boot_steps.erl"},{line,26}]},{rabbit,start,2,[{file,"src/rabbit.erl"},{line,801}]}]}}}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,"403"}},[{rabbit_mnesia,init_from_config,0,[{file

Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done

Minikube example is arguably incomplete

Many users have posted issues about peer discovery not happening, including on the Google group.

#40

The example file for StatefulSets does not provide a Service spec for the DNS resolution that is supposed to happen between the deployed pods.

In order to facilitate this, we must create an additional service of type ClusterIP or edit an existing one.

The additional Service looks like:

kind: Service
apiVersion: v1
metadata:
  namespace: default
  name: rabbitmq-ext
  labels:
    app: rabbitmq
spec:
  ports:
    - name: http
      protocol: TCP
      port: 15672
      targetPort: 15672
      nodePort: 31672
    - name: amqp
      protocol: TCP
      port: 5672
      targetPort: 5672
      nodePort: 30672
  selector:
    app: rabbitmq

The correct config must look like the one below.

updated_rabbitmq_statefulsets.txt
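
The key missing piece for hostname-based discovery is usually a headless Service (clusterIP: None) that the StatefulSet references via serviceName; a NodePort-style Service like the one above only covers external access. A hedged sketch of such a headless Service (names are illustrative):

kind: Service
apiVersion: v1
metadata:
  name: rabbitmq
  labels:
    app: rabbitmq
spec:
  clusterIP: None          # headless: gives each StatefulSet pod a stable DNS record
  ports:
    - name: amqp
      protocol: TCP
      port: 5672
    - name: clustering
      protocol: TCP
      port: 25672
  selector:
    app: rabbitmq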

Queue Mirrorring

Hi,
This plugin is great. But to achieve HA, how would I go about setting up a mirroring policy on all queues? Mirroring improves availability, but at a cost in performance. In many systems message loss cannot be tolerated, and mirroring solves that problem.
Can you provide any insight on how to automatically enable mirroring even when pods restart or are scaled?
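
Policies live in the cluster's internal database, so once declared they survive pod restarts and scaling; they can be set once with rabbitmqctl, via the HTTP API, or pre-loaded from a definitions file as shown earlier on this page. A sketch of the rabbitmqctl form (the policy name, pattern, and sync mode are illustrative):

rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all","ha-sync-mode":"automatic"}' --apply-to queues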

Need to add k8s RBAC configuration for the plugin to work correctly

Deploying with the example rabbitmq_statefulset.yaml, I got the following errors:

2018-01-09 11:09:44.691 [info] <0.193.0> Node database directory at /var/lib/rabbitmq/mnesia/[email protected] is empty. Assuming we need to join an existing cluster or initialise from scratch...
2018-01-09 11:09:44.691 [info] <0.193.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2018-01-09 11:09:44.691 [info] <0.193.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2018-01-09 11:09:44.691 [info] <0.193.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2018-01-09 11:09:44.710 [info] <0.193.0> Failed to get nodes from k8s - 403
2018-01-09 11:09:44.711 [error] <0.192.0> CRASH REPORT Process <0.192.0> with 0 neighbours exited with reason: no case clause matching {error,"403"} in rabbit_mnesia:init_from_config/0 line 163 in application_master:init/4 line 134
2018-01-09 11:09:44.711 [info] <0.33.0> Application rabbit exited with reason: no case clause matching {error,"403"} in rabbit_mnesia:init_from_config/0 line 163
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,"403"}},[{rabbit_mnesia,init_from_config,0,[{file
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,\"403\"}},[{rabbit_mnesia,init_from_config,0,[{file,\"src/rabbit_mnesia.erl\"},{line,163}]},{rabbit_mnesia,init_with_lock,3,[{file,\"src/rabbit_mnesia.erl\"},{line,143}]},{rabbit_mnesia,init,0,[{file,\"src/rabbit_mnesia.erl\"},{line,111}]},{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,run_step,2,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,'-run_boot_steps/1-lc$^0/1-0-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit_boot_steps,run_boot_steps,1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit,start,2,[{file,\"src/rabbit.erl\"},{line,802}]}]}}}}}"}

Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done

The problem comes from an HTTP 403 error when trying to get nodes from the API server:

Failed to get nodes from k8s - 403

Kubernetes 1.6 and above enables RBAC by default, so the correct RBAC rules need to be in place for the rabbit_peer_discovery_k8s plugin to retrieve information from the k8s API successfully.
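
A minimal sketch of the RBAC objects that typically make the 403 go away; the plugin only needs to read the endpoints of its Service, and the names, namespace, and ServiceAccount below are illustrative and must match what the StatefulSet actually uses:

kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: rabbitmq-endpoint-reader
  namespace: rabbitmq
rules:
- apiGroups: [""]
  resources: ["endpoints"]
  verbs: ["get"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: rabbitmq-endpoint-reader
  namespace: rabbitmq
subjects:
- kind: ServiceAccount
  name: rabbitmq
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rabbitmq-endpoint-reader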

In case k8s API Server goes down unexpectedly, the plugin breaks clustering and isn't capable of recovering

In a K8s cluster where RabbitMQ 3.7.1 was installed with 3 nodes and k8s peer discovery, there was a temporary failure of the API server/internal networking, which led to the following events:

2018-01-24 00:18:06.171 [info] <0.375.0> rabbit on node '[email protected]' down
2018-01-24 00:18:06.639 [info] <0.375.0> Node [email protected] is down, deleting its listeners
2018-01-24 00:18:06.641 [info] <0.375.0> node '[email protected]' down: connection_closed
2018-01-24 00:18:28.971 [info] <0.470.0> Failed to get nodes from k8s - {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},
{inet,[inet],nxdomain}]}
2018-01-24 00:18:28.971 [warning] <0.470.0> Peer discovery: removing unknown node [email protected] from the cluster
2018-01-24 00:18:28.971 [info] <0.470.0> Removing node '[email protected]' from cluster
2018-01-24 00:18:30.986 [info] <0.369.0> Node '[email protected]' was removed from the cluster, deleting its connection tracking tables...
2018-01-24 00:18:32.718 [warning] <0.18522.33> closing AMQP connection <0.18522.33> (10.244.0.25:34683 -> 10.244.1.5:5672, vhost: '/', user: 'guest'):
client unexpectedly closed TCP connection
2018-01-24 00:18:54.780 [warning] <0.879.0> closing AMQP connection <0.879.0> (10.244.0.11:40485 -> 10.244.1.5:5672, vhost: '/', user: 'guest'):
client unexpectedly closed TCP connection

The nodes were detached from the cluster and, despite the fact that the API server failure was temporary and it recovered some minutes later, they never rejoined the cluster, leaving them hanging behind the k8s Service. Other clustered services, such as Cassandra, didn't experience the same issue and managed to reconnect successfully:

INFO 00:58:44 InetAddress /10.244.3.2 is now DOWN
INFO 00:59:01 Redistributing index summaries
INFO 01:00:59 Handshaking version with /10.244.3.2
INFO 01:01:01 Handshaking version with /10.244.3.2
INFO 01:01:01 InetAddress /10.244.3.2 is now UP

The peer discovery plugin should be able to recover from such failures and restore the cluster state.

Liveness probes run too often in the example deployment

Hi,

There are probes in example:

        livenessProbe:
          exec:
            command: ["rabbitmqctl", "status"]
          initialDelaySeconds: 30
          timeoutSeconds: 10
        readinessProbe:
          exec:
            command: ["rabbitmqctl", "status"]
          initialDelaySeconds: 10
          timeoutSeconds: 10

In our cluster, these probes were executed every 10 seconds, which led to enormous CPU usage (around 70%) for a RabbitMQ cluster without any workload.
It would be nice to mention in the example that this command is heavyweight and to provide a sane periodSeconds value.
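
A cheaper probe command plus a longer period goes a long way here. A hedged sketch using rabbitmq-diagnostics ping, which only checks that the node is up and responding and is much lighter than a full status dump (available in modern 3.7.x releases; the timings are illustrative):

        livenessProbe:
          exec:
            command: ["rabbitmq-diagnostics", "ping", "-q"]
          initialDelaySeconds: 60
          periodSeconds: 60
          timeoutSeconds: 15
        readinessProbe:
          exec:
            command: ["rabbitmq-diagnostics", "ping", "-q"]
          initialDelaySeconds: 20
          periodSeconds: 60
          timeoutSeconds: 10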

cluster_formation.peer_discovery_backend aliases

This plugin's name is rabbitmq_peer_discovery_k8s but the module name that must be configured for discovery is

cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s

which confuses some users and leads to hours spent troubleshooting and a lot of frustration.

We should introduce an alias module and consider an alias shortcut supported in rabbitmq.conf so that these config values

cluster_formation.peer_discovery_backend = rabbitmq_peer_discovery_k8s
cluster_formation.peer_discovery_backend = k8s

would be accepted.
