I am currently load testing my application, which subscribes users to a MUC room after processing their payments. I have deployed the image ejabberd/ecs:18.09 as a pod in a Kubernetes cluster, with my backend set up to call the following three APIs in sequential order:
api/set_room_affiliation
api/subscribe_room
api/srg_user_add
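For reference, here is a minimal sketch of that call sequence in Python. The base URL, the example values, and the "member" affiliation are assumptions for illustration, and authentication is left out. The argument names follow my reading of the ejabberd admin API docs and should be checked against 18.09.

```python
import json
from urllib import request

# Assumed address of ejabberd's mod_http_api listener (hypothetical value).
BASE_URL = "http://localhost:5280/api"


def build_calls(user, host, room, muc_service, group):
    """Return the three (endpoint, payload) pairs in the order the backend sends them."""
    jid = f"{user}@{host}"
    return [
        # 1. Grant the user an affiliation in the room ("member" is an assumption).
        ("set_room_affiliation",
         {"name": room, "service": muc_service, "jid": jid, "affiliation": "member"}),
        # 2. Subscribe the user to the room's MUC/Sub message node.
        ("subscribe_room",
         {"user": jid, "nick": user, "room": f"{room}@{muc_service}",
          "nodes": "urn:xmpp:mucsub:nodes:messages"}),
        # 3. Add the user to a shared roster group.
        ("srg_user_add",
         {"user": user, "host": host, "group": group, "grouphost": host}),
    ]


def send_calls(calls, timeout=30):
    """POST each call in order and return (endpoint, HTTP status) pairs.

    urlopen raises urllib.error.HTTPError on 4xx/5xx responses, so a
    400 or 504 surfaces as an exception rather than a status value.
    """
    results = []
    for endpoint, payload in calls:
        req = request.Request(
            f"{BASE_URL}/{endpoint}",
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with request.urlopen(req, timeout=timeout) as resp:
            results.append((endpoint, resp.status))
    return results
```

Each payment triggers all three requests, so 100 RPS at the backend means roughly 300 API calls per second against ejabberd.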
The ejabberd pod is scheduled on a node with 16 CPU cores and 32 GB of RAM (a c5.4xlarge EC2 instance on AWS), where I also have Grafana and Prometheus set up to monitor the pod's resource consumption. During a test run at 10 RPS, the response times stayed under a minute. However, when I increased the load to 100 RPS, the majority of the response times exceeded 5 minutes, which resulted in 504 Gateway Timeout errors. I monitored the pod's resource consumption during the test run and found that it was using between 12 and 15 CPU cores and up to 5 GB of memory, which didn't seem to be overloading the node.
The ejabberd pod is also connected to a MySQL RDS r5.xlarge instance (4 CPUs and 32 GB of RAM) with a 100 GB SSD at 1000 provisioned IOPS. I suspected the database could be the bottleneck, but I checked the write IOPS and it peaked at 500 counts/sec, the RDS CPU was at 20%, and the freeable memory was at ~28 GB.
The backend also occasionally received 400 Bad Request errors from the ejabberd pod, but I didn't find any errors or warnings for these in the ejabberd error.log file. I did, however, get a few of the following errors:
2019-05-03 11:01:39.608 [error] <0.1836.1>@ejabberd_sm:route:146 failed to route packet:
#message{
id = <<>>,type = normal,lang = <<>>,
from =
#jid{
user = <<"5ccab488e1abbf0001b2dd47_2_l">>,
server = <<"muc.xmpp.example.com">>,resource = <<>>,
luser = <<"5ccab488e1abbf0001b2dd47_2_l">>,
lserver = <<"muc.xmpp.example.com">>,lresource = <<>>},
to =
#jid{
user = <<"jbvksnzus2kp6u41mekayi2h55pjga">>,
server = <<"xmpp.example.com">>,resource = <<>>,
luser = <<"jbvksnzus2kp6u41mekayi2h55pjga">>,
lserver = <<"xmpp.example.com">>,lresource = <<>>},
subject = [],body = [],thread = undefined,
sub_els =
[#ps_event{
items =
#ps_items{
xmlns = <<>>,node = <<"urn:xmpp:mucsub:nodes:messages">>,
items =
[#ps_item{
xmlns = <<>>,id = <<"6420966739914419237">>,
sub_els =
[#message{
id = <<"9782689453704005349">>,type = groupchat,lang = <<>>,
from =
#jid{
user = <<"5ccab488e1abbf0001b2dd47_2_l">>,
server = <<"muc.xmpp.example.com">>,
resource = <<"uttersystem">>,
luser = <<"5ccab488e1abbf0001b2dd47_2_l">>,
lserver = <<"muc.xmpp.example.com">>,
lresource = <<"uttersystem">>},
to =
#jid{
user = <<"jbvksnzus2kp6u41mekayi2h55pjga">>,
server = <<"xmpp.example.com">>,resource = <<>>,
luser = <<"jbvksnzus2kp6u41mekayi2h55pjga">>,
lserver = <<"xmpp.example.com">>,lresource = <<>>},
subject = [#text{lang = <<"en">>,data = <<"user:joined">>}],
body =
[#text{
lang = <<"en">>,
data =
<<"{\"displayName\":\"dEdibawitIbQPkyQmoZhsEoqNTLRwObIxpghFTaBFxbslOrzAaaqIVbtTrHCN\",\"username\":\"hn288kt338srg1240xiti7d19racvl\"}">>}],
thread = undefined,
sub_els =
[#mam_archived{
by =
#jid{
user = <<"5ccab488e1abbf0001b2dd47_2_l">>,
...
Reason = {error,{{badmatch,{error,timeout}},[{ejabberd_auth_http,make_req,5,[{file,"/home/ejabberd/.ejabberd-modules/sources/ejabberd-contrib/ejabberd_auth_http/src/ejabberd_auth_http.erl"},{line,225}]},{ejabberd_auth_http,user_exists,2,[{file,"/home/ejabberd/.ejabberd-modules/sources/ejabberd-contrib/ejabberd_auth_http/src/ejabberd_auth_http.erl"},{line,163}]},{ejabberd_auth,'-user_exists/2-fun-0-',3,[{file,"src/ejabberd_auth.erl"},{line,386}]},{lists,any,2,[{file,"lists.erl"},{line,1225}]},{ejabberd_sm,route_message,1,[{file,"src/ejabberd_sm.erl"},{line,731}]},{ejabberd_sm,route,1,[{file,"src/ejabberd_sm.erl"},{line,143}]},{ejabberd_local,route,1,[{file,"src/ejabberd_local.erl"},{line,72}]},{ejabberd_router,do_route,1,[{file,"src/ejabberd_router.erl"},{line,368}]}]}}
Are there any optimisations I can make to improve the response times? I was also thinking of installing ejabberd directly on a server instead of containerizing it, but I am not sure whether this would make a difference.