The HSDS service node (started from the runall.sh running in AWS), has crashed

Later in the same log: <div class="snippet-clipboard-content notranslate position-

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

HSDS service node crash about hsds HOT 11 CLOSED

hdfgroup commented on May 16, 2024

HSDS service node crash

from hsds.

Comments (11)

jpswinski commented on May 16, 2024

Later in the same log:

INFO> health check start
INFO> node_state: INITIALIZING
INFO> register: http://hsds_head:5100/register
INFO> http_post('http://hsds_head:5100/register', {'id': 'sn-5d182', 'port': 80, 'node_type': 'sn'})
INFO> Initiating TCPConnector with limit 100 connections
INFO> http_post status: 500
WARN> POST request error for url: http://hsds_head:5100/register - status: 500
ERROR> HEAD node seems to be down.
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f69326b5dc0>
Task exception was never retrieved
future: <Task finished name='Task-3' coro=<healthCheck() done, defined at /usr/local/lib/python3.8/site-packages/hsds/basenode.py:414> exception=SystemExit(1)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/hsds/basenode.py", line 87, in register
    rsp_json = await http_post(app, req_reg, data=body)
  File "/usr/local/lib/python3.8/site-packages/hsds/util/httpUtil.py", line 157, in http_post
    raise HTTPInternalServerError()
aiohttp.web_exceptions.HTTPInternalServerError: Internal Server ErrorDuring handling of the above exception, another exception occurred:Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web.py", line 419, in run_app
    loop.run_until_complete(_run_app(app,
  File "/usr/local/lib/python3.8/asyncio/base_events.py", line 603, in run_until_complete
    self.run_forever()
  File "/usr/local/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
    self._run_once()
  File "/usr/local/lib/python3.8/asyncio/base_events.py", line 1859, in _run_once
    handle._run()
  File "/usr/local/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.8/site-packages/hsds/basenode.py", line 433, in healthCheck
    await register(app)
  File "/usr/local/lib/python3.8/site-packages/hsds/basenode.py", line 96, in register
    sys.exit(1)
SystemExit: 1

from hsds.

jreadey commented on May 16, 2024

Anything odd in the logs for the head_node?

from hsds.

jpswinski commented on May 16, 2024

I didn't see anything pop out to me in the head node logs. I will grab the whole thing the next time it happens. It did seem odd to me that the error in the service node log starts with the line that says that the hsds-servicenode is killed. Not sure what that means.

from hsds.

jpswinski commented on May 16, 2024

There was another service node crash. I verified that there was around 7GB of memory available on the system at the time of the crash. Here is the head node log at the time of the crash:

INFO> health check 2020-11-12T06:27:42Z, cluster_state: READY, node_count: 5
INFO> http_get('http://172.25.0.3:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.7:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.5:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.4:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.6:80/info')
INFO> http_get status: 200
INFO> node health check fail_count: 0
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> POST: /register [hsds_head:5100]
INFO> body: {'id': 'sn-bd03f', 'port': 80, 'node_type': 'sn'}
INFO> register host: 172.25.0.6, port: 59620
Error handling request
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_protocol.py", line 418, in start
    resp = await task
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_app.py", line 458, in _handle
    resp = await handler(request)
  File "/usr/local/lib/python3.8/site-packages/hsds/headnode.py", line 283, in register
    "host": body['ip'],
KeyError: 'ip'
REQ> POST: /register [hsds_head:5100]
INFO> body: {'id': 'sn-f3acc', 'port': 80, 'node_type': 'sn'}
INFO> register host: 172.25.0.6, port: 59624
Error handling request
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_protocol.py", line 418, in start
    resp = await task
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_app.py", line 458, in _handle
    resp = await handler(request)
  File "/usr/local/lib/python3.8/site-packages/hsds/headnode.py", line 283, in register
    "host": body['ip'],
KeyError: 'ip'
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
INFO> health check 2020-11-12T06:27:52Z, cluster_state: READY, node_count: 5
INFO> http_get('http://172.25.0.3:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.7:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.5:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.4:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.6:80/info')
INFO> http_get status: 200
WARN> unexpected node_id: sn-a0076 (expecting: sn-26be6)
INFO> node health check fail_count: 1
WARN> Fail_count > 0, Setting cluster_state from READY to INITIALIZING
REQ> POST: /register [hsds_head:5100]
INFO> body: {'id': 'sn-a0076', 'port': 80, 'node_type': 'sn'}
INFO> register host: 172.25.0.6, port: 59628
INFO> Found free node reference: {'node_number': 0, 'node_type': 'sn', 'host': None, 'port': 80, 'id': None, 'connected': '2020-11-11T20:45:34Z', 'failcount': 0, 'healthcheck': '2020-11-12T06:27:42Z'}
INFO> inactive_node_count: 0
INFO> setting cluster state to READY - was: INITIALIZING
INFO RSP> <200> (OK): /register
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
INFO> health check 2020-11-12T06:28:02Z, cluster_state: READY, node_count: 5

from hsds.

jreadey commented on May 16, 2024

@jpswinski - are you dynamically changing the number of containers?

I get the same error if I do: docker-compose -f ${COMPOSE_FILE} scale dn=4

from hsds.

jpswinski commented on May 16, 2024

I'm not. The crashes occur randomly during normal use. For instance, it will happen in the middle of reading a dataset.

from hsds.

jreadey commented on May 16, 2024

Ok, I think what is happening is that after the service node crash, docker is respawning the container. Then when the container registers with the head node, there is a big that is causing the head node to crash.

I have a fix for the headnode (4d31769) so that nodes will be able to re-register. Try this and let me know how it goes.

I don't think this will fix the problem with the service node crashing, but let's first verify the head node fix.

from hsds.

jpswinski commented on May 16, 2024

I didn't see the head node crash - after reporting the error above, it keeps going and ultimately reports the cluster back in the READY state. For our user facing server, I had to set the restart policy to always so that we can recover from the service node crashes automatically. This is working, and the system restarts the service node after a crash and HSDS is useable again. But the server node crash is still a problem, because it stopping long processing runs right in the middle of processing; requiring them to start over.

from hsds.

jreadey commented on May 16, 2024

@jpswinski - do you still have a problem with service node restarts?

from hsds.

jpswinski commented on May 16, 2024

No! It looks like the problem was the amount of memory docker allowed for the container. If it reaches that limit, it will kill the container and restart it.

from hsds.

jreadey commented on May 16, 2024

Looks like this is resolved - closing

from hsds.

HSDS service node crash about hsds HOT 11 CLOSED

Comments (11)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent