
Comments (11)

jpswinski commented on May 16, 2024

Later in the same log:

INFO> health check start
INFO> node_state: INITIALIZING
INFO> register: http://hsds_head:5100/register
INFO> http_post('http://hsds_head:5100/register', {'id': 'sn-5d182', 'port': 80, 'node_type': 'sn'})
INFO> Initiating TCPConnector with limit 100 connections
INFO> http_post status: 500
WARN> POST request error for url: http://hsds_head:5100/register - status: 500
ERROR> HEAD node seems to be down.
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f69326b5dc0>
Task exception was never retrieved
future: <Task finished name='Task-3' coro=<healthCheck() done, defined at /usr/local/lib/python3.8/site-packages/hsds/basenode.py:414> exception=SystemExit(1)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/hsds/basenode.py", line 87, in register
    rsp_json = await http_post(app, req_reg, data=body)
  File "/usr/local/lib/python3.8/site-packages/hsds/util/httpUtil.py", line 157, in http_post
    raise HTTPInternalServerError()
aiohttp.web_exceptions.HTTPInternalServerError: Internal Server Error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web.py", line 419, in run_app
    loop.run_until_complete(_run_app(app,
  File "/usr/local/lib/python3.8/asyncio/base_events.py", line 603, in run_until_complete
    self.run_forever()
  File "/usr/local/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
    self._run_once()
  File "/usr/local/lib/python3.8/asyncio/base_events.py", line 1859, in _run_once
    handle._run()
  File "/usr/local/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.8/site-packages/hsds/basenode.py", line 433, in healthCheck
    await register(app)
  File "/usr/local/lib/python3.8/site-packages/hsds/basenode.py", line 96, in register
    sys.exit(1)
SystemExit: 1 

from hsds.

jreadey commented on May 16, 2024

Anything odd in the logs for the head_node?


jpswinski commented on May 16, 2024

I didn't see anything pop out to me in the head node logs. I will grab the whole thing the next time it happens. It did seem odd to me that the error in the service node log starts with the line that says that the hsds-servicenode is killed. Not sure what that means.


jpswinski commented on May 16, 2024

There was another service node crash. I verified that there was around 7GB of memory available on the system at the time of the crash. Here is the head node log at the time of the crash:

INFO> health check 2020-11-12T06:27:42Z, cluster_state: READY, node_count: 5
INFO> http_get('http://172.25.0.3:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.7:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.5:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.4:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.6:80/info')
INFO> http_get status: 200
INFO> node health check fail_count: 0
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> POST: /register [hsds_head:5100]
INFO> body: {'id': 'sn-bd03f', 'port': 80, 'node_type': 'sn'}
INFO> register host: 172.25.0.6, port: 59620
Error handling request
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_protocol.py", line 418, in start
    resp = await task
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_app.py", line 458, in _handle
    resp = await handler(request)
  File "/usr/local/lib/python3.8/site-packages/hsds/headnode.py", line 283, in register
    "host": body['ip'],
KeyError: 'ip'
REQ> POST: /register [hsds_head:5100]
INFO> body: {'id': 'sn-f3acc', 'port': 80, 'node_type': 'sn'}
INFO> register host: 172.25.0.6, port: 59624
Error handling request
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_protocol.py", line 418, in start
    resp = await task
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_app.py", line 458, in _handle
    resp = await handler(request)
  File "/usr/local/lib/python3.8/site-packages/hsds/headnode.py", line 283, in register
    "host": body['ip'],
KeyError: 'ip'
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
INFO> health check 2020-11-12T06:27:52Z, cluster_state: READY, node_count: 5
INFO> http_get('http://172.25.0.3:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.7:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.5:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.4:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.6:80/info')
INFO> http_get status: 200
WARN> unexpected node_id: sn-a0076 (expecting: sn-26be6)
INFO> node health check fail_count: 1
WARN> Fail_count > 0, Setting cluster_state from READY to INITIALIZING
REQ> POST: /register [hsds_head:5100]
INFO> body: {'id': 'sn-a0076', 'port': 80, 'node_type': 'sn'}
INFO> register host: 172.25.0.6, port: 59628
INFO> Found free node reference: {'node_number': 0, 'node_type': 'sn', 'host': None, 'port': 80, 'id': None, 'connected': '2020-11-11T20:45:34Z', 'failcount': 0, 'healthcheck': '2020-11-12T06:27:42Z'}
INFO> inactive_node_count: 0
INFO> setting cluster state to READY - was: INITIALIZING
INFO RSP> <200> (OK): /register
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
INFO> health check 2020-11-12T06:28:02Z, cluster_state: READY, node_count: 5


jreadey commented on May 16, 2024

@jpswinski - are you dynamically changing the number of containers?

I get the same error if I do: docker-compose -f ${COMPOSE_FILE} scale dn=4


jpswinski commented on May 16, 2024

I'm not. The crashes occur randomly during normal use. For instance, it will happen in the middle of reading a dataset.


jreadey commented on May 16, 2024

Ok, I think what is happening is that after the service node crash, docker is respawning the container. Then when the container registers with the head node, there is a bug that is causing the head node to crash.

I have a fix for the headnode (4d31769) so that nodes will be able to re-register. Try this and let me know how it goes.

I don't think this will fix the problem with the service node crashing, but let's first verify the head node fix.
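For reference, the head node log above fails on `body['ip']` even though the register request body only carries `id`, `port`, and `node_type`, and the head node has already logged the peer address (`register host: 172.25.0.6`). A minimal sketch of a tolerant lookup follows. This is not the actual change in 4d31769; the function names `resolve_node_host` and `build_node_entry` are hypothetical and only illustrate falling back to the peer address when `ip` is missing.

```python
from typing import Any, Dict, Optional


def resolve_node_host(body: Dict[str, Any], peer_host: Optional[str]) -> str:
    """Pick the host to record for a registering node.

    Prefers an explicit 'ip' in the request body and falls back to the
    address the request came from (hypothetical helper, not HSDS code).
    """
    host = body.get("ip") or peer_host
    if not host:
        raise ValueError("register request has no 'ip' and no peer address")
    return host


def build_node_entry(body: Dict[str, Any], peer_host: Optional[str]) -> Dict[str, Any]:
    """Validate a register body and build the entry to store for the node."""
    missing = [key for key in ("id", "port", "node_type") if key not in body]
    if missing:
        # Report a clear error instead of letting a KeyError escape the handler.
        raise ValueError(f"register request missing keys: {missing}")
    return {
        "id": body["id"],
        "node_type": body["node_type"],
        "host": resolve_node_host(body, peer_host),
        "port": body["port"],
    }


# Body taken from the head node log above; the peer address is what the log
# reported as the register host.
entry = build_node_entry({"id": "sn-bd03f", "port": 80, "node_type": "sn"}, "172.25.0.6")
```

With this shape, a respawned container that omits `ip` is recorded under the address it connected from rather than triggering a 500 response.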


jpswinski commented on May 16, 2024

I didn't see the head node crash. After reporting the error above, it keeps going and ultimately reports the cluster back in the READY state. For our user-facing server, I had to set the restart policy to always so that we can recover from service node crashes automatically. This is working: the system restarts the service node after a crash and HSDS is usable again. But the service node crash is still a problem, because it stops long processing runs right in the middle, requiring them to start over.


jreadey commented on May 16, 2024

@jpswinski - do you still have a problem with service node restarts?


jpswinski commented on May 16, 2024

No! It looks like the problem was the amount of memory Docker allowed for the container. If the container reaches that limit, Docker kills it and restarts it.
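Free memory on the host (the roughly 7GB mentioned earlier) does not show how close a container is to its own limit. A minimal sketch of checking that from inside the container follows, assuming the standard Linux cgroup memory files are mounted at their usual paths (cgroup v2 first, then cgroup v1). The helper names are hypothetical and not part of HSDS, and the `2**60` cutoff for treating a cgroup v1 value as "unlimited" is an assumption.

```python
from pathlib import Path
from typing import Optional

# (limit file, usage file) pairs: cgroup v2 first, then cgroup v1.
CGROUP_FILES = [
    (Path("/sys/fs/cgroup/memory.max"),
     Path("/sys/fs/cgroup/memory.current")),
    (Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),
     Path("/sys/fs/cgroup/memory/memory.usage_in_bytes")),
]


def parse_limit(text: str) -> Optional[int]:
    """Return the memory limit in bytes, or None if no limit is set."""
    text = text.strip()
    if text == "max":  # cgroup v2 spelling of "no limit"
        return None
    value = int(text)
    # cgroup v1 reports a very large number when unlimited; the cutoff
    # used here is an assumption, not a documented constant.
    return None if value >= 2**60 else value


def memory_headroom(limit_text: str, usage_text: str) -> Optional[float]:
    """Fraction of the limit still unused, or None if there is no limit."""
    limit = parse_limit(limit_text)
    if limit is None:
        return None
    return 1.0 - int(usage_text.strip()) / limit


def container_memory_headroom() -> Optional[float]:
    """Read the cgroup files if present; None when nothing can be determined."""
    for limit_path, usage_path in CGROUP_FILES:
        if limit_path.exists() and usage_path.exists():
            return memory_headroom(limit_path.read_text(), usage_path.read_text())
    return None
```

A value near zero from `container_memory_headroom()` would suggest the container is about to hit its limit, even when the host itself still has plenty of free memory.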


jreadey commented on May 16, 2024

Looks like this is resolved - closing
