Comments (11)
Later in the same log:
INFO> health check start
INFO> node_state: INITIALIZING
INFO> register: http://hsds_head:5100/register
INFO> http_post('http://hsds_head:5100/register', {'id': 'sn-5d182', 'port': 80, 'node_type': 'sn'})
INFO> Initiating TCPConnector with limit 100 connections
INFO> http_post status: 500
WARN> POST request error for url: http://hsds_head:5100/register - status: 500
ERROR> HEAD node seems to be down.
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f69326b5dc0>
Task exception was never retrieved
future: <Task finished name='Task-3' coro=<healthCheck() done, defined at /usr/local/lib/python3.8/site-packages/hsds/basenode.py:414> exception=SystemExit(1)>
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/hsds/basenode.py", line 87, in register
rsp_json = await http_post(app, req_reg, data=body)
File "/usr/local/lib/python3.8/site-packages/hsds/util/httpUtil.py", line 157, in http_post
raise HTTPInternalServerError()
aiohttp.web_exceptions.HTTPInternalServerError: Internal Server Error
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/aiohttp/web.py", line 419, in run_app
loop.run_until_complete(_run_app(app,
File "/usr/local/lib/python3.8/asyncio/base_events.py", line 603, in run_until_complete
self.run_forever()
File "/usr/local/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
self._run_once()
File "/usr/local/lib/python3.8/asyncio/base_events.py", line 1859, in _run_once
handle._run()
File "/usr/local/lib/python3.8/asyncio/events.py", line 81, in _run
self._context.run(self._callback, *self._args)
File "/usr/local/lib/python3.8/site-packages/hsds/basenode.py", line 433, in healthCheck
await register(app)
File "/usr/local/lib/python3.8/site-packages/hsds/basenode.py", line 96, in register
sys.exit(1)
SystemExit: 1
from hsds.
Anything odd in the logs for the head_node?
from hsds.
I didn't see anything pop out to me in the head node logs. I will grab the whole thing the next time it happens. It did seem odd to me that the error in the service node log starts with the line that says that the hsds-servicenode is killed. Not sure what that means.
from hsds.
There was another service node crash. I verified that there was around 7GB of memory available on the system at the time of the crash. Here is the head node log at the time of the crash:
INFO> health check 2020-11-12T06:27:42Z, cluster_state: READY, node_count: 5
INFO> http_get('http://172.25.0.3:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.7:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.5:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.4:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.6:80/info')
INFO> http_get status: 200
INFO> node health check fail_count: 0
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> POST: /register [hsds_head:5100]
INFO> body: {'id': 'sn-bd03f', 'port': 80, 'node_type': 'sn'}
INFO> register host: 172.25.0.6, port: 59620
Error handling request
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/aiohttp/web_protocol.py", line 418, in start
resp = await task
File "/usr/local/lib/python3.8/site-packages/aiohttp/web_app.py", line 458, in _handle
resp = await handler(request)
File "/usr/local/lib/python3.8/site-packages/hsds/headnode.py", line 283, in register
"host": body['ip'],
KeyError: 'ip'
REQ> POST: /register [hsds_head:5100]
INFO> body: {'id': 'sn-f3acc', 'port': 80, 'node_type': 'sn'}
INFO> register host: 172.25.0.6, port: 59624
Error handling request
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/aiohttp/web_protocol.py", line 418, in start
resp = await task
File "/usr/local/lib/python3.8/site-packages/aiohttp/web_app.py", line 458, in _handle
resp = await handler(request)
File "/usr/local/lib/python3.8/site-packages/hsds/headnode.py", line 283, in register
"host": body['ip'],
KeyError: 'ip'
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
INFO> health check 2020-11-12T06:27:52Z, cluster_state: READY, node_count: 5
INFO> http_get('http://172.25.0.3:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.7:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.5:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.4:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.6:80/info')
INFO> http_get status: 200
WARN> unexpected node_id: sn-a0076 (expecting: sn-26be6)
INFO> node health check fail_count: 1
WARN> Fail_count > 0, Setting cluster_state from READY to INITIALIZING
REQ> POST: /register [hsds_head:5100]
INFO> body: {'id': 'sn-a0076', 'port': 80, 'node_type': 'sn'}
INFO> register host: 172.25.0.6, port: 59628
INFO> Found free node reference: {'node_number': 0, 'node_type': 'sn', 'host': None, 'port': 80, 'id': None, 'connected': '2020-11-11T20:45:34Z', 'failcount': 0, 'healthcheck': '2020-11-12T06:27:42Z'}
INFO> inactive_node_count: 0
INFO> setting cluster state to READY - was: INITIALIZING
INFO RSP> <200> (OK): /register
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
INFO> health check 2020-11-12T06:28:02Z, cluster_state: READY, node_count: 5
from hsds.
@jpswinski - are you dynamically changing the number of containers?
I get the same error if I do: docker-compose -f ${COMPOSE_FILE} scale dn=4
from hsds.
I'm not. The crashes occur randomly during normal use. For instance, it will happen in the middle of reading a dataset.
from hsds.
Ok, I think what is happening is that after the service node crash, Docker is respawning the container. Then when the container registers with the head node, there is a bug that is causing the head node to crash.
I have a fix for the headnode (4d31769) so that nodes will be able to re-register. Try this and let me know how it goes.
I don't think this will fix the problem with the service node crashing, but let's first verify the head node fix.
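For illustration only, and not the actual change in 4d31769: the head node traceback above fails on `body['ip']`, while the registration body the service node sends contains only `id`, `port` and `node_type`. One way to tolerate that, sketched below, is to fall back to the peer address the head node already logs ("register host: 172.25.0.6"). The function name `resolve_node_host` and its parameters are hypothetical, not part of HSDS.

```python
from typing import Optional


def resolve_node_host(body: dict, peer_host: Optional[str]) -> str:
    """Pick the host to record for a registering node.

    Hypothetical helper: prefer an explicit 'ip' in the request body,
    otherwise fall back to the address the request came from.
    """
    # body.get() avoids the KeyError seen in the head node log
    host = body.get("ip") or peer_host
    if not host:
        raise ValueError("cannot determine host for registering node")
    return host


# Registration body copied from the head node log above (no 'ip' key)
body = {"id": "sn-bd03f", "port": 80, "node_type": "sn"}
print(resolve_node_host(body, "172.25.0.6"))
```

With a fallback like this, a respawned container that re-registers without an `ip` field would get a normal response instead of a 500.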
from hsds.
I didn't see the head node crash. After reporting the error above, it keeps going and ultimately reports the cluster back in the READY state. For our user-facing server, I had to set the restart policy to always so that we can recover from the service node crashes automatically. This is working: the system restarts the service node after a crash and HSDS is usable again. But the service node crash is still a problem, because it stops long processing runs partway through, requiring them to start over.
from hsds.
@jpswinski - do you still have a problem with service node restarts?
from hsds.
No! It looks like the problem was the amount of memory Docker allowed for the container. If the container reaches that limit, Docker kills it and restarts it.
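A rough way to see the limit a container is actually running under (as opposed to free memory on the host) is to read the cgroup memory limit from inside the container. The sketch below is an assumption-laden example, not something HSDS provides: it checks the standard cgroup v2 and v1 file locations, and the helper names are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Standard cgroup locations as seen from inside a container
CGROUP_LIMIT_FILES = (
    "/sys/fs/cgroup/memory.max",                    # cgroup v2
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",  # cgroup v1
)


def parse_memory_limit(text: str) -> Optional[int]:
    """Parse a cgroup memory limit; return None when it is unlimited ('max')."""
    value = text.strip()
    if value == "max":  # cgroup v2 spelling for "no limit"
        return None
    # cgroup v1 reports a very large number when no limit is set
    return int(value)


def container_memory_limit() -> Optional[int]:
    """Return the memory limit in bytes, or None if unlimited or unknown."""
    for name in CGROUP_LIMIT_FILES:
        path = Path(name)
        if path.is_file():
            try:
                return parse_memory_limit(path.read_text())
            except (OSError, ValueError):
                continue
    return None


if __name__ == "__main__":
    limit = container_memory_limit()
    if limit is None:
        print("no memory limit found")
    else:
        print(f"memory limit: {limit / 2**20:.0f} MiB")
```

If the value printed is far below the host's available memory, the container can be killed even though the system itself shows plenty of free memory, which matches what was observed here.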
from hsds.
Looks like this is resolved - closing
from hsds.