
Comments (23)

C0DK commented on July 19, 2024

All containers seem to 'stop' as well; they stop responding. The runs we had also produce no new log entries, though I can see they still use CPU and memory after the crash, albeit substantially less than before.

C0DK commented on July 19, 2024

I might have found something; however, I'm not sure if this is the cause, as it happens at slightly different times:

Error calling event event log consumer handler: consume_new_runs_for_automatic_reexecution
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/dagster/_daemon/auto_run_reexecution/event_log_consumer.py", line 86, in run_iteration
    yield from fn(workspace_process_context, run_records)
  File "/usr/local/lib/python3.10/site-packages/dagster/_daemon/auto_run_reexecution/auto_run_reexecution.py", line 168, in consume_new_runs_for_automatic_reexecution
    for run, retry_number in filter_runs_to_should_retry(
  File "/usr/local/lib/python3.10/site-packages/dagster/_daemon/auto_run_reexecution/auto_run_reexecution.py", line 60, in filter_runs_to_should_retry
    retry_number = get_retry_number(run)
  File "/usr/local/lib/python3.10/site-packages/dagster/_daemon/auto_run_reexecution/auto_run_reexecution.py", line 49, in get_retry_number
    if len(run_group_list) >= max_retries + 1:
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
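
For what it's worth, the crash itself is just a None-vs-int comparison: max_retries is None by the time max_retries + 1 is evaluated. A rough sketch of the kind of guard that avoids the TypeError (hypothetical signature and default; not the actual Dagster code or fix):

# Hypothetical sketch only -- not Dagster's actual implementation or fix.
def get_retry_number(run_group_list, max_retries):
    # max_retries can come back as None (e.g. an unset or unparsable tag),
    # so guard it before doing arithmetic on it.
    if max_retries is None:
        max_retries = 0
    if len(run_group_list) >= max_retries + 1:
        return None  # no further automatic retries
    return len(run_group_list)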

C0DK commented on July 19, 2024

The fix in #19895 doesn't seem to be related. It crashed again today, at approximately the same time.

An unrelated bug is that we get the Could not find schedule [...] message, which says to turn off the schedule in the UI 'in the status tab'. However, the OSS version does not have a status tab, nor does the list of schedules show schedules that are no longer in the code server, making it impossible to remove. I don't think that is the cause of this bug, though.

C0DK commented on July 19, 2024

After finding #8664 I thought maybe this is due to a migration, and I realized there was a new migration - I will report back on whether this fixed it.

bargheertarje commented on July 19, 2024

The problem is still not fixed. The cycle we have is that the daemons stop giving a heartbeat after approximately a day. Our daemon reports minor warnings like those reported in

#19937

but otherwise the daemons lose their heartbeat silently in the logs until 3600 seconds have passed and the daemons report failures in the UI.

#17506

suggests raising the heartbeat tolerance, but an hour already seems like a lot?

gibsondan commented on July 19, 2024

that screenshot is showing that some of the threads are still running - is the daemon process crashing, or just hanging?

Can you share more about your deployment setup / maybe share your docker-compose file - are you launching each run in its own container with the DockerRunLauncher, and are the code servers in separate containers? If not, separating user code out into containers different from the daemon container would be the main recommendation to prevent them from interfering with each other.

C0DK commented on July 19, 2024

It might just be hanging indefinitely. We've raised log level to debug and might see something.

gibsondan commented on July 19, 2024

If it’s hanging indefinitely my advice is to try running py-spy dump on it while it is hanging: #14771

you may need to run the docker container with additional permissions for that to work: https://github.com/benfred/py-spy?tab=readme-ov-file#how-do-i-run-py-spy-in-docker

C0DK commented on July 19, 2024

Sorry for not getting back to you @gibsondan - had to both install the relevant dependencies and wait for a crash 😆

Summarized relevant observations:

  1. the exact time of the incident seems to be delayed a few minutes each day...
  2. it seems to be stuck trying to connect to the dagster postgres database
  3. no timeout or log statements are given for over an hour
  4. It is not the same daemons that crash each day.

Today's crash:
[screenshot]

The dump is in a gist here

I cannot rule out that our infrastructure changed and is causing the incident, but I cannot find anyone or anything in the network that would change around noon. Even if we conclude that that is the case, I think everyone would benefit if a timeout of sorts (and/or log statements) were added to dagster, so you at least know what is going on. We log and capture syslogs, but nothing is captured from the daemons in that timeframe, for obvious reasons.

C0DK commented on July 19, 2024

With more debugging I actually found some log statements that point towards our network setup, and found a run that had network issues with the database, where the job was terminated:

In process 20: dagster_postgres.utils.DagsterPostgresException: too many retries for DB connection

And I got an infrastructure-related error in the stacktrace; yet I am confused why the job does not always actually terminate. We also have hanging jobs, which I assumed were hanging for unrelated reasons (I assumed the client we wrote had some sort of memory leak or similar), but they might be due to the same underlying problem. I still want to reiterate my confusion that no timeout catches this after an hour.
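
As a sanity check on our side, something like this (standard library only; hostname and port are placeholders for our setup) at least fails fast instead of hanging when the database is unreachable:

# Minimal connectivity probe with a bounded timeout -- placeholder host/port.
import socket

def db_reachable(host="dagster-postgres", port=5432, timeout=5.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(db_reachable())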

C0DK commented on July 19, 2024

Luckily I had a stalled job too. This one doesn't seem to be waiting on postgres directly; instead it seems to be hanging while trying to connect various multiprocessing threads: dump

gibsondan commented on July 19, 2024

gibsondan commented on July 19, 2024

C0DK commented on July 19, 2024

@gibsondan Which of the processes would that be? Running (for instance) py-spy dump --pid 18 returns the same output.

gibsondan commented on July 19, 2024

C0DK commented on July 19, 2024

On the stalled run, I am checking pid 1:

root@c231a6118565:/opt/dagster/app# py-spy dump --pid 1
Process 1: /usr/local/bin/python /usr/local/bin/dagster api execute_run

....

Thread 18 (idle): "Thread-7 (_handle_results)"
    _handle_results (multiprocessing/pool.py:579)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)

Assuming that thread 18 is pid 18, I'd run:

root@c231a6118565:/opt/dagster/app# py-spy dump --pid 18
Process 18: /usr/local/bin/python /usr/local/bin/dagster api execute_run

which seems to have the same output throughout. I might be misunderstanding you, sorry.

I'm not sure how to figure out which thread it might be waiting on, if it's waiting for another thread.

gibsondan commented on July 19, 2024

C0DK commented on July 19, 2024

Right.

Sadly, ps isn't installed in my docker container :D and network access prevents me from installing it "hot".

But ls /proc gave me something.
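
For anyone else without ps in the image, here's a small sketch that lists candidate PIDs to feed to py-spy dump (plain /proc, Linux only; the helper is mine, not literally what I ran):

# List PIDs and their command lines straight from /proc when `ps` is missing.
import os

def list_processes():
    for pid in sorted(int(e) for e in os.listdir("/proc") if e.isdigit()):
        try:
            with open(f"/proc/{pid}/cmdline", "rb") as f:
                cmdline = f.read().replace(b"\x00", b" ").decode(errors="replace").strip()
        except OSError:
            continue  # process exited between listing and reading
        print(pid, cmdline or "<no cmdline>")

list_processes()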

There seem to be threads hanging in paramiko (which we use for SFTP).

That might be a flaw in our codebase, but it could also be due to the same underlying network issue, which our code does not handle properly either; I should fix that and add a timeout...

root@c231a6118565:/opt/dagster/app# py-spy dump --pid 20
Process 20: /usr/local/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=7, pipe_handle=11) --multiprocessing-fork
Python v3.10.9 (/usr/local/bin/python3.10)

Thread 20 (idle): "MainThread"
    wait (threading.py:320)
    read (paramiko/buffered_pipe.py:150)
    recv (paramiko/channel.py:697)
    _read_all (paramiko/sftp.py:196)
    _read_packet (paramiko/sftp.py:212)
    _read_response (paramiko/sftp_client.py:887)
    _request (paramiko/sftp_client.py:857)
    listdir_attr (paramiko/sftp_client.py:246)
    listdir (paramiko/sftp_client.py:218)
    listdir (pysftp/__init__.py:592)
    get_files_in_folder (mycodebase/resources/ftp_client.py:96)
 
...

Thread 72 (idle): "Thread-10"
    read_all (paramiko/packet.py:308)
    read_message (paramiko/packet.py:463)
    run (paramiko/transport.py:2159)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
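
For our own client, a sketch of the direction I'm thinking of for that timeout fix: plain paramiko here rather than the pysftp wrapper we actually use, and the host/credentials are placeholders. The idea is a timeout on the TCP connect plus a timeout on the SFTP channel, so a dead network path raises instead of blocking listdir() forever:

# Sketch only: plain paramiko with explicit timeouts (placeholder host/creds).
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(
    "sftp.example.com",   # placeholder
    username="user",      # placeholder
    password="secret",    # placeholder
    timeout=10,           # TCP connect timeout
    banner_timeout=10,
    auth_timeout=10,
)
sftp = client.open_sftp()
# Make blocking reads on the SFTP channel raise socket.timeout instead of hanging.
sftp.get_channel().settimeout(30)
print(sftp.listdir("."))
client.close()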

C0DK commented on July 19, 2024

The general story seems to be "We have a network issue".

If we end up finding and concluding that, it is probably not something you can fix - but a timeout on the dagster DB connection would still be neat 👍

C0DK commented on July 19, 2024

Just as an update: without actually restarting the daemons (which we've done before), we changed some network connections, and it seems the dagster daemon was able to detect the failed connection; it did not hang, but crashed.

gibsondan commented on July 19, 2024

Here's an example of postgres dagster.yaml config that adds a connection timeout parameter, which might help in situations like this. I can check with the team about the pros and cons of setting that as a default if none is set:

storage:
  postgres:
    postgres_db:
      username: my_username
      password: my_password
      hostname: my_hostname
      db_name: my_database
      port: 5432
      params:
        connect_timeout: 60 # seconds
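
If it's useful, my understanding is that anything under params gets passed through to the underlying postgres connection, so the above is roughly equivalent to this direct psycopg2 call (placeholder values, sketch only):

# Rough equivalent of what that config asks the driver for (placeholders).
import psycopg2

conn = psycopg2.connect(
    host="my_hostname",
    port=5432,
    user="my_username",
    password="my_password",
    dbname="my_database",
    connect_timeout=60,  # seconds before libpq gives up on the initial connect
)
conn.close()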

C0DK commented on July 19, 2024

Thank you! Just knowing that it's an option helps - I didn't know it was possible.

We'll set it and see what happens tomorrow.

C0DK commented on July 19, 2024

@gibsondan The timeout "solved" the problem, in that the timeout now ensures that the connection attempt is retried, after which it works.

And with that, I assume the problem is with our infrastructure. Thank you so much for the help debugging (via py-spy) and finding a solution.

I'd maybe recommend having a default connect_timeout of maybe 10 minutes (or similar), which no connection should exceed - then future users will at least know that there is an issue, rather than it simply hanging forever.
