
Comments (23)

C0DK commented on July 19, 2024

All containers seem to 'stop' as well; they stop responding. The runs we had also produce no new log entries, though I can see they still use CPU and memory after the crash, albeit substantially less than before.

C0DK commented on July 19, 2024

I might have found something; however, I'm not sure if this is the cause, as it happens at slightly different times:

Error calling event event log consumer handler: consume_new_runs_for_automatic_reexecution
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/dagster/_daemon/auto_run_reexecution/event_log_consumer.py", line 86, in run_iteration
    yield from fn(workspace_process_context, run_records)
  File "/usr/local/lib/python3.10/site-packages/dagster/_daemon/auto_run_reexecution/auto_run_reexecution.py", line 168, in consume_new_runs_for_automatic_reexecution
    for run, retry_number in filter_runs_to_should_retry(
  File "/usr/local/lib/python3.10/site-packages/dagster/_daemon/auto_run_reexecution/auto_run_reexecution.py", line 60, in filter_runs_to_should_retry
    retry_number = get_retry_number(run)
  File "/usr/local/lib/python3.10/site-packages/dagster/_daemon/auto_run_reexecution/auto_run_reexecution.py", line 49, in get_retry_number
    if len(run_group_list) >= max_retries + 1:
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
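
For what it's worth, the crash itself is just a None-vs-int comparison: max_retries is None by the time max_retries + 1 is evaluated. A rough sketch of the kind of guard that avoids the TypeError (hypothetical signature and default; not the actual Dagster code or fix):

# Hypothetical sketch only -- not Dagster's actual implementation or fix.
def get_retry_number(run_group_list, max_retries):
    # max_retries can come back as None (e.g. an unset or unparsable tag),
    # so guard it before doing arithmetic on it.
    if max_retries is None:
        max_retries = 0
    if len(run_group_list) >= max_retries + 1:
        return None  # no further automatic retries
    return len(run_group_list)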

C0DK commented on July 19, 2024

The fix in #19895 doesn't seem to be related. It crashed again today, at approximately the same time.

An unrelated bug is that we get the Could not find schedule [...] message, which says to turn off the schedule in the UI 'in the status tab'. However, the OSS version does not have a status tab, nor does the list of schedules show schedules that are no longer in the code server, making it impossible to remove. I don't think that is the cause of this bug, though.

C0DK commented on July 19, 2024

After finding #8664 I thought maybe this is due to a migration, and I realized there was a new migration - I will report back on whether this fixed it.

bargheertarje commented on July 19, 2024

The problem is still not fixed. The cycle we have is that the daemons stop giving a heartbeat after approximately a day. Our daemon reports minor warnings like those reported in

#19937

but otherwise the daemons lose their heartbeat silently in the logs until 3600 seconds have passed and the daemons report failures in the UI.

#17506

suggests raising the heartbeat tolerance, but an hour already seems like a lot?

gibsondan commented on July 19, 2024

that screenshot is showing that some of the threads are still running - is the daemon process crashing, or just hanging?

Can you share more about your deployment setup / maybe share your docker-compose file - are you launching each run in its own container with the DockerRunLauncher, and are the code servers in separate containers? If not, separating user code out into containers different from the daemon container would be the main recommendation to prevent them from interfering with each other.

C0DK commented on July 19, 2024

It might just be hanging indefinitely. We've raised log level to debug and might see something.

gibsondan commented on July 19, 2024

If it’s hanging indefinitely my advice is to try running py-spy dump on it while it is hanging: #14771

you may need to run the docker container with additional permissions for that to work: https://github.com/benfred/py-spy?tab=readme-ov-file#how-do-i-run-py-spy-in-docker

C0DK commented on July 19, 2024

Sorry for not getting back to you @gibsondan - had to both install the relevant dependencies and wait for a crash 😆

Summarized relevant observations:

  1. the exact time of the incident seems to be delayed a few minutes each day...
  2. it seems to be stuck trying to connect to the dagster postgres database
  3. no timeout or log statements are given for over an hour
  4. It is not the same daemons that crash each day.

Today's crash:
[screenshot]

The dump is in a gist here

I cannot rule out that our infrastructure changed and is causing the incident, but I cannot find anyone or anything in the network that would change around noon. Even if we conclude that that is the case, I think everyone would benefit if a timeout of sorts (and/or log statements) were added to dagster, so you at least know what is going on. We log and capture syslogs, but nothing is captured from the daemons in that timeframe, for obvious reasons.

C0DK commented on July 19, 2024

With more debugging I actually found some log statements that point towards our network setup, and found a run that had network issues with the database, where the job was terminated:

In process 20: dagster_postgres.utils.DagsterPostgresException: too many retries for DB connection

And I got an infrastructure-related error in the stacktrace; yet I am confused why the job does not always actually terminate. We also have hanging jobs, which I assumed were hanging for unrelated reasons (I assumed the client we wrote had some sort of memory leak or similar), but they might be due to the same underlying problem. I still want to reiterate my confusion that no timeout catches this after an hour.
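
As a sanity check on our side, something like this (standard library only; hostname and port are placeholders for our setup) at least fails fast instead of hanging when the database is unreachable:

# Minimal connectivity probe with a bounded timeout -- placeholder host/port.
import socket

def db_reachable(host="dagster-postgres", port=5432, timeout=5.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(db_reachable())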

C0DK commented on July 19, 2024

Luckily I had a stalled job too. This one doesn't seem to be waiting on postgres directly; instead it seems to be hanging while trying to connect various multiprocessing threads: dump

gibsondan commented on July 19, 2024

gibsondan commented on July 19, 2024

C0DK commented on July 19, 2024

@gibsondan Which of the processes would that be? Running (for instance) py-spy dump --pid 18 returns the same output.

gibsondan commented on July 19, 2024

C0DK commented on July 19, 2024

On the stalled run, I am checking pid 1:

root@c231a6118565:/opt/dagster/app# py-spy dump --pid 1
Process 1: /usr/local/bin/python /usr/local/bin/dagster api execute_run

....

Thread 18 (idle): "Thread-7 (_handle_results)"
    _handle_results (multiprocessing/pool.py:579)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)

Assuming that thread 18 is pid 18, I'd run:

root@c231a6118565:/opt/dagster/app# py-spy dump --pid 18
Process 18: /usr/local/bin/python /usr/local/bin/dagster api execute_run

which seems to have the same output throughout. I might be misunderstanding you, sorry.

I'm not sure how to figure out which thread it might be waiting on, if it's waiting for another thread.

gibsondan commented on July 19, 2024

C0DK commented on July 19, 2024

Right.

Sadly, ps isn't installed in my docker container :D and network access prevents me from installing it "hot".

But ls /proc gave me something.
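
For anyone else without ps in the image, here's a small sketch that lists candidate PIDs to feed to py-spy dump (plain /proc, Linux only; the helper is mine, not literally what I ran):

# List PIDs and their command lines straight from /proc when `ps` is missing.
import os

def list_processes():
    for pid in sorted(int(e) for e in os.listdir("/proc") if e.isdigit()):
        try:
            with open(f"/proc/{pid}/cmdline", "rb") as f:
                cmdline = f.read().replace(b"\x00", b" ").decode(errors="replace").strip()
        except OSError:
            continue  # process exited between listing and reading
        print(pid, cmdline or "<no cmdline>")

list_processes()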

There seem to be threads hanging in paramiko (which we use for SFTP).

That might be a flaw in our codebase, but it could also be due to the same underlying network issue, which our code does not handle properly either; I should fix that and add a timeout...

root@c231a6118565:/opt/dagster/app# py-spy dump --pid 20
Process 20: /usr/local/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=7, pipe_handle=11) --multiprocessing-fork
Python v3.10.9 (/usr/local/bin/python3.10)

Thread 20 (idle): "MainThread"
    wait (threading.py:320)
    read (paramiko/buffered_pipe.py:150)
    recv (paramiko/channel.py:697)
    _read_all (paramiko/sftp.py:196)
    _read_packet (paramiko/sftp.py:212)
    _read_response (paramiko/sftp_client.py:887)
    _request (paramiko/sftp_client.py:857)
    listdir_attr (paramiko/sftp_client.py:246)
    listdir (paramiko/sftp_client.py:218)
    listdir (pysftp/__init__.py:592)
    get_files_in_folder (mycodebase/resources/ftp_client.py:96)
 
...

Thread 72 (idle): "Thread-10"
    read_all (paramiko/packet.py:308)
    read_message (paramiko/packet.py:463)
    run (paramiko/transport.py:2159)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
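
For our own client, a sketch of the direction I'm thinking of for that timeout fix: plain paramiko here rather than the pysftp wrapper we actually use, and the host/credentials are placeholders. The idea is a timeout on the TCP connect plus a timeout on the SFTP channel, so a dead network path raises instead of blocking listdir() forever:

# Sketch only: plain paramiko with explicit timeouts (placeholder host/creds).
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(
    "sftp.example.com",   # placeholder
    username="user",      # placeholder
    password="secret",    # placeholder
    timeout=10,           # TCP connect timeout
    banner_timeout=10,
    auth_timeout=10,
)
sftp = client.open_sftp()
# Make blocking reads on the SFTP channel raise socket.timeout instead of hanging.
sftp.get_channel().settimeout(30)
print(sftp.listdir("."))
client.close()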

C0DK commented on July 19, 2024

The general story seems to be "We have a network issue".

If we end up finding and concluding that, it is probably not something you can fix - but a timeout on the dagster DB connection would still be neat 👍

C0DK commented on July 19, 2024

Just as an update: without actually restarting the daemons (which we've done before), we changed some network connections, and it seems the dagster daemon was able to detect the failed connection; it did not hang, but crashed.

gibsondan commented on July 19, 2024

Here's an example of postgres dagster.yaml config that adds a connection timeout parameter, which might help in situations like this. I can check with the team about the pros and cons of setting that as a default if none is set:

storage:
  postgres:
    postgres_db:
      username: my_username
      password: my_password
      hostname: my_hostname
      db_name: my_database
      port: 5432
      params:
        connect_timeout: 60 # seconds
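
If it's useful, my understanding is that anything under params gets passed through to the underlying postgres connection, so the above is roughly equivalent to this direct psycopg2 call (placeholder values, sketch only):

# Rough equivalent of what that config asks the driver for (placeholders).
import psycopg2

conn = psycopg2.connect(
    host="my_hostname",
    port=5432,
    user="my_username",
    password="my_password",
    dbname="my_database",
    connect_timeout=60,  # seconds before libpq gives up on the initial connect
)
conn.close()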

C0DK commented on July 19, 2024

Thank you! Just knowing that it's an option helps - I didn't know it was possible.

We'll set it and see what happens tomorrow.

C0DK commented on July 19, 2024

@gibsondan The timeout "solved" the problem, in that the timeout now ensures that the connection attempt is retried, after which it works.

And with that, I assume the problem is with our infrastructure. Thank you so much for the help debugging (via py-spy) and finding a solution.

I'd maybe recommend having a default connect_timeout of maybe 10 minutes (or similar), which no connection should exceed - then future users will at least know that there is an issue, rather than it simply hanging forever.
