Comments (23)
All containers seem to 'stop' as well; everything stops responding... The runs we had also produce no new log entries after the crash, although I can see they still use CPU and MEM, just substantially less than before.
from dagster.
I might have found something, but I'm not sure it's the cause, as it happens at slightly different times:
Error calling event log consumer handler: consume_new_runs_for_automatic_reexecution
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/dagster/_daemon/auto_run_reexecution/event_log_consumer.py", line 86, in run_iteration
    yield from fn(workspace_process_context, run_records)
  File "/usr/local/lib/python3.10/site-packages/dagster/_daemon/auto_run_reexecution/auto_run_reexecution.py", line 168, in consume_new_runs_for_automatic_reexecution
    for run, retry_number in filter_runs_to_should_retry(
  File "/usr/local/lib/python3.10/site-packages/dagster/_daemon/auto_run_reexecution/auto_run_reexecution.py", line 60, in filter_runs_to_should_retry
    retry_number = get_retry_number(run)
  File "/usr/local/lib/python3.10/site-packages/dagster/_daemon/auto_run_reexecution/auto_run_reexecution.py", line 49, in get_retry_number
    if len(run_group_list) >= max_retries + 1:
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
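The TypeError says `max_retries` arrived as `None` at the comparison, presumably parsed from a missing or malformed run tag. A minimal sketch of the defensive pattern (names and default are illustrative, not Dagster's actual code):

```python
DEFAULT_MAX_RETRIES = 0  # illustrative default, not Dagster's actual value

def get_max_retries(tag_value):
    # A missing or malformed "max retries" run tag yields None; coerce it
    # to a default before it reaches arithmetic like `max_retries + 1`.
    if tag_value is None:
        return DEFAULT_MAX_RETRIES
    try:
        return int(tag_value)
    except (TypeError, ValueError):
        return DEFAULT_MAX_RETRIES

print(get_max_retries(None))  # 0, instead of raising TypeError downstream
print(get_max_retries("3"))   # 3
```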
The fix #19895 doesn't seem to be related. It crashed again today, at approximately the same time.
An unrelated bug is that we get the Could not find schedule [...]
message, which says to turn off the schedule in the UI 'in the status tab'. However, the OSS version does not have a status tab, nor does the list of schedules show schedules that are no longer on the code server, making it impossible to remove. I don't think that is the cause of this bug, though.
After finding #8664 I thought maybe this is due to a migration, and I realized there was a new migration - will report back if this fixes it.
The problem is still not fixed. The cycle we have is that the daemons stop giving a heartbeat after approximately a day. Our daemon reports minor warnings like the ones reported in
but otherwise the daemons lose their heartbeat silently; nothing appears in the logs until 3600 seconds have passed and the daemons report failures in the UI.
It is suggested to raise the heartbeat tolerance, but an hour already seems like a lot?
that screenshot is showing that some of the threads are still running - is the daemon process crashing, or just hanging?
Can you share more about your deployment setup / maybe share your docker-compose file - are you launching each run in its own container with the DockerRunLauncher, and are the code servers in separate containers? If not, separating user code out into different containers from the daemon container would be the main recommendation to prevent them from interfering with each other.
It might just be hanging indefinitely. We've raised log level to debug and might see something.
If it’s hanging indefinitely my advice is to try running py-spy dump on it while it is hanging: #14771
you may need to run the docker container with additional permissions for that to work: https://github.com/benfred/py-spy?tab=readme-ov-file#how-do-i-run-py-spy-in-docker
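Per the linked py-spy docs, the container needs the SYS_PTRACE capability for py-spy to attach. A docker-compose fragment along these lines (the service name is illustrative):

```yaml
services:
  dagster-daemon:
    # SYS_PTRACE lets py-spy attach to processes inside the container
    cap_add:
      - SYS_PTRACE
```

With plain `docker run`, the equivalent is `--cap-add SYS_PTRACE`.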
Sorry for not getting back to you @gibsondan - had to both install the relevant dependencies and wait for a crash 😆
Summarized relevant observations:
- The exact time of the incident seems to be delayed a few minutes each day...
- It seems to be stuck trying to connect to the dagster postgres database.
- No timeout or log statements are given for over an hour.
- It is not the same daemons that crash each day.
The dump is in a gist here
I cannot rule out that a change in our infrastructure created the incident, but I cannot find anyone or anything in the network that would change around noon. Even if we conclude that is the case, I think everyone would benefit if a timeout of sorts and/or log statements were added to dagster, so you at least know what is happening. We log and capture syslogs, but nothing is captured from the daemons in that timeframe, for obvious reasons.
With more debugging I actually found some log statements that point towards our network setup, and found a run that had network issues with the database, where the job was terminated:
In process 20: dagster_postgres.utils.DagsterPostgresException: too many retries for DB connection
The stacktrace showed an infrastructure-related error; yet I am confused why the job doesn't always terminate like this. We also have hanging jobs, which I assumed was for unrelated reasons (I assumed the client we wrote had some sort of memory leak or similar), but they might be due to the same underlying problem... I still want to reiterate my confusion that no timeout catches this after an hour.
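On why nothing catches this: a blocking socket read in Python waits forever by default, so if the peer silently disappears (a dropped connection with no RST packet), the call never returns and nothing is logged. A minimal stdlib illustration of the difference a read timeout makes:

```python
import socket

# Two connected sockets; nothing is ever written to `a`,
# mimicking a peer that has silently gone away.
a, b = socket.socketpair()
timed_out = False
b.settimeout(0.5)  # without this, recv() would block forever
try:
    b.recv(1)
except socket.timeout:
    timed_out = True
    print("read timed out after 0.5s")
```

Without the `settimeout` call, `recv` here blocks indefinitely, which matches the silent hang seen in the daemon.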
Luckily we had a stalled job too. This one doesn't seem to be waiting on postgres directly; instead it seems to be hanging trying to connect various multiprocessing threads: dump
@gibsondan which of the processes would that be? Running (for instance) py-spy dump --pid 18
returns the same output.
On the stalled run, I am checking pid 1:
root@c231a6118565:/opt/dagster/app# py-spy dump --pid 1
Process 1: /usr/local/bin/python /usr/local/bin/dagster api execute_run
....
Thread 18 (idle): "Thread-7 (_handle_results)"
_handle_results (multiprocessing/pool.py:579)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Assuming that thread 18 is pid 18, I'd run:
root@c231a6118565:/opt/dagster/app# py-spy dump --pid 18
Process 18: /usr/local/bin/python /usr/local/bin/dagster api execute_run
which seems to have the same output throughout. I might be misunderstanding you, sorry.
I'm not sure how to figure out which thread it might be waiting on, if it's waiting for another thread.
Right.
Sadly ps
isn't installed in my docker container :D and network access denies my ability to install it "hot",
but ls /proc
gave me something.
There seem to be threads hanging on paramiko (which we use for FTP).
That might be a flaw in our codebase, but it could be due to the same underlying network issue, which our code also does not handle properly; I should fix that and add a timeout...
root@c231a6118565:/opt/dagster/app# py-spy dump --pid 20
Process 20: /usr/local/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=7, pipe_handle=11) --multiprocessing-fork
Python v3.10.9 (/usr/local/bin/python3.10)
Thread 20 (idle): "MainThread"
wait (threading.py:320)
read (paramiko/buffered_pipe.py:150)
recv (paramiko/channel.py:697)
_read_all (paramiko/sftp.py:196)
_read_packet (paramiko/sftp.py:212)
_read_response (paramiko/sftp_client.py:887)
_request (paramiko/sftp_client.py:857)
listdir_attr (paramiko/sftp_client.py:246)
listdir (paramiko/sftp_client.py:218)
listdir (pysftp/__init__.py:592)
get_files_in_folder (mycodebase/resources/ftp_client.py:96)
...
Thread 72 (idle): "Thread-10"
read_all (paramiko/packet.py:308)
read_message (paramiko/packet.py:463)
run (paramiko/transport.py:2159)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
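The dump shows the main thread blocked inside a paramiko SFTP read with no timeout. Two possible mitigations (both assumptions on my part, not from the thread): paramiko channels accept a socket-level timeout via something like `sftp.get_channel().settimeout(30.0)`, and more generically any blocking call can be bounded with a worker thread. A stdlib sketch of the latter, where the sleeping function stands in for an SFTP call on a half-open connection:

```python
import concurrent.futures
import time

def listdir_on_dead_connection():
    # Stand-in for a paramiko SFTP call that blocks on a dead connection.
    time.sleep(1)
    return []

pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = pool.submit(listdir_on_dead_connection)
timed_out = False
try:
    files = future.result(timeout=0.1)  # bound the wait instead of blocking forever
except concurrent.futures.TimeoutError:
    timed_out = True
    print("FTP listing timed out")
pool.shutdown(wait=False)
```

Note the caveat: the worker thread keeps running in the background (Python can't kill it), so this pattern surfaces the hang without reclaiming the thread; a protocol-level timeout is the cleaner fix.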
The general story seems to be "We have a network issue".
That is probably not something you can fix - if we end up concluding that - but a timeout on the dagster db connection would still be neat 👍
Just as an update: without actually restarting the daemons (which we've done before), we changed some network connections, and it seems the dagster daemon was able to detect the failed connection; this time it did not hang, but crashed.
Here's an example of a postgres dagster.yaml config that adds a connection timeout parameter, which might help in situations like this? I can check with the team about the pros and cons of setting that as a default if none is set.

storage:
  postgres:
    postgres_db:
      username: my_username
      password: my_password
      hostname: my_hostname
      db_name: my_database
      port: 5432
      params:
        connect_timeout: 60 # seconds
Thank you! Just knowing that it's an option helps - I didn't know it was possible.
We'll set it and see what happens tomorrow.
@gibsondan The timeout "solved" the problem, in that the connection now times out and is retried, after which it works.
With that, I assume the problem is with our infrastructure. Thank you so much for the help debugging (via py-spy) and finding a solution.
I'd maybe recommend having a default connect_timeout of maybe 10 minutes (or similar), which no connection should exceed - then future users will at least know there is an issue, rather than the daemon simply hanging forever.