Comments (5)
In our Ray deployment, some unknown condition causes tasks on preempted nodes to remain registered as "running" and the node to appear alive in the Ray Dashboard, even though the node is long gone. The node page on the Ray Dashboard is blank, and the task continues "running" forever.
Hi @terraflops1048576, this seems like a Ray bug that we should fix. Could you elaborate more?
from ray.
I don't really have the ability to diagnose what's going on here. Opening the Chrome DevTools on the node page (http://<cluster ip>/#/cluster/nodes/<node id>) shows:
TypeError: Cannot read properties of undefined (reading '0')
at hc (NodeDetail.tsx:115:30)
at oo (react-dom.production.min.js:157:137)
...
which suggests to me that the cluster can't fetch the information for the node because it's gone. The node IP is unreachable over SSH, which suggests that the node has been preempted.
Nevertheless, the task continues to show as "running" (blue) in the Ray Core Dashboard; it runs forever and never terminates. We encountered this problem because tasks appeared to run forever, and clicking on such a task to get its node information yielded a blank screen. I have screenshots of the problem, but I'm not sure that they're helpful.
I should add that I understand that this information is certainly not sufficient to reproduce the bug, and I would love to collect information to track this down -- if I could be told what exactly to gather, because this seems to happen often enough.
I think that, at the very least, a CLI/API command to unstick tasks that get stuck in this state would serve as a workaround.
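As a starting point for gathering information, here is a minimal sketch (not a confirmed diagnosis procedure) that looks for nodes the cluster still reports as alive but that no longer accept TCP connections, and then lists the "running" tasks assigned to them. The helper names (`is_reachable`, `find_suspect_nodes`, `tasks_on_nodes`, `collect_from_cluster`) are hypothetical. It assumes a recent Ray 2.x where `ray.nodes()` returns dicts with `NodeID`, `Alive`, `NodeManagerAddress`, and `NodeManagerPort`, and where `ray.util.state.list_tasks` returns objects with `task_id`, `state`, and `node_id` attributes; it also assumes the node manager port is reachable from wherever the script runs.

```python
import socket
from collections import defaultdict


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def find_suspect_nodes(nodes, probe=is_reachable):
    """Return IDs of nodes reported alive that fail the reachability probe.

    `nodes` is a list of dicts shaped like the entries of `ray.nodes()`.
    """
    return [
        n["NodeID"]
        for n in nodes
        if n["Alive"] and not probe(n["NodeManagerAddress"], n["NodeManagerPort"])
    ]


def tasks_on_nodes(tasks, node_ids):
    """Group the IDs of RUNNING tasks by node, restricted to `node_ids`."""
    wanted = set(node_ids)
    by_node = defaultdict(list)
    for t in tasks:
        if t["state"] == "RUNNING" and t["node_id"] in wanted:
            by_node[t["node_id"]].append(t["task_id"])
    return dict(by_node)


def collect_from_cluster():
    """Query a live cluster; run this on a machine that can reach the head node."""
    # Imported lazily so the pure helpers above work without Ray installed.
    import ray
    from ray.util.state import list_tasks

    ray.init(address="auto")
    suspects = find_suspect_nodes(ray.nodes())
    running = list_tasks(filters=[("state", "=", "RUNNING")], detail=True)
    tasks = [
        {"task_id": t.task_id, "state": t.state, "node_id": t.node_id}
        for t in running
    ]
    return tasks_on_nodes(tasks, suspects)
```

Calling `collect_from_cluster()` returns a mapping from suspect node ID to the task IDs still marked as running there, which could be attached to the issue together with the Ray version. An empty result does not rule out the bug; it only means this particular probe found nothing.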
@terraflops1048576 Ray has a health check, so if the underlying machine of a Ray node is gone, Ray will eventually mark the node as dead after a few minutes. Is this not the case?
This was indeed not the case, for some reason, when I tried this on Ray 2.12: the running tasks simply hung. However, I cannot reproduce the issue on Ray 2.24. I suspect PR #44692 fixed this.