Bug report Required Info: <li

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

<a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="22

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Lifecycle destructor calls shutdown while in shuttingdown intermediate state,about ros2/rclcpp

Comments (40)

fujitatomoya commented on June 21, 2024 4

@oysstu @g-arjones @gabrielfpacheco thanks for the quick response.

i will start reverting, i will try some doc and debug print follow-up after reverting.

from rclcpp.

oysstu commented on June 21, 2024 3

I think this is the correct move right now. The context shutting down before the node is cleaned up points to either a programming/logic error or something exceptional happening. It might be best to leave it to the user to decide the actions to take.

LifecycleNode can register a shutdown callback for the context in the lifecycle node constructor and handle the shutdown there. but i think this only works if the node is managed by shared_ptr, so that the context can use weak_ptr to see if the object is still in valid. i think it would be simpler to leave this to user application instead of providing something sometimes works but sometimes does not.

As I mentioned, it's possible to do this if the lifecycle functionality was implemented in a base interface contained in a shared pointer in the lifecycle class. I believe implementing this would enable some generic programming by treating the lifecycle node like a regular node through its base interfaces, but then have the additional lifecycle interface for the specific lifecycle functionality.

Right now, the lifecycle API is a weird mix of functional programming and inheritance. I think it would be better to have the lifecycle class intended for use with inheritance, and disallow the direct callback functions. If the lifecycle functionality was implemented in a base interface it could be added to any node if users prefer the functional style.

from rclcpp.

fujitatomoya commented on June 21, 2024 2

@oysstu @g-arjones @gabrielfpacheco thanks for the discussion and opinions.

all comments here make sense to me.

Calling a subclass method from a base class destructor is just not possible in C++ (which also explains #2554).

this is also true.

Just to expand on that, since it's the subclasses that are initializing/handling devices and sensor states they are the ones that should be responsible for ensuring the states are not unknown (in their dtors and/or on_shutdown)

hmm, so that is user's responsibility, and base class never calls shutdown.

i am not sure at this moment, i am okay to roll back everything but before that i can bring this up with other maintainers.

from rclcpp.

fujitatomoya commented on June 21, 2024 2

@oysstu @g-arjones @gabrielfpacheco just FYI, this is closed.

from rclcpp.

fujitatomoya commented on June 21, 2024 1

#2522 reverts the #2450, i will try once jazzy release is settled.

from rclcpp.

fujitatomoya commented on June 21, 2024 1

@oysstu just fyi,

Only attempt transition in primary states unconfigured, inactive, and active.
Shutdown transition is attempted for intermediate transitions (e.g., shutting down).

this original problem should be now completed as LifecycleNode class.

#2520 (comment) signal case can be discussed more here. (when the context is shutdown gracefully with deferred signal thread)

from rclcpp.

g-arjones commented on June 21, 2024 1

I believe that either context is valid or invalid, we should clean up the lifecycle node to finalize it to avoid leaving the devices unknown state, that is the original issue #997.

As demonstrated in #2553 that issue cannot be addressed from the dtor because the subclasses' dtors are called before the base class' and it's illegal to call methods after the dtor so I think the proper way to address that would be to use the context's shutdown callback as suggested by @oysstu which is what we have been using (and is completely broken now with that change)

from rclcpp.

g-arjones commented on June 21, 2024 1

i am okay to roll back everything but before that i can bring this up with other maintainers.

Thank you, I would really appreciate that. This change created a lot of problems for us.

from rclcpp.

fujitatomoya commented on June 21, 2024 1

@oysstu @g-arjones @gabrielfpacheco
CC: @mjcarroll @wjwwood @clalancette

so i came to the conclusion, that is i will roll back call shutdown in LifecycleNode dtor including all backports.

the reason is pretty much we discussed here. we are generating the another complication (hard to expect and user cannot avoid #2554) to address the one issue(#2553 (comment)). until we can figure out more fine grained control for user application, we should roll back the behavior so that user can be responsible for the lifecycle management.

so at this moment, that is user application responsibility to make sure to call shutdown the device or sensor to avoid leaving them in unknown state. (either calling shutdown in sub-class or using context shutdown callback.)

LifecycleNode can register a shutdown callback for the context in the lifecycle node constructor and handle the shutdown there. but i think this only works if the node is managed by shared_ptr, so that the context can use weak_ptr to see if the object is still in valid. i think it would be simpler to leave this to user application instead of providing something sometimes works but sometimes does not.

as follow-up, I guess something we can improve is doc section that user needs to call shutdown if necessary and warning message just to check the current state and if that is not shut down, we can warn the user that LifecycleNode is not shut down, ... to let the user know this.

what do you think? any concerns and thoughts?

again, thanks for pointing out my oversight and sharing!

from rclcpp.

fujitatomoya commented on June 21, 2024 1

I'm just not sure which loglevel that should go in since for many applications not finalizing a node before terminating the process (or destroying its instance) is perfectly fine and even common practice within many examples and tutorials.

i guess DEBUG. i am not sure if any user application intentionally leaves without shutting down, but i understand your concern that warning could be surprising for user all the sudden.

from rclcpp.

fujitatomoya commented on June 21, 2024

Would it make sense for this to check whether the node is in one of the primary states before sending the shutdown transition?

i think this makes sense. we cannot change the state when it is in transition state anyway.
if that is not in the primary state, we can print the warning that it cannot shutdown during destructor.

Could not publish transition: publisher's context is invalid, at ./src/rcl/publisher.c:423, at ./src/rcl_lifecycle.c:368

lifecycle node com interface cannot be finalized yet, https://github.com/ros2/rcl/blob/f19067726b9bc2d0b90ac0d0f246d1d7821d41ee/rcl_lifecycle/src/rcl_lifecycle.c#L264 since std::unique_ptr<LifecycleNodeInterfaceImpl> impl_; should not be destroyed at LifecycleNode's dtor.

from rclcpp.

fujitatomoya commented on June 21, 2024

Just posting what needs to be done after jazzy release.

revert #2522 (including jazzy, not iron/humble)
add the fix to call shutdown only when lifecycle node is in the primary state. (this fix needs to be applied to iron and humble)

from rclcpp.

fujitatomoya commented on June 21, 2024

So cont to #2520 (comment), i took a quick check at the implementation.

I've seen the following warning quite a bit when terminating a launched node using ctrl+c.

this could be different problem for destruction order. @oysstu can you provide the minimal example to reproduce the issue?

Unable to start transition 7 from current state shuttingdown: Could not publish transition: publisher's context is invalid, at ./src/rcl/publisher.c:423, at ./src/rcl_lifecycle.c:368

this is NOT because the node is transition state. this is because the context is already destroyed before lifecycle node dtor.

this is the same problem that #2522 (comment) reported. this can be observed with rclcpp_lifecycle/test_lifecycle_publisher. and this is the bug in the test code. rclcpp::shutdown() (TearDown) will be called before LifecycleNode::~LifecycleNode, that said the context will be invalid at node dtor, so #2450 generates the error because it tries to publish the state after transition.

1st we need to fix this, #2527

from rclcpp.

fujitatomoya commented on June 21, 2024

@oysstu it would be really appreciated if you can take a look at,

#2527
#2528

from rclcpp.

oysstu commented on June 21, 2024

I'm traveling for the next couple of weeks, but I'll try to follow up as best as I can. I noticed now that the rclcpp context is shut down by the signal handler, and not by me after the executor stops spinning. See the following output:

^C[WARNING] [launch]: user interrupted with ctrl-c (SIGINT)
[WARNING] [launch_ros.actions.lifecycle_node]: Abandoning wait for the '/namespace/node/change_state' service response, due to shutdown.
[node_exec-1] [INFO] [1715459945.276584858] [rclcpp]: signal_handler(signum=2)
[node_exec-1] [DEBUG] [1715459945.276609123] [rclcpp]: signal_handler(): notifying deferred signal handler
[node_exec-1] [DEBUG] [1715459945.276634450] [rclcpp]: deferred_signal_handler(): woken up due to SIGINT/SIGTERM or uninstall
[node_exec-1] [DEBUG] [1715459945.276641853] [rclcpp]: deferred_signal_handler(): shutting down
[node_exec-1] [DEBUG] [1715459945.276674554] [rclcpp]: deferred_signal_handler(): shutting down rclcpp::Context @ 0x5ef029a5b700, because it had shutdown_on_signal == true
[node_exec-1] [DEBUG] [1715459945.276678071] [rcl]: Shutting down ROS client library, for context at address: 0x5ef029a5ba40
[node_exec-1] [DEBUG] [1715459945.276689672] [rcl]: Finalizing publisher
[node_exec-1] [DEBUG] [1715459945.276920120] [rcl]: Publisher finalized
[node_exec-1] [DEBUG] [1715459945.276944746] [rclcpp]: deferred_signal_handler(): waiting for SIGINT/SIGTERM or uninstall
[ERROR] [node_exec-1]: process[node_exec-1] failed to terminate '5' seconds after receiving 'SIGINT', escalating to 'SIGTERM'
[INFO] [node_exec-1]: sending signal 'SIGTERM' to process[node_exec-1]
[node_exec-1] [INFO] [1715459950.335149560] [rclcpp]: signal_handler(signum=15)
[node_exec-1] [DEBUG] [1715459950.335167252] [rclcpp]: signal_handler(): notifying deferred signal handler
[node_exec-1] [DEBUG] [1715459950.335197609] [rclcpp]: deferred_signal_handler(): woken up due to SIGINT/SIGTERM or uninstall
[node_exec-1] [DEBUG] [1715459950.335203910] [rclcpp]: deferred_signal_handler(): shutting down
[node_exec-1] [DEBUG] [1715459950.335207287] [rclcpp]: deferred_signal_handler(): waiting for SIGINT/SIGTERM or uninstall
[node_exec-1] [ERROR] [1715459954.328207103] [namespace.node]: Failed to finish transition 1. Current state is now: unconfigured (Could not publish transition: publisher's context is invalid, at /tmp/makepkg/ros2-iron-base/src/ros2/src/ros2/rcl/rcl/src/rcl/publisher.c:423, at /tmp/makepkg/ros2-iron-base/src/ros2/src/ros2/rcl/rcl_lifecycle/src/rcl_lifecycle.c:368)

Shutdown is called from here:

rclcpp/rclcpp/src/rclcpp/signal_handler.cpp

Line 271 in 343b29b

context_ptr->shutdown("signal handler");

from rclcpp.

oysstu commented on June 21, 2024

Some possible solutions come to mind.

Check if the context/publisher is still valid in the destructor and skip transition if invalid
Register a shutdown callback for the context in the lifecycle node constructor and handle the transitioning there. Remove callback in destructor. This is something I've used to do shutdown transitioning previously.
A combination of the two to ensure transitioning both when the context is shutdown first or destructor is called first

from rclcpp.

fujitatomoya commented on June 21, 2024

I believe that either context is valid or invalid, we should clean up the lifecycle node to finalize it to avoid leaving the devices unknown state, that is the original issue #997.

adding a shutdown callback for the context would work, but if user also adds the preshutdown callback on the context, it would not work as expected order. (we might call Lifecycle::shutdown before user's cleanup callbacks.)

I think we can call the LifecycleNode::shutdown in the destructor, but publishing state is best effort? (not error, but warning) any thoughts?

from rclcpp.

oysstu commented on June 21, 2024

adding a shutdown callback for the context would work, but if user also adds the preshutdown callback on the context, it would not work as expected order. (we might call Lifecycle::shutdown before user's cleanup callbacks.)

Assuming that the shutdown callbacks are called in the reverse order of registration, the callback registered in the constructor would be the last one to be called relative to the lifetime of the node (i.e., any references to the node would be obtained by the user after the constructor). If the shutdown callback handles the shutdown transition on rclcpp::shutdown, and the node destructor handles it otherwise, are not all cases covered? Assuming that the node destructor does nothing if the context/publisher is invalid.

I think we can call the LifecycleNode::shutdown in the destructor, but publishing state is best effort? (not error, but warning) any thoughts?

Hmm, I don't know enough about the lifecycle QoS and it's intended design to comment on this.

from rclcpp.

fujitatomoya commented on June 21, 2024

Assuming that the shutdown callbacks are called in the reverse order of registration,

this is true.

i was thinking that user would do something like, have the pointers for nodes, and reset those pointers in context shutdown pre-shutdown hook to clean everything up since the context is shutting down. and then this LicecycleNode system callback would be calling the invalid memory segment? that was my concern off the top of my head...

maybe there could be more ideas, lets keep this open for the fix! thanks for iterating, i will try to bring this issue for some WG meeting.

from rclcpp.

oysstu commented on June 21, 2024

i was thinking that user would do something like, have the pointers for nodes, and reset those pointers in context shutdown pre-shutdown hook to clean everything up since the context is shutting down. and then this LicecycleNode system callback would be calling the invalid memory segment?

I've used a weak_ptr to refer to the node without extending it's lifetime. This only works when the node is managed by a shared_ptr though. If the lifecycle interface was instead defined in a NodeLifecycleInterface class and contained in the node as a shared_ptr, this could be guaranteed (i.e., similar to the other interfaces such as NodeBaseInterface).

I've also de-registered the context shutdown callback in the lifecycle destructor, so that also should avoid dangling callbacks. That is good if an application creates and deletes a bunch of nodes. This alone is not sufficient though, since there could be a race condition between the destructor and shutdown callback (which is why a shared_ptr/weak_ptr is needed).

from rclcpp.

fujitatomoya commented on June 21, 2024

iron backport: #2543
humble backport: #2544

from rclcpp.

fujitatomoya commented on June 21, 2024

I end up having #2545 as follow up for #2528.

the following backport PRs need to include #2545, so holding.

from rclcpp.

fujitatomoya commented on June 21, 2024

I think the proper way to address that would be to #2520 (comment) is what we have been using (and is completely broken now with that change)

@g-arjones thanks for the information and issue. in that case, what about the context is still in valid? can we just leave the device or sensor with unknown state until context is destroyed?

from rclcpp.

g-arjones commented on June 21, 2024

I think that's the best that can be done by the base class (that was actually what the author of the feature request was referring to since he mentioned CTRL+C). If a more fine-grained control of the state is required then the applications must handle the transitions themselves.

Calling a subclass method from a base class destructor is just not possible in C++ (which also explains #2554).

from rclcpp.

oysstu commented on June 21, 2024

@g-arjones thanks for the information and issue. in that case, what about the context is still in valid? can we just leave the device or sensor with unknown state until context is destroyed?

The C++ RAII paradigm would suggest that since the subclass initializes the resource, it is up to the subclass dtor to handle releasing the resource properly. On a more distributed level, we use a manager node to handle transitioning when not terminating through exceptions or CTRL+C (e.g. something like the nav2 lifecycle manager). For us, the nominal shutdown pattern used is deactivate-cleanup-shutdown unless one of the intermediate transitions fail, in which case the direct shutdown transition is used (e.g. active-shutdown).

from rclcpp.

g-arjones commented on June 21, 2024

Same here 👍

from rclcpp.

g-arjones commented on June 21, 2024

Just to expand on that, since it's the subclasses that are initializing/handling devices and sensor states they are the ones that should be responsible for ensuring the states are not unknown (in their dtors and/or on_shutdown)

from rclcpp.

fujitatomoya commented on June 21, 2024

yeah sorry about that happens, i really appreciate your feedback and comments.

from rclcpp.

g-arjones commented on June 21, 2024

No reason to apologize. Keep up the good work! 👍

from rclcpp.

g-arjones commented on June 21, 2024

@fujitatomoya Thank you for the update 👍

i think it would be simpler to leave this to user application instead of providing something sometimes works but sometimes does not.

Yeah, that's what makes the most sense to me too.

I guess something we can improve is doc section that user needs to call shutdown if necessary

Definitely...

and warning message just to check the current state and if that is not shut down, we can warn the user that LifecycleNode is not shut down, ... to let the user know this.

Yeah, I guess that's fine. I'm just not sure which loglevel that should go in since for many applications not finalizing a node before terminating the process (or destroying its instance) is perfectly fine and even common practice within many examples and tutorials.

from rclcpp.

gabrielfpacheco commented on June 21, 2024

I appreciate your effort for looking into this, @fujitatomoya.

i think it would be simpler to leave this to user application instead of providing something sometimes works but sometimes does not.

It makes sense to me that the user is the one responsible for rightfully finalizing the node since this may be application-specific.

Rolling back and improving documentation seems the right move for me as well.

from rclcpp.

fujitatomoya commented on June 21, 2024

rollback PRs,

Rolling: #2557
Jazzy: #2558
Iron: #2559
Humble: #2560

to close the following issues,

this issue, #2520
#2553
#2554
#997, i would not reopen this instead add comments that user needs to call shutdown at this moment.

from rclcpp.

fujitatomoya commented on June 21, 2024

see #2520 (comment), all related to PRs are rolled back and merged.

i will go ahead to close this issue, feel free to reopen if i miss anything.

from rclcpp.

g-arjones commented on June 21, 2024

@fujitatomoya Thanks a lot! Any idea when the PR to rosdistro will be open?

from rclcpp.

fujitatomoya commented on June 21, 2024

that is something i am not sure, you are talking about humble, right? CC: @audrow

from rclcpp.

g-arjones commented on June 21, 2024

you are talking about humble, right?

Exactly...

from rclcpp.

fujitatomoya commented on June 21, 2024

@audrow @clalancette do we have any plan for humble sync or plan?

from rclcpp.

Lifecycle destructor calls shutdown while in shuttingdown intermediate state about rclcpp HOT 40 CLOSED

Comments (40)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent