This is an elusive bug and I could use some advice tracking it down. I wrote a Hyperopt Python script that optimizes my Gazebo friction/surface/joint/etc. parameters. It repeatedly calls the spawn_entity/delete_entity services to run a simulation of my robot, collecting IMU and odometry data to compute an optimization score.
It worked great! ...but along the way I've discovered a few hiccups in the repeated cycling of a model entity. Something in the tear-down is causing memory corruption that eventually produces a SEGV (in mostly random places, unfortunately), or causes some kind of mutex lock to fail and block forever. I believe the tear-down is complicated by the many shared_ptrs involved, which may keep some objects alive longer than expected. I've been banging on the problem for a while and have made some progress, measured in how many training episodes I can run before getting a crash.
I am currently pondering:
- because I trigger deletion via delete_entity (a service call), could the service machinery be holding node references that cascade to references on the controllers/plugins and interfere with their tear-down? Maybe (see the sketch after this list).
- @ahcorde I don't totally have my head around the controller spawning and executor/spin-thread interaction, so perhaps we can review that?
- I don't think this is much of an issue in non-episodic simulations where there is only a single tear-down, i.e. "who cares if it crashes on termination"
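To make the first bullet concrete, here is a distilled, self-contained sketch of the lifetime hazard I suspect; FakeNode and the pending callback are illustrative stand-ins, not actual gazebo_ros code:

#include <functional>
#include <iostream>
#include <memory>

struct FakeNode {
  ~FakeNode() { std::cout << "node destroyed\n"; }
};

int main() {
  auto node = std::make_shared<FakeNode>();
  // e.g. an in-flight service call holds a copy of the node shared_ptr
  std::function<void()> pending = [node]() { /* ... */ };
  node.reset();       // "delete_entity" drops its reference...
  std::cout << "node still alive: the callback copy keeps it so\n";
  pending = nullptr;  // ...the node only dies when the callback is released
}

If something like that copy survives into tear-down, the objects hanging off the node get destructed later, and in a different order, than the plugin expects.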
I apologize for the insanely long text. Feel free to skip the First and Second issues and go straight to Unresolved.
I am on Ubuntu 20.04. I have tested this on Foxy (apt binaries) and on the latest master branch of ROS 2 as well.
First Issue | 1-5 episodes
GazeboRosControlPlugin was sometimes crashing due to Gazebo events coming in during tear-down. I think we should at least be disconnecting from Gazebo first so we don't get any events coming in while we are destructing. I then explicitly call executor->cancel() and thread_executer_spin.join(). The spin thread was set to exit only when rclcpp::ok() returned false, so I added a flag there to exit the loop.
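For reference, here is a minimal self-contained sketch of the shutdown ordering I converged on; the names (Plugin, stop_, spin_thread_) are illustrative, not the actual plugin members:

#include <atomic>
#include <thread>

struct Plugin {
  std::atomic<bool> stop_{false};
  std::thread spin_thread_;

  Plugin()
  : spin_thread_([this]() {
      // previously: while (rclcpp::ok()) { executor_->spin_once(); }
      while (!stop_) { /* executor_->spin_once(); */ }
    })
  {}

  ~Plugin() {
    // 1. disconnect the Gazebo event connections first (not shown), so no
    //    callbacks arrive mid-destruction
    // 2. executor_->cancel();
    stop_ = true;         // 3. flag the loop to exit even while rclcpp::ok()
    spin_thread_.join();  // 4. join before any members are destroyed
  }
};

int main() { Plugin p; }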
Second Issue | ~15 episodes
rclcpp Nodes have many sub-interfaces, and many of those sub-interfaces depend on other ones, especially the base node interface. These interfaces are therefore constructed in a specific order, but their destruction order was left implicit. So I modified the rclcpp lib to explicitly destruct the sub-interfaces in reverse order. This is documented in rclcpp Issue 1469.
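A toy version of the change (not the actual rclcpp classes): explicitly resetting the sub-interface members in ~Node pins the tear-down order instead of leaving it to member-declaration order:

#include <memory>

struct NodeBase {};
struct NodeParameters {
  std::weak_ptr<NodeBase> base;  // needs the base alive to tear down cleanly
};

struct Node {
  std::shared_ptr<NodeBase> base_ = std::make_shared<NodeBase>();
  std::shared_ptr<NodeParameters> parameters_ = std::make_shared<NodeParameters>();

  ~Node() {
    parameters_.reset();  // dependents first...
    base_.reset();        // ...the base interface last
  }
};

int main() { Node n; }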
Unresolved Issue
I am still getting a crash around 40-60 episodes. That has been enough to tune my parameters well, but I would like to take the opportunity to resolve this issue anyway. There is a little bug somewhere between the gazebo plugin and rclcpp just calling out for a fix. :)
Destruction of Parameter sub-interface after BaseInterface was destroyed
I get this in the log output sometimes during tear-down:
[gzserver-1] [ERROR] [1606372728.971798611] [rclcpp]: Error in destruction of rcl service handle imu_sensor/get_parameters: the Node Handle was destructed too early. You will leak memory
This repeats for all the parameter sub-interface services (get_parameter_types, set_parameters, set_parameters_atomically, describe_parameters, list_parameters), and it's not always the gazebo imu plugin; it can be the odometry (p3d) plugin or one of the ros2_control instantiated controllers.
I tracked this down to the rclcpp Node destruction flow. By the time the destructor of the Parameter sub-interface is finally called, the BaseInterface it requires to tear down the parameter service interfaces has already been destroyed. The parameter interface holds only a weak_ptr to the BaseInterface, so it doesn't prevent it from being destroyed; it aborts safely, but it logs the error above and the service interfaces are not properly torn down. I assume something else must be holding on to the Parameter interface but not the BaseInterface or the Node itself... but what? And is it even a problem with this repo? I don't know.
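Here is a distilled repro of that ordering with illustrative classes (not rclcpp's actual ones): an extra shared_ptr keeps the parameter sub-interface alive while the base it weakly references dies first:

#include <iostream>
#include <memory>

struct BaseInterface {};  // owns the rcl service handles in the real code

struct ParameterService {
  std::weak_ptr<BaseInterface> base_;
  ~ParameterService() {
    if (base_.lock()) {
      std::cout << "clean teardown through the base\n";
    } else {
      // analogue of "the Node Handle was destructed too early"
      std::cout << "base already gone: service handles leak\n";
    }
  }
};

int main() {
  auto base = std::make_shared<BaseInterface>();
  auto params = std::make_shared<ParameterService>();
  params->base_ = base;
  auto mystery_holder = params;  // something keeps the sub-interface alive
  base.reset();                  // the base dies first...
  params.reset();
  mystery_holder.reset();        // ...so this destructor runs too late
}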
Stopping controllers via Controller Manager
I figured I could lessen the tear-down contention by stopping all controllers in the GazeboRosControlPlugin destructor using the code below. It runs after I disconnect from the Gazebo events, but while the executor and spin thread are still running fine.
std::vector<std::string> start_controllers, stop_controllers;
auto controllers = impl_->controller_manager_->get_loaded_controllers();
// stop every loaded controller: collect all their names into stop_controllers
std::transform(
  controllers.begin(), controllers.end(),
  std::back_inserter(stop_controllers),
  [](const controller_manager::ControllerSpec & spec) {return spec.info.name;});
impl_->controller_manager_->switch_controller(
  start_controllers, stop_controllers,
  2, true,                   // strictness = STRICT (2), start_asap = true
  rclcpp::Duration(10, 0));  // 10 second timeout
For some reason this blocks forever inside switch_controller(), in the "wait until switch is finished" loop at controller_manager.cpp:406. I stepped through and it does correctly put my two ros2 controllers in the stop_controllers collection, so I'm not sure why it never completes. It doesn't respect the 10 s timeout argument either. Unless I am doing something dumb, this should probably be moved out into a ros2_control issue.