Comments (19)

gomoripeti avatar gomoripeti commented on June 13, 2024

sure, understandable and understood

I just wanted to give a sign of life. I will describe the "incomplete record migration" in detail and in a reproducible way. It is also totally fine if this edge case does not block the release of 3.13.0 and is gradually improved in follow-up patch versions (either by me or by others).

from rabbitmq-server.

dumbbell avatar dumbbell commented on June 13, 2024

The more I think about the problem, the more I believe we shouldn't use the new ID format. It duplicates data for no benefit. And while looking into this, I also discovered that the new ID format is already out there for shovels in 3.12.x and 3.11.x as you said.

We put so much effort into making sure the upgrade to Khepri was smooth for the past few years (and there are still rough edges as we can see), that I refuse to add code to federation to "fix" something that shouldn't be there in the first place.

The only acceptable solution to me is to find something to convert new IDs back to their original format in the shovel plugin.

michaelklishin avatar michaelklishin commented on June 13, 2024

@gomoripeti can't we do exactly what we did to Shovels? It sounds like the same problem in a different place to me:

#10004, #10096, #9965, #9968

gomoripeti avatar gomoripeti commented on June 13, 2024

@michaelklishin indeed this is the exact same problem. But this is not as bad because the id format change was not backported for federation, so all already released versions (except release candidates) still use the old id format. So there is a simpler solution (which is enabled by #10096): "Revert to old id format if khepri is disabled." My reason to prefer this one is because it took about 3 patch releases to get shovels right.

michaelklishin avatar michaelklishin commented on June 13, 2024

@gomoripeti nonetheless, we did settle on a certain solution for Shovel. Mnesia ("Khepri is not enabled") will go extinct in the land of RabbitMQ in a single digit number of months.

The hard part is the upgrade path, and it cannot be avoided.

gomoripeti avatar gomoripeti commented on June 13, 2024

sorry for the delayed response, finally I had time to spend on this topic

I agree that the upgrade path is hard. I see it like this

3.12 Mnesia -(step 1.)-> 3.13 Mnesia -(step 2.)-> 3.13 Khepri

  • 3.12 Mnesia only supports the old child id format
  • 3.13 Mnesia only supported the new child id format and there was no conversion in step 1.
    After a lot of testing and trial and error I changed my mind about the fix and am submitting a PR that mirrors the solution implemented for "shovels"
  • 3.13 Khepri only supports new child id format

In step 2, a record conversion from the old child id format to the new one was implemented in #10096.
However, during testing I found out that this is not complete: the supervisor state (in the process) still has the old child ids, and this inconsistency leads to various failures after migrating to Khepri.
I will submit a separate issue to report this. Based on previous weeks, I unfortunately cannot commit to working on a fix for this.
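
To make the inconsistency above concrete, here is a minimal toy model. It is only a sketch: the names `convert_child_id`, `stored_records` and `supervisor_state` are hypothetical, and the real old and new child id formats are not reproduced here. It only shows how converting the stored records without touching the in-process supervisor state can leave the two out of sync.

```python
# Toy model only: none of these names come from RabbitMQ, and the
# ("new", id) shape is a placeholder, not the real new child id format.

def convert_child_id(old_id):
    """Stand-in for a record conversion from the old id format to a new one."""
    return ("new", old_id)

# Records in the schema data store, keyed by child id (old format).
stored_records = {"link-a": "spec-a", "link-b": "spec-b"}

# Child ids held in the running supervisor's process state (old format).
supervisor_state = {"link-a", "link-b"}

# The migration converts the stored records only...
migrated_records = {convert_child_id(k): v for k, v in stored_records.items()}

# ...so every id the supervisor still holds is now missing from the store.
orphaned = {cid for cid in supervisor_state if cid not in migrated_records}
print(sorted(orphaned))
```

In this model every running child ends up "orphaned", which is the kind of mismatch that could plausibly surface as failures once lookups start using the migrated records.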

michaelklishin avatar michaelklishin commented on June 13, 2024

Going straight from 3.12 to 3.13 with Khepri is possible in theory but very unlikely for production systems, and we can document this fact.

I'm not sure what you mean by "based on previous weeks".

michaelklishin avatar michaelklishin commented on June 13, 2024

Federation links are not that different from dynamic Shovels:

  • They are started at node boot
  • They have IDs and attach to a mirrored supervisor
  • They store some of their state in the schema data store

Why specifically can't the solution we have for shovels work here?

Also, can you be more specific than saying the conversion is "not complete"? Claims like that do not help us ship 3.13 sooner in any way, so we either do it in a way that would be problematic for CloudAMQP, or CloudAMQP will miss out on this RabbitMQ version and then 4.0, and all the fundamental improvements that come with not using Mnesia at all in RabbitMQ.

michaelklishin avatar michaelklishin commented on June 13, 2024

@gomoripeti I will be blunt: we need more specifics here and either decide to address it, or 3.13.0 will ship as is, and it will be up to the companies that host RabbitMQ as a service to find a solution.

3.13 cannot be delayed for a few more months.

gomoripeti avatar gomoripeti commented on June 13, 2024

I opened #10440 to describe the "incomplete record migration" during Khepri migration issue.
This issue could track the 3.12 -> "3.13 Mnesia" upgrade part only, which should be addressed by PR #10416 .

michaelklishin avatar michaelklishin commented on June 13, 2024

The 3.12 -> 3.13 Mnesia migration should be handled by #10453. See #10440 for future work related to Khepri and Federation, Shovel supervisor child IDs.

dumbbell avatar dumbbell commented on June 13, 2024

Just to clarify a few points, which should make a solution easier to find:

Mnesia ("Khepri is not enabled") will go extinct in the land of RabbitMQ in a single digit number of months.

Very unlikely as it would be removed from RabbitMQ 4.1.0 at the earliest.

Going straight from 3.12 to 3.13 with Khepri is possible in theory

No, it's not possible: you have to upgrade, then enable khepri_db. To enable khepri_db, all nodes in the cluster must be running and know about khepri_db, like other feature flags.
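
The constraint described above can be sketched as a simple check. This is an illustration under stated assumptions, not RabbitMQ's feature flag implementation: the `can_enable` function and the dictionary layout of `cluster` are hypothetical, and only the rule itself (every node must be running and must know the flag) comes from the comment.

```python
# Hypothetical model of the rule: a flag can only be enabled once every
# node in the cluster is running and knows about that flag.

def can_enable(flag, cluster):
    """cluster maps a node name to {"running": bool, "known_flags": set}."""
    return all(
        node["running"] and flag in node["known_flags"]
        for node in cluster.values()
    )

cluster = {
    "rabbit@a": {"running": True, "known_flags": {"khepri_db"}},
    # This node has not been upgraded yet, so it does not know the flag.
    "rabbit@b": {"running": True, "known_flags": set()},
}

before_upgrade = can_enable("khepri_db", cluster)

# Once the remaining node is upgraded, it learns about the flag.
cluster["rabbit@b"]["known_flags"].add("khepri_db")
after_upgrade = can_enable("khepri_db", cluster)

print(before_upgrade, after_upgrade)
```

This is why the upgrade and the switch to Khepri are necessarily two separate steps: the flag cannot be enabled while any node is still on a release that does not know about it.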

I'm going to look at other issues and pull requests linked in this one.

dumbbell avatar dumbbell commented on June 13, 2024

I have a local prototype that approaches the problem differently: instead of changing the ID format and breaking running supervision trees and causing issues during upgrade, the ID is left alone but is converted to a Khepri-compatible path only when we need it.

This relies on the fact that the Group argument is always a module AFAICT, even if mirrored_supervisor documents that it can be any term.
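
As a rough illustration of the idea (not the prototype itself), the sketch below leaves the child ID untouched and derives a store path only on demand. Everything here is an assumption made for the example: the `path_for` function, representing the group as a module name string, and in particular the use of a hash as the path component, since the thread does not say how the path is derived.

```python
import hashlib

def path_for(group, child_id):
    """Derive a store path from (group, child_id) without changing child_id.

    Assumption for this sketch: the group is always a module name (a string).
    Hashing the id is an illustrative choice, not the actual scheme.
    """
    if not isinstance(group, str):
        raise TypeError("this sketch assumes the group is a module name")
    digest = hashlib.sha256(repr(child_id).encode("utf-8")).hexdigest()
    return [group, digest]

# The supervisor keeps using the original id, whatever its shape is.
child_id = ("upstream", "my-exchange")

# The path is computed only at the point where the store needs one.
path = path_for("some_supervisor_module", child_id)
print(path[0], len(path[1]))
```

Because the path is a pure function of the unchanged ID, running supervision trees never see a different ID across an upgrade, and the same path can be recomputed whenever it is needed.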

Diana is on holiday currently, so I can't run the idea by her for now. I will prepare a more complete patch and publish it to GitHub so you can take a look.

dumbbell avatar dumbbell commented on June 13, 2024

Here is the branch:
https://github.com/rabbitmq/rabbitmq-server/tree/rework-mirrored_supervisor-child-id

I tested it lightly so far.

What do you think @gomoripeti?

Edit: I changed the branch above. It includes the commit from the previously mentioned branch, plus other commits to address the whole problem with both federation and shovels, and a test case that reproduces the scenario at the beginning of this issue.

lukebakken avatar lukebakken commented on June 13, 2024

the ID is left alone but is converted to a Khepri-compatible path only when we need it

👍 👍 👍

lukebakken avatar lukebakken commented on June 13, 2024

I am re-opening based on @dumbbell's comment

gomoripeti avatar gomoripeti commented on June 13, 2024

Thanks all for looking at this.

  • On the administrative side this issue was addressing only the 3.13 on Mnesia case, and there was a PR from Michael merged addressing this. That's why this issue was closed. There is #10440 which tracks the Khepri path and friends problem.

  • the ID is left alone

    This is fine for federation but shovel plugin has the same issue. And there the new id format is already released in 3.12 and even 3.11. So there might be deployments out there which already have running shovel workers with both old and new child id format. Then we have to think about how to convert the new id format back to the old one.

  • Given the current situation that versions supporting both old and new child id format on Mnesia are already merged for federation and already released for shovels, what do you all think about a solution:
    Restart all mirrored supervisor workers at the start of khepri migration (from the init_copy_to_khepri callback) before copying of records starts. Either by using the assumption that group is a module or by a registration mechanism that Diana implemented.

dumbbell avatar dumbbell commented on June 13, 2024

The work in progress to address that upgrade problem is tracked in pull request #10472.

dumbbell avatar dumbbell commented on June 13, 2024

Fixed by #10472.
