As per <a class="issue-link js-issue-link" data-error-text="Failed to load title" data

LOL string escaping : <div class="snippet-clipboard-content notranslate

Related: <a class="issue-link js-issue-link" data-error-text="Failed to load title" da

I'm displeased at this time. In testing all seems to go well (See <a class="issue-link

Displease turns to satisfaction: <div class="highlight highlight-source-shell notr

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Well, that is what it's supposed to be doing, yes? I created a test <code class="notra

Yeah, I agree with <a class="user-mention notranslate" data-hovercard-type="user" data

parsing mysqlbinlog output,about github/gh-ost

Comments (16)

shlomi-noach commented on July 19, 2024 1

LOL string ~~escaping~~:

### INSERT INTO `test`.`samplet`
### SET
###   @1=11
###   @2=11
###   @3='I'm here'

from gh-ost.

shlomi-noach commented on July 19, 2024

Related: #4

from gh-ost.

shlomi-noach commented on July 19, 2024

I'm displeased at this time. In testing all seems to go well (See #4), but on production, parsing a true binary log file yields with different (smaller) number of statements as compared to raw binlog parsing.
I will pursue checking which entries are differing, however I'm getting the nasty feeling there will be more hidden gotchas to parsing the binary log file properly.

from gh-ost.

shlomi-noach commented on July 19, 2024

Displease turns to satisfaction:

shlomi-noach@my-test-machine~$ /tmp/gh-osc --debug --mysql-basedir=/usr --mysql-datadir=/home/shlomi-noach/tmp/ --binlog-file=mysql-bin.012323 --internal-experiment=true 2> /dev/null > tmp/statements-gh4.sql
shlomi-noach@my-test-machine~$ /usr/bin/mysqlbinlog --verbose --base64-output=DECODE-ROWS --start-position=4  /home/shlomi-noach/tmp/mysql-bin.012323 | egrep "### (INSERT|UPDATE|DELETE)" | sed -e "s/### //" -e "s/INTO //" -e "s/FROM //" > tmp/statements-mbl4.sql
shlomi-noach@my-test-machine~$ wc -l tmp/statements-gh4.sql
2214601 tmp/statements-gh4.sql
shlomi-noach@my-test-machine~$ wc -l tmp/statements-mbl4.sql
2214601 tmp/statements-mbl4.sql
shlomi-noach@my-test-machine~$ diff tmp/statements-mbl4.sql tmp/statements-gh4.sql
shlomi-noach@my-test-machine~$ md5sum tmp/*4.sql
3ca149d5f53bbb83eb9467ee89adb0fe  tmp/statements-gh4.sql
3ca149d5f53bbb83eb9467ee89adb0fe  tmp/statements-mbl4.sql

So for now this is something we can work with, on production data

from gh-ost.

shlomi-noach commented on July 19, 2024

Uggggggh!

A REPLACE translates into DELETE+UPDATE such that the UPDATE modifies the PK value!

### DELETE FROM `test`.`samplet`
### WHERE
###   @1=2
### UPDATE `test`.`samplet`
### WHERE
###   @1=4
### SET
###   @1=2
###   @2=4
###   @3='replaced 2,4'

This is no fun. This means when we wish to apply the change on the ghost table, we need to verify whether there's been a change to PK (as in the above, so easy to detect). If not, good life, we just REPLACE INTO ghost_table SELECT * FROM original_table WHERE ...

But if there is a change to PK value, we need to DELETE the old row and INSERT a new one.
I would think that in RBR we wouldn't need to do this, but seems like we do.

cc @ggunson

from gh-ost.

jonahberquist commented on July 19, 2024

A REPLACE translates into DELETE+UPDATE such that the UPDATE modifies the PK value!

Huh, that's weird. What statement did you run that ended up with this in the binary log? I'd love to see the SQL somewhere to reproduce the test cases.

from gh-ost.

ggunson commented on July 19, 2024

A REPLACE translates into DELETE+UPDATE such that the UPDATE modifies the PK value!

That sounds really, really wrong. MySQL RBR has gone rogue and is thinning the row herd

from gh-ost.

shlomi-noach commented on July 19, 2024

@jonahberquist see https://github.com/github/gh-osc/pull/4/files#diff-0d909eaf5269b40f9d9afe5ef9fae52cR2

The replace into samplet values (2,4,'replaced 2,4'); is the one that translates into a DELETE+UPDATE, as can be seen further down the file on https://github.com/github/gh-osc/pull/4/files#diff-0d909eaf5269b40f9d9afe5ef9fae52cR146

from gh-ost.

ggunson commented on July 19, 2024

Well, that is what it's supposed to be doing, yes? I created a test samplet table and tried it out: it comes from there being a unique index on license as well as a PK, and the REPLACE hits duplicates on both, and I guess this is what it's expected to do?

mysql> select * from samplet;
+----+---------+-------------+
| id | license | name        |
+----+---------+-------------+
|  1 |       1 | a           |
|  2 |       2 | extended    |
|  3 |       3 | extended    |
|  4 |       4 | transaction |
|  5 |       5 | transaction |
|  6 |       6 | transaction |
+----+---------+-------------+
6 rows in set (0.00 sec)

mysql>     replace into samplet values (2,4,'replaced 2,4');
Query OK, 3 rows affected (0.00 sec)

mysql> select * from samplet;
+----+---------+--------------+
| id | license | name         |
+----+---------+--------------+
|  1 |       1 | a            |
|  2 |       4 | replaced 2,4 |
|  3 |       3 | extended     |
|  5 |       5 | transaction  |
|  6 |       6 | transaction  |
+----+---------+--------------+
5 rows in set (0.00 sec)

from gh-ost.

jonahberquist commented on July 19, 2024

Yeah, I agree with @ggunson. This makes a lot more sense seeing the unique index on column 2. The row with the unique key getting updated and the row with the PK getting deleted is a bit weird to me -- we could have deleted row 4 and updated row 2 and ended up with the same table. I don't really see this as a problem. It's just a bit weird.

Given that it's a correct set of action, if we encountered this, it's basically the same as if we encountered two distinct things:

### DELETE FROM `test`.`samplet`
### WHERE
###   @1=2

and

### UPDATE `test`.`samplet`
### WHERE
###   @1=4
### SET
###   @1=2
###   @2=4
###   @3='replaced 2,4'

And updates and deletes are both things we'd need to be able to handle anyway. The delete is just a delete. The update is just an update. So, not a problem.

Oh. 😯 Wait. 😰

The problem we're talking about now isn't actually in our parsing of the logs. It's in the fact that we'll be issuing REPLACE statements, and they might modify rows differently than just a DELETE and INSERT for the PK. When we were spec'ing this out in person a few weeks ago, I don't remember talking about that aspect of REPLACE behavior.

Is there a time where our ghost table could have data that would lead to a REPLACE replacing more than the single row we want to if that table has a unique secondary key?

Since we'll be applying all of the changes from the binary log in order, I don't think we could, unless the table had already had data that would violate that secondary unique constraint, but this is definitely a corner case that we should make sure we understand.

from gh-ost.

shlomi-noach commented on July 19, 2024

@ggunson the end result -- yes, that's what is supposed to happen. I was more ranting on the way it got there. It could have DELETEd the other row, and then UPDATEd the row with the same PK, so no change to PK.
Or it could have done what happens in reality: issue two DELETEs and one INSERT instead.
The fact it chose to modify a PK value is what upsets me.

from gh-ost.

shlomi-noach commented on July 19, 2024

One implication of parsing the RBR is that we are unable to handle an ALTER to a PK, even if there's another UNIQUE KEY hat is good to go.

@jonahberquist I don't think there's going to be an integrity problem. It's more that given an event such as:

### UPDATE `test`.`samplet`
### WHERE
###   @1=4
### SET
###   @1=2
###   @2=4
###   @3='replaced 2,4'

We would need to be more specific about how to apply this change. It's no longer REPLACE INTO ghost SELECT * FROM original WHERE @1=4, because @1=4 does not exist in the original table anymore. We would instead need to parse the values (in this case), issue a DELETE on @1=4 on the ghost table and then REPLACE INTO ghost SELECT * FROM original WHERE @1=2.

@ggunson envisioned that we would only need to parse the WHERE values and then go apply on the tables based on those, but according to this, we also need to parse the applied values themselves and use them in our logic.

from gh-ost.

ggunson commented on July 19, 2024

I was mostly just pointing out the reason for the behaviour, which wasn't obvious until the secondary unique key was known.

I'm actually a bit weirded out in terms of the MySQL documentation, since it specifically states that a REPLACE is either an INSERT or a DELETE+INSERT, and doesn't say it's any different depending on key violations:

The REPLACE statement returns a count to indicate the number of rows affected. This is the sum of the rows deleted and inserted. If the count is 1 for a single-row REPLACE, a row was inserted and no rows were deleted. If the count is greater than 1, one or more old rows were deleted before the new row was inserted. It is possible for a single row to replace more than one old row if the table contains multiple unique indexes and the new row duplicates values for different old rows in different unique indexes.
(http://dev.mysql.com/doc/refman/5.6/en/replace.html)

However, in my very lame testing with REPLACE, here's what I'm seeing in the row binary logs:

A REPLACE with no key collisions is an INSERT (as expected)
A REPLACE that hits a duplicate primary key becomes DELETE + INSERT (okay too)
A REPLACE that hits a duplicate secondary unique key becomes an UPDATE (wat)
A REPLACE that hits both the primary key and a secondary unique key is DELETE + UPDATE (as @shlomi-noach discovered).

Later on the docs say that, well, the storage engine could be wack, but like, who cares, it's all the same:

It is possible that in the case of a duplicate-key error, a storage engine may perform the REPLACE as an update rather than a delete plus insert, but the semantics are the same.

Which still doesn't point out all the differences we're seeing.

Oddly, my quick test of MyISAM shows it gives more information on the PK plus unique collision situation than the InnoDB version of RBR does (just for testing, I know we don't care about MyISAM):

### DELETE FROM `test`.`samplet2`
### WHERE
###   @1=2 /* INT meta=0 nullable=0 is_null=0 */
###   @2=6 /* INT meta=0 nullable=1 is_null=0 */
###   @3='replace 2, 6' /* VARSTRING(64) meta=64 nullable=1 is_null=0 */
### UPDATE `test`.`samplet2`
### WHERE
###   @1=4 /* INT meta=0 nullable=0 is_null=0 */
###   @2=2 /* INT meta=0 nullable=1 is_null=0 */
###   @3='regular insert of 4,2' /* VARSTRING(64) meta=64 nullable=1 is_null=0 */
### SET
###   @1=2 /* INT meta=0 nullable=0 is_null=0 */
###   @2=2 /* INT meta=0 nullable=1 is_null=0 */
###   @3='replace 2,2' /* VARSTRING(64) meta=64 nullable=1 is_null=0 */
# at 706

from gh-ost.

jonahberquist commented on July 19, 2024

It's no longer REPLACE INTO ghost SELECT * FROM original WHERE @1=4, because @1=4 does not exist in the original table anymore. We would instead need to parse the values (in this case), issue a DELETE on @1=4 on the ghost table and then REPLACE INTO ghost SELECT * FROM original WHERE @1=2.

Yeah, this makes sense to me now. I still don't like it, but at least it makes sense.

Depending on how we want our binlogs, specifically with regard to binlog_row_image, I think we have a few options for handling this.

If we have full row images, when we see an UPDATE we could always parse the WHERE for the PK to delete and then always parse the SET to get the PK to do the REPLACE. (Or do 2 deletes and 1 insert as you mentioned -- in this case we might issue two DELETEs for the same row, which is fine).

With minimal row images, we would only get the columns that change. I noticed this in the sample binlogs you added to #4 (see below). In this case we'd have to check if the PK is listed in the SET columns at all and, if not, just REPLACE using the PK from the WHERE.

### UPDATE `test`.`samplet`
### WHERE
###   @1=5
### SET
###   @3='update'

from gh-ost.

shlomi-noach commented on July 19, 2024

Well, we control the row image. We can force configuring it to full if we choose to.
I'm in fact more inclined at this time to go back to the original proposition of just parsing the query from the RBR log and applying it; in this case I don't care if a REPLACE did or did not cause an UPDATE to a PK.
In the particular case of blobs or anything that seems risky to parse from the RBR, we may resort to applying via INSERT INTO ghost SELECT FROM original.
Hmmm, but the latter means we'll have to comply with this anyway... Further thinking...

from gh-ost.

shlomi-noach commented on July 19, 2024

I'm closing this for now, having nice success with #5

from gh-ost.

parsing mysqlbinlog output about gh-ost HOT 16 CLOSED

Comments (16)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent