Comments (5)

PeterFaiman commented on August 20, 2024

This can happen when two jobs run at the same time and each tries to commit more than one row sharing primary keys with the other. A single overlapping key is fine: if job 1 wants to write to PK 3, and job 2 also wants to write to PK 3, one simply waits for the other. But if both want to write to PK 3 and PK 7, they can deadlock if they acquire those rows in different orders.

For example (sketched in code after this list):

  1. job 1 writes to PK 3
  2. job 2 writes to PK 7
  3. job 1 tries to write to PK 7 and blocks
  4. job 2 tries to write to PK 3 and blocks
  5. deadlock
  6. one job experiences lock wait timeout
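
A quick way to reproduce this interleaving outside Spark is two threads racing plain JDBC transactions. This is only an illustrative sketch: the table `t`, the connection string, and the sleep-based timing are assumptions, not anything from the connector.

```scala
import java.sql.DriverManager

// Each worker updates two rows inside one transaction, in the given order.
def worker(first: Int, second: Int): Thread = new Thread(() => {
  // Connection details and table are assumptions for illustration.
  val conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "root", "")
  try {
    conn.setAutoCommit(false)
    val st = conn.createStatement()
    st.executeUpdate(s"UPDATE t SET v = v + 1 WHERE pk = $first")
    Thread.sleep(1000) // give the other worker time to lock its first row
    st.executeUpdate(s"UPDATE t SET v = v + 1 WHERE pk = $second") // circular wait here
    conn.commit()
  } finally conn.close()
})

// Worker 1 locks PK 3 then PK 7; worker 2 locks PK 7 then PK 3.
// One of them ends up a deadlock victim or hits a lock wait timeout.
val t1 = worker(3, 7)
val t2 = worker(7, 3)
t1.start(); t2.start()
t1.join(); t2.join()
```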

You can resolve this in two ways:

  1. don't do that
  2. sort by primary key in your job before you call .saveToMemSQL

Doing option 2 results in a new sequence of events (a code sketch follows the list):

  1. job 1 writes to PK 3
  2. job 2 tries to write to PK 3 and blocks
  3. job 1 writes to PK 7
  4. job 1 finishes
  5. job 2 unblocks and writes to PK 3 and PK 7
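
As a minimal sketch of option 2, assuming the RDD-based `.saveToMemSQL` from the old memsql-spark-connector and a `dstream` of Spark SQL `Row`s; the database and table arguments and the primary-key column position are illustrative, so check the exact signature in your connector version:

```scala
import org.apache.spark.sql.Row

// Sort each micro-batch by primary key before saving, so every
// transaction acquires row locks in ascending-key order.
dstream.foreachRDD { rdd =>
  // Assumption: the primary key is the first field of each Row.
  val sorted = rdd.sortBy((row: Row) => row.getInt(0))
  // Argument list is illustrative; .saveToMemSQL comes from the connector.
  sorted.saveToMemSQL("mydb", "mytable")
}
```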

Shasidhar commented on August 20, 2024

@rick-memsql any update on this issue? It looks like a basic problem of handling upserts (a combination of insert and update), which Cassandra supports by default.

Shasidhar commented on August 20, 2024

@PeterFaiman

Thanks for your detailed explanation.

Regarding the two ways you suggested:

  1. Don't do that -> Not possible; we want to update the data, and we are just running batch jobs under a DStream. The programmer has no real control here.
  2. Sort by primary key -> Since there are no direct stream APIs for the connector, we have to drop down to the RDDs under the DStream and sort there. But in a streaming job this extra sorting adds latency to the processing; sorting an RDD is a fairly costly operation.

PeterFaiman commented on August 20, 2024

It only matters that the keys within one transaction are sorted. So if you can pull 10k keys at a time from the final stage of the job, save those, and repeat, that will work just as well. It will also dramatically reduce sorting time.
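
A sketch of that chunked approach; the 10k batch size, the primary-key position, and the `saveChunkInOneTransaction` helper are all hypothetical. The point is that keys only need to be ordered within each transaction:

```scala
// Sort and save ~10k rows at a time: each chunk becomes one transaction
// whose keys are in ascending order, so concurrent jobs acquiring locks
// in the same ascending order cannot form a circular wait, and sorting
// small chunks is far cheaper than sorting the whole RDD.
rdd.foreachPartition { rows =>
  rows.grouped(10000).foreach { chunk =>
    val sorted = chunk.sortBy(row => row.getInt(0)) // assume PK is field 0
    saveChunkInOneTransaction(sorted) // hypothetical helper: one upsert transaction
  }
}
```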

lucyyu commented on August 20, 2024

The workaround above should work.
