Comments (5)

PeterFaiman commented on August 20, 2024

This can happen when two jobs run at the same time and each tries to commit more than one row sharing primary keys with the other. A single overlapping key is fine: if job 1 wants to write to PK 3, and job 2 also wants to write to PK 3, one simply waits for the other. But if both want to write to PK 3 and PK 7, they can deadlock if they acquire those rows in different orders.

For example (sketched in code after this list):

  1. job 1 writes to PK 3
  2. job 2 writes to PK 7
  3. job 1 tries to write to PK 7 and blocks
  4. job 2 tries to write to PK 3 and blocks
  5. deadlock
  6. one job experiences lock wait timeout
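
A quick way to reproduce this interleaving outside Spark is two threads racing plain JDBC transactions. This is only an illustrative sketch: the table `t`, the connection string, and the sleep-based timing are assumptions, not anything from the connector.

```scala
import java.sql.DriverManager

// Each worker updates two rows inside one transaction, in the given order.
def worker(first: Int, second: Int): Thread = new Thread(() => {
  // Connection details and table are assumptions for illustration.
  val conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "root", "")
  try {
    conn.setAutoCommit(false)
    val st = conn.createStatement()
    st.executeUpdate(s"UPDATE t SET v = v + 1 WHERE pk = $first")
    Thread.sleep(1000) // give the other worker time to lock its first row
    st.executeUpdate(s"UPDATE t SET v = v + 1 WHERE pk = $second") // circular wait here
    conn.commit()
  } finally conn.close()
})

// Worker 1 locks PK 3 then PK 7; worker 2 locks PK 7 then PK 3.
// One of them ends up a deadlock victim or hits a lock wait timeout.
val t1 = worker(3, 7)
val t2 = worker(7, 3)
t1.start(); t2.start()
t1.join(); t2.join()
```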

You can resolve this in two ways:

  1. don't do that
  2. sort by primary key in your job before you call .saveToMemSQL

Doing option 2 results in a new sequence of events (a code sketch follows the list):

  1. job 1 writes to PK 3
  2. job 2 tries to write to PK 3 and blocks
  3. job 1 writes to PK 7
  4. job 1 finishes
  5. job 2 unblocks and writes to PK 3 and PK 7
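
As a minimal sketch of option 2, assuming the RDD-based `.saveToMemSQL` from the old memsql-spark-connector and a `dstream` of Spark SQL `Row`s; the database and table arguments and the primary-key column position are illustrative, so check the exact signature in your connector version:

```scala
import org.apache.spark.sql.Row

// Sort each micro-batch by primary key before saving, so every
// transaction acquires row locks in ascending-key order.
dstream.foreachRDD { rdd =>
  // Assumption: the primary key is the first field of each Row.
  val sorted = rdd.sortBy((row: Row) => row.getInt(0))
  // Argument list is illustrative; .saveToMemSQL comes from the connector.
  sorted.saveToMemSQL("mydb", "mytable")
}
```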

Shasidhar commented on August 20, 2024

@rick-memsql any update on this issue? It looks like a basic problem of handling upserts (a combination of insert and update), which Cassandra supports by default.

Shasidhar commented on August 20, 2024

@PeterFaiman

Thanks for your detailed explanation.

Regarding the two ways you suggested:

  1. Don't do that -> Not possible; we want to update the data, and we are just running batch jobs under a DStream. The programmer has no real control here.
  2. Sort by primary key -> Since there are no direct stream APIs for the connector, we have to drop down to the RDDs under the DStream and sort there. But in a streaming job this extra sorting adds latency to the processing; sorting an RDD is a fairly costly operation.

PeterFaiman commented on August 20, 2024

It only matters that the keys within one transaction are sorted. So if you can pull 10k keys at a time from the final stage of the job, save those, and repeat, that will work just as well. It will also dramatically reduce sorting time.
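
A sketch of that chunked approach; the 10k batch size, the primary-key position, and the `saveChunkInOneTransaction` helper are all hypothetical. The point is that keys only need to be ordered within each transaction:

```scala
// Sort and save ~10k rows at a time: each chunk becomes one transaction
// whose keys are in ascending order, so concurrent jobs acquiring locks
// in the same ascending order cannot form a circular wait, and sorting
// small chunks is far cheaper than sorting the whole RDD.
rdd.foreachPartition { rows =>
  rows.grouped(10000).foreach { chunk =>
    val sorted = chunk.sortBy(row => row.getInt(0)) // assume PK is field 0
    saveChunkInOneTransaction(sorted) // hypothetical helper: one upsert transaction
  }
}
```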

lucyyu commented on August 20, 2024

The workaround above should work.
