Comments (29)

DeanJain commented on August 20, 2024

Thanks team, here is a quick update...

With the fixed V3 version below, we did thorough testing and comparison against the V2 connector and found the following:

  1. Data load performance was 10% faster than with V2.
  2. Aggregator load balancing was working with this new fixed version; V2 does not have this feature.
<dependency>
    <groupId>com.memsql</groupId>
    <artifactId>memsql-spark-connector_2.11</artifactId>
    <version>3.0.5-spark-2.4.4</version>
</dependency>

carlsverre commented on August 20, 2024

Thanks for the reply - we will investigate, repro, and get back to you soon.

carlsverre commented on August 20, 2024

So, before I celebrate too much: it sounds like the issues you had before are resolved, and you have found some improvements over the V2 connector? That's fantastic! Thanks for letting us know!

carlsverre commented on August 20, 2024

@DeanJain thanks for the report! Can you try using the latest version of our connector (3.0.5)? We fixed a pretty big issue related to large-scale ingest which may directly address this problem. If that doesn't help, can you:

  • experiment with different compression settings (a sketch follows below)
  • confirm whether or not you are using the on duplicate key feature
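
For reference, a minimal sketch of experimenting with the compression setting; the option name loadDataCompression and its values are taken from the v3 connector's documented settings, so verify them against the docs for your connector version:

import org.apache.spark.sql.SaveMode

// Sketch only: try one compression codec per experiment and compare load
// times. The v3 docs list values such as GZip, LZ4, and Skip.
// df is the DataFrame being loaded, as in the other snippets in this thread.
df.write
  .format("memsql")
  .option("loadDataCompression", "LZ4")
  .mode(SaveMode.Overwrite)
  .save("database.table")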

Thanks!

DeanJain commented on August 20, 2024

Do you mean this fix commit?

It's a different issue than load balancing: even in the V2 connector all the ingestion requests go to only one aggregator, yet V2 is still 2-3x faster. So even though this load-balancing fix will help improve performance, an apples-to-apples comparison still suggests a bug in V3 that makes it much slower.

carlsverre commented on August 20, 2024

Thanks for the confirmation. Can you provide an example of how you are writing data? I assume you are using the same driver code to load data using each of the connector versions? Also please provide details on the following questions:

  • are you using the on duplicate key feature?
  • which compression setting are you using?
  • provide a list of all non-sensitive MemSQL configuration variables (i.e. excluding passwords) for each of the driver versions

DeanJain commented on August 20, 2024

Yes, it's exactly the same code / same driver on both V2 and V3. Both use the same data write:

dataset.write().format("memsql").mode("overwrite").save(tableName);

  • we are not using "on duplicate key"
  • we are using the default setting for compression

V2 Spark MemSQL config:

  • memsql.host=xxx
  • memsql.port=xxxx
  • memsql.user=xxxx
  • memsql.password=xxxx
  • memsql.defaultDatabase=xxxx

V3 Spark MemSQL config:

  • memsql.ddlEndpoint=host:port
  • memsql.dmlEndpoint=MA,CA:port
  • memsql.user=xxx
  • memsql.password=xxxx
  • memsql.defaultDatabase=xxx
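
A hedged sketch of supplying the V3 settings above when building the session: the option names are copied from the list above, and the spark.datasource.memsql.* prefix follows the v3 connector's documented global configuration, so verify the exact names for your connector version. Values are placeholders.

import org.apache.spark.sql.SparkSession

// Sketch only: global connector options under the documented
// spark.datasource.memsql.* prefix; all values here are placeholders.
val spark = SparkSession.builder()
  .config("spark.datasource.memsql.ddlEndpoint", "host:port")
  .config("spark.datasource.memsql.dmlEndpoint", "host1:port,host2:port")
  .config("spark.datasource.memsql.user", "xxx")
  .config("spark.datasource.memsql.password", "xxxx")
  .config("spark.datasource.memsql.defaultDatabase", "xxx")
  .getOrCreate()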

blinov-ivan commented on August 20, 2024

Hi @DeanJain,
we tested the performance of both the V2 and V3 spark-connector and they came out almost identical (±1-2%).
Such a big difference may be caused by some misconfiguration.
How can we help you resolve this issue?

Best Regards

DeanJain commented on August 20, 2024

Thanks. The difference is visible when you test at high volume: 300 GB+, 2-3 billion records... did you test and compare both at such a high volume?

blinov-ivan commented on August 20, 2024

Hi @DeanJain,
we just finished testing with a high volume (1B rows, 160 GB) and here are the results:
V3: 18.60 minutes; V2: 16.33 minutes.
Could you please provide some details of your test flow?
Are you reading data from some other source and trying to write it to MemSQL? Please provide all the details.
Thanks!

DeanJain commented on August 20, 2024

Hey @blinov-ivan - both the V2 and V3 jobs have the same code except for the connector dependency; both read data from Hive into a Dataset and then save it to MemSQL.

Let me know if you need any more specific details, thanks.
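
For context, a minimal sketch of the flow just described; the table names are placeholders and the write call mirrors the one shared earlier in this thread:

import org.apache.spark.sql.{SaveMode, SparkSession}

// Sketch of the described pipeline: read a Hive-backed table into a
// DataFrame, then save it to MemSQL through the connector.
// "hive_db.source_table" and the target table name are hypothetical.
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val df = spark.table("hive_db.source_table")
df.write
  .format("memsql")
  .mode(SaveMode.Overwrite)
  .save("xxxx_columnstore")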

blinov-ivan commented on August 20, 2024

@DeanJain we continued to test the performance and did not find anything similar to your results.
Currently, our assumption is that the issue could be related to table indices.
V2 by default creates a rowstore table, but V3 creates a columnstore table.
Could you please clarify how you create the tables? Are you writing both to a rowstore table?
And could you also provide the schema of your testing data?
Thanks!
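
As an aside, when the connector itself creates the table, the v3 connector documents tableKey.* options for controlling the generated indices. A hedged sketch follows; the option names follow the v3 docs and the columns are illustrative, so check the documentation for your connector version:

// Sketch only: ask the connector to add a columnstore key and a shard key
// to the table it creates. The column names are placeholders.
df.write
  .format("memsql")
  .option("tableKey.columnstore", "c1")
  .option("tableKey.shard", "xid")
  .mode(SaveMode.Overwrite)
  .save("database.table")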

DeanJain commented on August 20, 2024

Both the V2 and V3 connectors write to the same MemSQL columnstore table. The table is pre-created and already exists on MemSQL; the Spark connector just loads data into this existing table, it does not create it on the fly. The schema is like below:

CREATE TABLE `xxxx_columnstore` (
  `xid` binary(18) NOT NULL,
  ... (remaining columns elided) ...,
  KEY `keyxxx` (`c1`,`c2`,`c3`,`c4`) USING CLUSTERED COLUMNSTORE,
  SHARD KEY `xid` (`xid`)
)

DeanJain commented on August 20, 2024

Both connectors perform the same operation:

dataset.write().format("memsql").mode("overwrite").save("xxxx_columnstore");

carlsverre commented on August 20, 2024

@DeanJain can you send us the full table schema? We just need all the types/indexes/shard keys/etc. - we don't care about the column names.

Thanks

carlsverre commented on August 20, 2024

Oh, it looks like we have your schema on file with our support team. We will use that to try to reproduce your results. Thanks

blinov-ivan commented on August 20, 2024

@DeanJain sorry for taking so long,
but could you also provide the schema of the data frame you are loading?
We ask because we assume that what you're loading into the binary(*) field of MemSQL is not a BinaryType but some other Spark type.
Is that correct?
Thanks

DeanJain commented on August 20, 2024

All columns on the Spark side are DataTypes.StringType, and this driver/common code is the same for both V2 and V3.
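
To illustrate the mismatch being probed here: if every DataFrame column is StringType while the target xid column is binary(18), the values are shipped as text. A hedged sketch of aligning the types before the write, with the column name taken from the schema above:

import org.apache.spark.sql.functions.col

// Sketch: cast the string id column to Spark's BinaryType so it matches
// the binary(18) column in the pre-created table.
val aligned = df.withColumn("xid", col("xid").cast("binary"))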

blinov-ivan commented on August 20, 2024

@DeanJain it looks like that could be the issue.
If you are using mode Overwrite in the V3 spark-connector, it will drop and recreate the table based on your DataFrame schema. In your case, it will create the table with only text-typed fields. Writing to such a table is slower than writing to the schema you provided.
To resolve this issue you could:

  1. Use Append mode instead of Overwrite. In this case you will add new rows instead of overwriting them.
  2. Use the option overwriteBehavior=merge, like this:

df.write
    .format("memsql")
    .mode(SaveMode.Overwrite)
    .option("overwriteBehavior", "merge")
    .save("database.table")

Using this option, the spark-connector won't drop and recreate the table; it will just replace old rows with new ones based on primary keys.
Could you please try these approaches?
Thanks!

DeanJain commented on August 20, 2024

This is what the documentation on your site says: https://docs.singlestore.com/v7.1/third-party-integrations/spark-3-connector/

"SaveMode.Overwrite means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame."

It clearly says that if the table exists it is going to be overwritten... so either this documentation needs to be fixed, or there is a bug in Overwrite: it should not drop and recreate if the table exists, it should just overwrite the data in the existing table.

We also added .option("overwriteBehavior", "truncate") with it, so the assumption was that it should not drop.

Below is what we also tried with V3:

df.write
    .format("memsql")
    .mode(SaveMode.Overwrite)
    .option("overwriteBehavior", "truncate")
    .save("database.table")

Please clarify...

DeanJain commented on August 20, 2024

Please let me know which of the options below you want us to try:

  1. This is what we did in the first place and it did not perform:

     df.write
         .format("memsql")
         .mode(SaveMode.Overwrite)
         .option("overwriteBehavior", "truncate")
         .save("database.table")

  2. df.write
         .format("memsql")
         .mode(SaveMode.Overwrite)
         .option("overwriteBehavior", "merge")
         .save("database.table")

  3. df.write
         .format("memsql")
         .mode(SaveMode.Append)
         .save("database.table")

Which ones do you want us to retry: 1, 2, 3, or all?

carlsverre commented on August 20, 2024

If you are using overwriteBehavior=truncate we will not drop/create, we will only truncate.

If you are using merge we will add missing rows and replace rows which collide on the primary key.

Please retry using overwriteBehavior=merge and let us know how it performs. Thanks

DeanJain commented on August 20, 2024

Sure, will do, thanks.

blinov-ivan commented on August 20, 2024

Hi @DeanJain, do you have any updates on your performance testing?

blinov-ivan commented on August 20, 2024

@DeanJain we will close this issue as there is no answer.
Please reopen this ticket if you have the same issue in the future.

Thanks!

DeanJain commented on August 20, 2024

Hey guys, thanks for your help on this. We are busy with multiple major year-end releases and will get this tested again as soon as possible; I will reopen once I have more details. As of now we are still using the Spark-MemSQL V2 connector and will switch to V3 after we work out this issue.

DeanJain commented on August 20, 2024

@blinov-ivan - we tried the option below and are getting this error. Do you know if we can make this work on a columnstore table with a unique key? Any alternatives?

df.write
    .format("memsql")
    .mode(SaveMode.Overwrite)
    .option("overwriteBehavior", "merge")
    .save("database.table")

java.sql.SQLException: (conn=1484366) Feature 'LOAD DATA REPLACE ON COLUMNAR TABLE WITH UNIQUE KEY INDEX' is not supported by MemSQL.
    at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.get(ExceptionMapper.java:255)
    at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.getException(ExceptionMapper.java:165)
    at org.mariadb.jdbc.MariaDbStatement.executeExceptionEpilogue(MariaDbStatement.java:238)
    at org.mariadb.jdbc.MariaDbStatement.executeInternal(MariaDbStatement.java:356)
Caused by: java.sql.SQLException: Feature 'LOAD DATA REPLACE ON COLUMNAR TABLE WITH UNIQUE KEY INDEX' is not supported by MemSQL.
Query is: LOAD DATA LOCAL INFILE '###.gz' REPLACE INTO TABLE `abc_test` (`act_id`, `org_cd`) MAX_ERRORS 0

blinov-ivan commented on August 20, 2024

@DeanJain it seems that REPLACE doesn't work for columnar tables with a unique key in version 7.1 of SingleStore, but it will be possible in version 7.3.
Meanwhile, you can still try

df.write
    .format("memsql")
    .mode(SaveMode.Append)
    .save("database.table")

or, if you have duplicates in your table, you can use the onDuplicateKeySQL option.
You can read more about this option here. The code should look similar to the below:

df.write
    .format("memsql")
    .option("onDuplicateKeySQL", "id = id")
    .mode(SaveMode.Overwrite)
    .save("database.table")

Hope it works for you!

DeanJain commented on August 20, 2024

Ohh YEAH, time to celebrate! We are moving to production with this new fixed V3 connector; things definitely look much better... thanks again for all your continued support! @carlsverre @blinov-ivan
