Describe the bug Assume a table test</co

By the way, is the create statement lazy? <p dir="auto"

[BUG] Panic when querying table with wrong partition columns order about arrow-datafusion HOT 12 CLOSED

MohamedAbdeen21 commented on June 29, 2024

[BUG] Panic when querying table with wrong partition columns order

from arrow-datafusion.

Comments (12)

MohamedAbdeen21 commented on June 29, 2024 2

Sorry for the late response.

We could potentially return an error earlier by checking the schema of any file present during the create external table execution plan. That way we prevent a user from creating a table with an incorrect schema over top of an existing table with a different schema.

I think this is the solution that makes the most sense from the user's perspective, my only problem is that I feel like it's easier said than done.

I'll try to take a look around if I manage to find some free time. For now, I'll leave the assignment empty in case anyone else is interested in picking it up.

from arrow-datafusion.

MohamedAbdeen21 commented on June 29, 2024 1

Yeah, this is irrelevant of the syntax used.

The new syntax is merely translated to the old syntax as to avoid adding more code to the rest of the system, so as long as the issue is outside the parser, syntax used should be irrelevant.

Just to double-check, here's the same query with the new syntax:

CREATE EXTERNAL TABLE test(name string) PARTITIONED BY (year string, month string) STORED AS csv LOCATION 'tmp/';
INSERT INTO test VALUES('name', '2024', '03');

-- notice the different ordering in the PARTITIONED BY clause
CREATE EXTERNAL TABLE test2(name string) PARTITIONED BY (month string, year string) STORED AS csv LOCATION 'tmp/';
SELECT * FROM test2;

I'd say this is safe to work on regardless of the status of 9599.

from arrow-datafusion.

alamb commented on June 29, 2024 1

By the way, is the create statement lazy?

It is lazy in the sense that it doesn't read data until the table appears in a query. CREATE often does do basic operations like reading metadata to determine schema.

It definitely doesn't buffer all the data into memory, the way an "eager" dataframe API like pandas or polars(eager) would

from arrow-datafusion.

alamb commented on June 29, 2024

Thanks for the report -- I agree this should never panic but rather erorr

from arrow-datafusion.

devinjdangelo commented on June 29, 2024

I haven't dug into exactly what is going wrong here, but it does make me think a breaking change to a hiveQL inspired syntax where partition columns are only declared following the partition by keyword may avoid edge cases like this more easily.

#9599 is open right now to support a new syntax in a backwards compatible way (so would not fix this outright).

Thank your for finding this @MohamedAbdeen21 ! Are you able to trigger a similar panic using the new syntax as well?

from arrow-datafusion.

MohamedAbdeen21 commented on June 29, 2024

By the way, is the create statement lazy?

from arrow-datafusion.

MohamedAbdeen21 commented on June 29, 2024

I'm not sure what to do next regarding this. Do we just prevent the panic and leave it at that? do we perform schema validation if the table exists?

Also keep in mind that if we only prevent the panic, queries such as the following can make a table useless.

CREATE EXTERNAL TABLE test(name string) PARTITIONED BY (year string, month string) STORED AS parquet LOCATION 'tmp/';
-- notice the different ordering in the PARTITIONED BY clause
INSERT INTO test VALUES('name', '2024', '03');

CREATE EXTERNAL TABLE test2(name string) PARTITIONED BY (month string, year string) STORED AS parquet LOCATION 'tmp/';

-- now table have both partitions (year -> month and month -> year)
INSERT INTO test2 VALUES('name', '2024', '03');

-- both fail
SELECT * FROM test;
SELECT * FROM test2;

from arrow-datafusion.

devinjdangelo commented on June 29, 2024

I'm not sure what to do next regarding this. Do we just prevent the panic and leave it at that? do we perform schema validation if the table exists?

We should prevent the panic, but I think it is reasonable behavior to fail the query and return an error. The examples in this issue are creating two tables with different schemas in the same external storage location and writing to both. This effectively corrupts both tables, since now there are two different table definitions mixed in the same physical location. After the inserts, the directory looks like:

/year=2024/month=03
/month=2024/year=03

When the datafusion read path encounters this, the two options would be:

Silently ignore hive style paths which do not conform to the expected schema (so if the partitions are year/month, only scan within the outer year folder, ignoring the outer month folder)
Return an error that unexpected paths were encountered

Depending on the context, I could see either behavior being desired. We could perhaps provide a configuration option that allows users to control how Datafusion will handle unexpected partition paths.

from arrow-datafusion.

MohamedAbdeen21 commented on June 29, 2024

I don't like the first option. If one person creates a table partitioned on year/month and another person reads it as month/year, the second person won't see any updates the first person have done, and vice versa.

I think inserts using different partitions should be rejected, but again, that requires changing both insert and select statements.

I'm trying to think about which will be easier for users to debug and for us to maintain, error-ing on creation or on queries after? Or maybe don't show any errors, but re-order the partition columns in the create statement to match the data on disk

from arrow-datafusion.

devinjdangelo commented on June 29, 2024

I don't like the first option. If one person creates a table partitioned on year/month and another person reads it as month/year, the second person won't see any updates the first person have done, and vice versa.

I see the two tables as distinct tables. They happen to share the same base file path, but reordering the partitions changes the schema and where the files are actually stored within the base path.

You can run into similar issues with non partitioned tables if you write different schemas with different column orderings to the same base path.

We could potentially return an error earlier by checking the schema of any file present during the create external table execution plan. That way we prevent a user from creating a table with an incorrect schema over top of an existing table with a different schema.

from arrow-datafusion.

samuelcolvin commented on June 29, 2024

Would be great to have a new release, I've just run into this too.

from arrow-datafusion.

alamb commented on June 29, 2024

Would be great to have a new release, I've just run into this too.

Added to #10217 tracker

from arrow-datafusion.

[BUG] Panic when querying table with wrong partition columns order about arrow-datafusion HOT 12 CLOSED

Comments (12)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent