Comments (12)
Sorry for the late response.
We could potentially return an error earlier by checking the schema of any file present during the create external table execution plan. That way we prevent a user from creating a table with an incorrect schema over top of an existing table with a different schema.
I think this is the solution that makes the most sense from the user's perspective, my only problem is that I feel like it's easier said than done.
I'll try to take a look around if I manage to find some free time. For now, I'll leave the assignment empty in case anyone else is interested in picking it up.
from arrow-datafusion.
Yeah, this is irrelevant of the syntax used.
The new syntax is merely translated to the old syntax as to avoid adding more code to the rest of the system, so as long as the issue is outside the parser, syntax used should be irrelevant.
Just to double-check, here's the same query with the new syntax:
CREATE EXTERNAL TABLE test(name string) PARTITIONED BY (year string, month string) STORED AS csv LOCATION 'tmp/';
INSERT INTO test VALUES('name', '2024', '03');
-- notice the different ordering in the PARTITIONED BY clause
CREATE EXTERNAL TABLE test2(name string) PARTITIONED BY (month string, year string) STORED AS csv LOCATION 'tmp/';
SELECT * FROM test2;
I'd say this is safe to work on regardless of the status of 9599.
from arrow-datafusion.
By the way, is the create statement lazy?
It is lazy in the sense that it doesn't read data until the table appears in a query. CREATE
often does do basic operations like reading metadata to determine schema.
It definitely doesn't buffer all the data into memory, the way an "eager" dataframe API like pandas or polars(eager) would
from arrow-datafusion.
Thanks for the report -- I agree this should never panic but rather erorr
from arrow-datafusion.
I haven't dug into exactly what is going wrong here, but it does make me think a breaking change to a hiveQL inspired syntax where partition columns are only declared following the partition by keyword may avoid edge cases like this more easily.
#9599 is open right now to support a new syntax in a backwards compatible way (so would not fix this outright).
Thank your for finding this @MohamedAbdeen21 ! Are you able to trigger a similar panic using the new syntax as well?
from arrow-datafusion.
By the way, is the create statement lazy?
from arrow-datafusion.
I'm not sure what to do next regarding this. Do we just prevent the panic and leave it at that? do we perform schema validation if the table exists?
Also keep in mind that if we only prevent the panic, queries such as the following can make a table useless.
CREATE EXTERNAL TABLE test(name string) PARTITIONED BY (year string, month string) STORED AS parquet LOCATION 'tmp/';
-- notice the different ordering in the PARTITIONED BY clause
INSERT INTO test VALUES('name', '2024', '03');
CREATE EXTERNAL TABLE test2(name string) PARTITIONED BY (month string, year string) STORED AS parquet LOCATION 'tmp/';
-- now table have both partitions (year -> month and month -> year)
INSERT INTO test2 VALUES('name', '2024', '03');
-- both fail
SELECT * FROM test;
SELECT * FROM test2;
from arrow-datafusion.
I'm not sure what to do next regarding this. Do we just prevent the panic and leave it at that? do we perform schema validation if the table exists?
We should prevent the panic, but I think it is reasonable behavior to fail the query and return an error. The examples in this issue are creating two tables with different schemas in the same external storage location and writing to both. This effectively corrupts both tables, since now there are two different table definitions mixed in the same physical location. After the inserts, the directory looks like:
/year=2024/month=03
/month=2024/year=03
When the datafusion read path encounters this, the two options would be:
- Silently ignore hive style paths which do not conform to the expected schema (so if the partitions are year/month, only scan within the outer year folder, ignoring the outer month folder)
- Return an error that unexpected paths were encountered
Depending on the context, I could see either behavior being desired. We could perhaps provide a configuration option that allows users to control how Datafusion will handle unexpected partition paths.
from arrow-datafusion.
I don't like the first option. If one person creates a table partitioned on year/month and another person reads it as month/year, the second person won't see any updates the first person have done, and vice versa.
I think inserts using different partitions should be rejected, but again, that requires changing both insert and select statements.
I'm trying to think about which will be easier for users to debug and for us to maintain, error-ing on creation or on queries after? Or maybe don't show any errors, but re-order the partition columns in the create statement to match the data on disk
from arrow-datafusion.
I don't like the first option. If one person creates a table partitioned on year/month and another person reads it as month/year, the second person won't see any updates the first person have done, and vice versa.
I see the two tables as distinct tables. They happen to share the same base file path, but reordering the partitions changes the schema and where the files are actually stored within the base path.
You can run into similar issues with non partitioned tables if you write different schemas with different column orderings to the same base path.
We could potentially return an error earlier by checking the schema of any file present during the create external table execution plan. That way we prevent a user from creating a table with an incorrect schema over top of an existing table with a different schema.
from arrow-datafusion.
Would be great to have a new release, I've just run into this too.
from arrow-datafusion.
Would be great to have a new release, I've just run into this too.
Added to #10217 tracker
from arrow-datafusion.
Related Issues (20)
- [Regression] Query using ARRAY_AGG(DISTINCT) causes panic HOT 5
- Add `ProgressiveEval` operator for optimize `SortPreservingMerge` HOT 3
- SortMergeJoin: The query stuck when join filter is set and more matched rows than batch size
- API to get all `Column` references in an `Expr` without cloning `Columns`
- Strengthen TypeSignature and Coercion rule.
- Excessive memory consumption on sorting HOT 3
- feat: enable optional UDF arguments in `regexp_*` functions HOT 1
- cannot import datafusion-37.1.0 in python 3.8 of windows 7 x64 HOT 2
- Release DataFusion `39.0.0`
- `ScalarVariable` Expr --> String Support
- `IsNull` / `IsNotNull` Expr --> String Support HOT 2
- `TryCast` Expr --> String Support
- `GroupingSet` Expr --> String Support HOT 5
- `Placeholder` Expr --> String Support HOT 1
- `OuterColumnReference` Expr --> String Support HOT 1
- Add an example of how to use the SQL parser/unparser API HOT 2
- Support Substrait VirtualTables
- Discussion: make it easier for specify SQL --> function translation HOT 3
- Create a DataFusion blog HOT 3
- `analysis.rs` bounds check panic HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from arrow-datafusion.