Describe the bug I am attempting to optimize an inner join on two


inner join involving hive-partitioned parquet dataset and filters on LHS and RHS causes panic

jwimberl commented on August 20, 2024

inner join involving hive-partitioned parquet dataset and filters on LHS and RHS causes panic

from arrow-datafusion.

Comments (10)

jwimberl commented on August 20, 2024

Oops, apologies for the misleading noise in my previous comment -- I can't exactly reproduce what I did wrong, but I must have run some subtly different query when using the DataFusion Python modules. The panic occurs identically in 34 and 36, whether using the Python module or my project using the equivalent crates.

alamb commented on August 20, 2024

Hi @jwimberl -- I wonder if this is related to #7848 which was fixed in #8020 by @korowa

What was happening there was that the entire join output was created in a single record batch.
It looks like 34.0.0 was released prior to #8020.

Any chance you can try this with a newer version of DataFusion?
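
To make the distinction above concrete, here is a toy Python sketch of an inner hash join that emits its output in bounded batches instead of materializing every matching row in one batch. It is an illustration under assumptions, not DataFusion's implementation: `hash_join_batched`, the `Row` shape, and the `batch_size` parameter are hypothetical names chosen for this example.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Row = Tuple[int, str]  # (join key, payload); hypothetical row shape


def hash_join_batched(
    left: List[Row], right: List[Row], batch_size: int
) -> Iterator[List[Tuple[int, str, str]]]:
    """Inner-join on the key, yielding batches of at most `batch_size` rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    # Build phase: index the left side by join key.
    build: Dict[int, List[str]] = defaultdict(list)
    for key, payload in left:
        build[key].append(payload)

    # Probe phase: stream the right side, flushing whenever the batch is full.
    batch: List[Tuple[int, str, str]] = []
    for key, right_payload in right:
        for left_payload in build.get(key, ()):
            batch.append((key, left_payload, right_payload))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # emit the final, possibly short, batch
        yield batch


left = [(1, "a"), (1, "b"), (2, "c")]
right = [(1, "x"), (1, "y"), (3, "z")]
batches = list(hash_join_batched(left, right, batch_size=3))
```

With a very large number of matching rows, the single-batch variant would have to hold the whole result at once, while the batched variant never holds more than `batch_size` output rows at a time.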

jwimberl commented on August 20, 2024

Yes, is 36.0.0 the first version to include this fix?

alamb commented on August 20, 2024

> Yes, is 36.0.0 the first version to include this fix?

Yes, that is my understanding of the release notes: https://github.com/apache/arrow-datafusion/blob/main/dev/changelog/36.0.0.md#3600-2024-02-16

jwimberl commented on August 20, 2024

I'm not able to run the query with 36.0.0. Perhaps there was some breaking change in 35.0.0 or 36.0.0 (can't find one that looks relevant in the release notes) that affected the syntax for how hive-partitioned parquet datasets are registered? Because now, after loading one such dataset as an external table, via e.g.

CREATE EXTERNAL TABLE test
STORED AS PARQUET
PARTITIONED BY (part1, part2)
LOCATION '/path/to/dataset/*/*/chunk.parquet'

the columns of the table are only the hive partition columns part1 and part2, and the columns of the parquet files themselves are not present in the session context table. Is the [ (<column_definition>) ] part of the external table syntax now required in cases like this?

alamb commented on August 20, 2024

I am not quite sure what is going on here -- maybe @devinjdangelo or @metesynnada remembers

37.0.0 also has significant changes in these areas so maybe things will work in 37.0.0 🤔

jwimberl commented on August 20, 2024

If it is helpful I can try to produce a self-contained repro

alamb commented on August 20, 2024

> If it is helpful I can try to produce a self-contained repro

Yes, that would be most helpful. I am not sure how to make this ticket actionable otherwise.

jwimberl commented on August 20, 2024

Perhaps the issue was with the syntax for the partitioned parquet file location(s). I did not change the wildcard-based syntax I had been using before, shown at the top of this issue. However, I see in the docs now that the location need only be the parent directory of the top-level partition:

https://arrow.apache.org/datafusion/user-guide/sql/ddl.html#cautions-when-using-the-with-order-clause

Using a version of that example NYC taxi data (I actually just downloaded one file), I did verify that the wildcard-based location does not produce an error during table creation, but the resulting table does not work. The symptom is slightly different -- the table is completely empty -- but maybe it's the same issue:

## $ wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet
--2024-04-04 18:38:48--  https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet
Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 13.32.192.124, 13.32.192.116, 13.32.192.190, ...
Connecting to d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)|13.32.192.124|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 49961641 (48M) [binary/octet-stream]
Saving to: ‘yellow_tripdata_2024-01.parquet’

yellow_tripdata_2024-01.parquet           100%[=====================================================================================>]  47.65M  79.9MB/s    in 0.6s

2024-04-04 18:38:48 (79.9 MB/s) - ‘yellow_tripdata_2024-01.parquet’ saved [49961641/49961641]

## $ mkdir -p year=2024/month=01
## $ mv yellow_tripdata_2024-01.parquet year\=2024/month\=01/tripdata.parquet
## $ python3
Python 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:53:32) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import datafusion as df
>>> df.__version__
'36.0.0'
>>> ctx = df.SessionContext()
>>> ctx.sql("""
... CREATE EXTERNAL TABLE taxi
... STORED AS PARQUET
... PARTITIONED BY (year, month)
... LOCATION '/path/to/nyctaxi';
... """)
DataFrame()
++
++
>>> ctx.sql("SELECT COUNT(*) FROM taxi;")
DataFrame()
+----------+
| COUNT(*) |
+----------+
| 2964624  |
+----------+
>>> ctx.sql("""
... CREATE EXTERNAL TABLE taxi2
... STORED AS PARQUET
... PARTITIONED BY (year, month)
... LOCATION '/path/to/nyctaxi/*/*/tripdata.parquet';
... """)
DataFrame()
++
++
>>> ctx.sql("SELECT COUNT(*) FROM taxi2;")
DataFrame()
+----------+
| COUNT(*) |
+----------+
| 0        |
+----------+
>>>

I don't know if the wildcard syntax I used was what this DDL documentation page used to show, or if I got the syntax wrong from the get-go and it just happened to be an unsupported format that worked up until now. Naturally I'm attempting the query again with my original dataset, but since it's a rather large amount of data, re-creating the external table will take some time.
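
For reference, the session above relies on the Hive-style `key=value` directory layout (`year=2024/month=01`). The standard-library sketch below shows how partition values can be read off such a path relative to the table's root directory. It only illustrates the layout and is not how DataFusion resolves `LOCATION`; `partition_values` is a hypothetical helper written for this example.

```python
from pathlib import PurePosixPath
from typing import Dict, List


def partition_values(root: str, file_path: str, partition_cols: List[str]) -> Dict[str, str]:
    """Extract Hive-style partition values from the directories between `root` and the file."""
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(root))
    found = {}
    for segment in relative.parts[:-1]:  # skip the file name itself
        key, sep, value = segment.partition("=")
        if sep:  # only `key=value` directories count as partitions
            found[key] = value
    missing = [col for col in partition_cols if col not in found]
    if missing:
        raise ValueError(f"missing partition columns in path: {missing}")
    # Return the values in the order the partition columns were declared.
    return {col: found[col] for col in partition_cols}


values = partition_values(
    "/path/to/nyctaxi",
    "/path/to/nyctaxi/year=2024/month=01/tripdata.parquet",
    ["year", "month"],
)
print(values)  # {'year': '2024', 'month': '01'}
```

The point of the sketch is that the partition values are derived relative to the table root, which is why the root directory is the natural thing to pass as the location.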

jwimberl commented on August 20, 2024

OK, with DataFusion 36.0.0, I still get a panic when running this query in its original context, which is a Rust module using the datafusion crate and its dependencies. However, I don't get it when using either the DataFusion 34.0.0 or 36.0.0 Python modules (which I understand are wrappers around the same Rust code). Perhaps this points to some interaction that occurs for the set of dependency versions I'm using that differs from the set involved in the Python module? I can provide the cargo tree that I am using, build tool versions, or other information that might be helpful.
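
One possible way to compare dependency sets is to scan the text output of `cargo tree` for the relevant crates and flag any that resolve to more than one version. The sketch below is a rough standard-library helper under assumptions: `crate_versions` is a hypothetical function, and the sample text is a made-up excerpt rather than the real dependency tree of any DataFusion release.

```python
import re
from typing import Dict, Iterable, Set

# Matches "<crate> v<version>" anywhere in a `cargo tree` line.
CRATE_RE = re.compile(r"([A-Za-z0-9_-]+) v(\d+\.\d+\.\d+[^\s]*)")


def crate_versions(tree_output: str, prefixes: Iterable[str]) -> Dict[str, Set[str]]:
    """Collect every version seen for crates whose name starts with one of `prefixes`."""
    prefixes = tuple(prefixes)
    versions: Dict[str, Set[str]] = {}
    for line in tree_output.splitlines():
        match = CRATE_RE.search(line)
        if match and match.group(1).startswith(prefixes):
            versions.setdefault(match.group(1), set()).add(match.group(2))
    return versions


# Hypothetical excerpt; real output comes from running `cargo tree` in the project.
sample = """\
my-project v0.1.0 (/path/to/my-project)
├── datafusion v36.0.0
│   ├── arrow v50.0.0
│   └── parquet v50.0.0
└── arrow v49.0.0
"""
found = crate_versions(sample, ["datafusion", "arrow", "parquet"])
# Crates that appear with more than one version are candidates for a mismatch.
duplicates = {name: vers for name, vers in found.items() if len(vers) > 1}
```

In this made-up sample, `duplicates` contains only `arrow`, which is the kind of version split that would be worth reporting alongside the panic.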
