Comments (10)

jwimberl commented on August 20, 2024

Oops, apologies for the misleading noise in my previous comment -- I can't exactly reproduce what I did wrong but I must have run some subtly different query when using the DataFusion python modules. The panic occurs identically in 34 and 36, whether using the python module or my project using the equivalent crates.

from arrow-datafusion.

alamb commented on August 20, 2024

Hi @jwimberl -- I wonder if this is related to #7848 which was fixed in #8020 by @korowa

What was happening there was that the entire join output was created in a single record batch.
It looks like 34.0.0 was released prior to #8020.

Any chance you can try this with a newer version of DataFusion?

jwimberl commented on August 20, 2024

Yes, is 36.0.0 the first version to include this fix?

alamb commented on August 20, 2024

> Yes, is 36.0.0 the first version to include this fix?

Yes, that is my understanding of the release notes: https://github.com/apache/arrow-datafusion/blob/main/dev/changelog/36.0.0.md#3600-2024-02-16

jwimberl commented on August 20, 2024

I'm not able to run the query with 36.0.0. Perhaps there was some breaking change in 35.0.0 or 36.0.0 (can't find one that looks relevant in the release notes) that affected the syntax for how hive-partitioned parquet datasets are registered? Because now, after loading one such dataset as an external table, via e.g.

CREATE EXTERNAL TABLE test
STORED AS PARQUET
PARTITIONED BY (part1, part2)
LOCATION '/path/to/dataset/*/*/chunk.parquet'

the columns of the table are only the hive partition columns part1 and part2, and the columns of the parquet files themselves are not present in the session context table. Is the [ (<column_definition>) ] part of the external table syntax now required in cases like this?
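For context on what "hive partition columns" means here: in a Hive-style layout, each partition column is encoded as a `key=value` directory name in the file path rather than stored inside the parquet file. The following is a minimal sketch of that convention using only the Python standard library. The function name `hive_partition_values` and the example path are hypothetical, and this is not how DataFusion itself parses paths.

```python
from pathlib import PurePosixPath


def hive_partition_values(file_path: str) -> dict[str, str]:
    """Collect key=value directory components from a Hive-style path.

    Illustrative only: a real engine also handles escaping, type
    inference, and validation against the declared partition columns.
    """
    values: dict[str, str] = {}
    # Skip the last component: it is the file name, not a partition directory.
    for part in PurePosixPath(file_path).parts[:-1]:
        key, sep, value = part.partition("=")
        if sep and key:
            values[key] = value
    return values


# Hypothetical path following the layout described in the comment above.
print(hive_partition_values("/path/to/dataset/part1=a/part2=b/chunk.parquet"))
# prints {'part1': 'a', 'part2': 'b'}
```

The remaining columns of the table would come from the schema of the parquet files themselves, which is the part reported missing above.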

alamb commented on August 20, 2024

I am not quite sure what is going on here -- maybe @devinjdangelo or @metesynnada remembers

37.0.0 also has significant changes in these areas so maybe things will work in 37.0.0 🤔

jwimberl commented on August 20, 2024

If it is helpful I can try to produce a self-contained repro

alamb commented on August 20, 2024

> If it is helpful I can try to produce a self-contained repro

Yes, that would be most helpful. I am not sure how to make this ticket actionable otherwise.

jwimberl commented on August 20, 2024

Perhaps the issue was with the syntax for the partitioned parquet file location(s). I did not change the wildcard-based syntax I had been using before, shown at the top of this issue. However, I now see in the docs that the location need only be the parent directory of the top-level partition:

https://arrow.apache.org/datafusion/user-guide/sql/ddl.html#cautions-when-using-the-with-order-clause

Using a version of the NYC taxi data from that example (I actually just downloaded one file), I verified that the wildcard-based location doesn't produce an error during table creation, but it does not work either. The symptom is slightly different -- the table is completely empty rather than missing the file columns -- but maybe it's the same issue.

## $ wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet
--2024-04-04 18:38:48--  https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet
Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 13.32.192.124, 13.32.192.116, 13.32.192.190, ...
Connecting to d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)|13.32.192.124|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 49961641 (48M) [binary/octet-stream]
Saving to: ‘yellow_tripdata_2024-01.parquet’

yellow_tripdata_2024-01.parquet           100%[=====================================================================================>]  47.65M  79.9MB/s    in 0.6s

2024-04-04 18:38:48 (79.9 MB/s) - ‘yellow_tripdata_2024-01.parquet’ saved [49961641/49961641]

## $ mkdir -p year=2024/month=01
## $ mv yellow_tripdata_2024-01.parquet year\=2024/month\=01/tripdata.parquet
## $ python3
Python 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:53:32) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import datafusion as df
>>> df.__version__
'36.0.0'
>>> ctx = df.SessionContext()
>>> ctx.sql("""
... CREATE EXTERNAL TABLE taxi
... STORED AS PARQUET
... PARTITIONED BY (year, month)
... LOCATION '/path/to/nyctaxi';
... """)
DataFrame()
++
++
>>> ctx.sql("SELECT COUNT(*) FROM taxi;")
DataFrame()
+----------+
| COUNT(*) |
+----------+
| 2964624  |
+----------+
>>> ctx.sql("""
... CREATE EXTERNAL TABLE taxi2
... STORED AS PARQUET
... PARTITIONED BY (year, month)
... LOCATION '/path/to/nyctaxi/*/*/tripdata.parquet';
... """)
DataFrame()
++
++
>>> ctx.sql("SELECT COUNT(*) FROM taxi2;")
DataFrame()
+----------+
| COUNT(*) |
+----------+
| 0        |
+----------+
>>>

I don't know whether the wildcard syntax I used is what this DDL documentation page used to show, or whether I got the syntax wrong from the get-go and it just happened to be an unsupported format that worked up until now. Naturally I'm attempting the query again with my original dataset, but since it's a rather large amount of data, re-creating the external table will take some time.
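If the directory form of LOCATION is the one to rely on, an existing wildcard location can be converted mechanically by keeping only the leading path components that contain no glob characters. The sketch below assumes that the wildcards only stand in for the partition directories and the file name, as in the two examples in this thread. The helper name `partition_root` is hypothetical and is not part of DataFusion.

```python
from pathlib import PurePosixPath

# Characters that mark a path component as a glob pattern.
GLOB_CHARS = set("*?[")


def partition_root(location: str) -> str:
    """Return the leading part of `location` that has no glob characters.

    A location without wildcards is returned unchanged (normalized).
    """
    kept = []
    for part in PurePosixPath(location).parts:
        if GLOB_CHARS & set(part):
            break  # the first wildcard component ends the fixed prefix
        kept.append(part)
    return str(PurePosixPath(*kept)) if kept else "."


print(partition_root("/path/to/nyctaxi/*/*/tripdata.parquet"))
# prints /path/to/nyctaxi
```

The result is the value used for the `taxi` table above, which returned the expected row count, whereas the wildcard form used for `taxi2` returned zero rows.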

jwimberl commented on August 20, 2024

OK, with DataFusion 36.0.0, I still get a panic when running this query in its original context -- which is a Rust module using the datafusion crate and its dependencies. However, I don't get it when using either the DataFusion 34.0.0 or 36.0.0 Python modules (which I understand are wrappers around the same Rust code). Perhaps this points to some interaction specific to the set of dependency versions I'm using, which differs from the set used by the Python module? I can provide the cargo tree that I am using, build tool versions, or other information that might be helpful.
