Comments (12)
I believe this issue can now be closed: as of apache/arrow-rs#2500, parquet has full support for arbitrarily nested types. Feel free to reopen if I have missed something.
from datafusion.
> Should the ability to read parquets with nested objects be implemented (only panicking if transformations utilizing that field)

This seems like a valuable addition to me (allowing queries on parquet files that contain nested objects, as long as those objects are not read).

> , or would it be better to just work on this issue as a whole?

Well, of course, supporting queries on the nested data would be better than just not crashing/erroring when it isn't read :) I think the choice of approach is probably best determined by whoever implements this feature.
from datafusion.
See also the recent blog posts on this topic:
https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/
https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/
https://arrow.apache.org/blog/2022/10/17/arrow-parquet-encoding-part-3/
from datafusion.
Thank you for the report @ShraddhaKishan -- would it be possible to file a ticket in https://github.com/apache/arrow-rs with a reproducer (or at least the parquet file that cannot be read)?
from datafusion.
Sure thing.
from datafusion.
Comment from Wes McKinney(wesm) @ 2019-03-14T22:28:39.984+0000:
This is a fairly tricky task (we still don't have this fully done in C++). I'm moving this to 0.14, as I expect it to take a little time.
Comment from Neville Dipale(nevi_me) @ 2020-11-28T12:41:50.557+0000:
[~andygrove] I'm going through old PRs and closing them. The writer will support nested types to our heart's content. Would we need to do anything further to enable this in DataFusion, or can we close this?
Comment from Andy Grove(andygrove) @ 2020-11-28T17:42:28.962+0000:
Thanks [~nevi_me] I filed https://issues.apache.org/jira/browse/ARROW-10761 for the work we need to do in DataFusion
Comment from Andrew Lamb(alamb) @ 2021-04-26T11:23:22.697+0000:
Migrated to github: https://github.com/apache/arrow-rs/issues/39
from datafusion.
Started hacking here: https://github.com/Igosuki/arrow-datafusion/tree/map_access
It works for arrays; I haven't gotten around to doing the physical plan for dictionary yet because of generics.
from datafusion.
Is it expected that datafusion cannot currently read parquets with nested objects at all, even if we never use the column? While attempting to read a parquet that has a nested object, I get an error because `arrow_reader::get_schema` returns `num_fields` counting each nested object as a single field, but `row_groups.columns` is flattened and has more columns than `num_fields` reports.
Should the ability to read parquets with nested objects be implemented (panicking only if a transformation uses that field), or would it be better to just work on this issue as a whole?
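To make the mismatch described above concrete, here is a minimal sketch in plain Rust. It does not use the arrow-rs or parquet APIs: the `Field` enum, `flatten`, and `example_schema` are hypothetical names invented for illustration. It only shows why a schema with one nested object has fewer top-level fields than flattened leaf columns.

```rust
// Hypothetical, simplified model of a nested schema; NOT the arrow-rs API.
enum Field {
    /// A primitive column that is physically stored on its own.
    Leaf(&'static str),
    /// A nested object whose children are stored as separate leaf columns.
    Group(&'static str, Vec<Field>),
}

/// Append the dotted path of every leaf column under `field` to `out`.
fn collect_leaves(field: &Field, prefix: &str, out: &mut Vec<String>) {
    match field {
        Field::Leaf(name) => out.push(format!("{prefix}{name}")),
        Field::Group(name, children) => {
            let nested = format!("{prefix}{name}.");
            for child in children {
                collect_leaves(child, &nested, out);
            }
        }
    }
}

/// Flatten a schema into its leaf column paths.
fn flatten(schema: &[Field]) -> Vec<String> {
    let mut out = Vec::new();
    for field in schema {
        collect_leaves(field, "", &mut out);
    }
    out
}

/// An assumed example schema: one primitive field and one nested object.
fn example_schema() -> Vec<Field> {
    vec![
        Field::Leaf("id"),
        Field::Group("address", vec![Field::Leaf("city"), Field::Leaf("zip")]),
    ]
}

fn main() {
    let schema = example_schema();
    let leaves = flatten(&schema);
    // Two top-level fields, but three flattened leaf columns.
    println!("top-level fields: {}", schema.len());
    println!("leaf columns: {} {:?}", leaves.len(), leaves);
}
```

Any code that indexes the flattened column list with a top-level field count (or the reverse) will disagree as soon as one field is nested, which is consistent with the error reported here.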
from datafusion.
Perhaps we could use #1383 to track the issue of handling stats for nested types in parquet. As for querying nested fields, didn't @Igosuki already add support for this? What is left to be done here?
from datafusion.
Perhaps one of the remaining items would be supporting nested columns in `physical_plan::Statistics`.
from datafusion.
The indexed map access code will work on the plan so the only thing the parquet reader has to do is simply deserialize nested structures recursively.
As long as there is a struct for string keys, or a list for int keys at the corresponding level, it will return the proper column.
@houqp I see that support was added in parquet2 (jorgecarleitao/parquet2#64), though I have not tested the arrow2 branch yet, so it's only a matter of adding it to the reader.
As for arrow-rs, it looks like it's still not done: https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/array_reader.rs#L803
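The lookup rule described above (a struct for string keys, a list for int keys, applied level by level) can be sketched as follows. This is an illustration in plain Rust under stated assumptions, not the code from the `map_access` branch: `Value`, `Key`, `resolve`, and `sample` are hypothetical names, and real Arrow arrays are columnar rather than per-row values.

```rust
use std::collections::BTreeMap;

// Hypothetical value model for illustration; NOT a DataFusion or Arrow type.
#[derive(Debug, PartialEq)]
enum Value {
    Int(i64),
    Text(String),
    List(Vec<Value>),
    Struct(BTreeMap<String, Value>),
}

/// One step of an access path: a string key or an integer index.
enum Key<'a> {
    Name(&'a str),
    Index(usize),
}

/// Walk `path` through `value` one level at a time.
/// Returns None if a key kind does not match the container at that level,
/// or if the named field or index does not exist.
fn resolve<'v>(value: &'v Value, path: &[Key<'_>]) -> Option<&'v Value> {
    match path.split_first() {
        None => Some(value),
        Some((first, rest)) => {
            let next = match (value, first) {
                // String keys are only valid against a struct.
                (Value::Struct(fields), Key::Name(name)) => fields.get(*name)?,
                // Integer keys are only valid against a list.
                (Value::List(items), Key::Index(i)) => items.get(*i)?,
                _ => return None,
            };
            resolve(next, rest)
        }
    }
}

/// An assumed example row with one nested struct and one list.
fn sample() -> Value {
    let mut address = BTreeMap::new();
    address.insert("city".to_string(), Value::Text("Paris".to_string()));
    let mut row = BTreeMap::new();
    row.insert("address".to_string(), Value::Struct(address));
    row.insert(
        "scores".to_string(),
        Value::List(vec![Value::Int(7), Value::Int(9)]),
    );
    Value::Struct(row)
}

fn main() {
    let row = sample();
    println!("{:?}", resolve(&row, &[Key::Name("address"), Key::Name("city")]));
    println!("{:?}", resolve(&row, &[Key::Name("scores"), Key::Index(1)]));
}
```

The point of the sketch is that the access logic only needs the right container kind at each level; the reader's job is then limited to producing those nested structures.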
from datafusion.
I believe it still cannot process everything. I was reading a parquet file through `ArrowRecordBatchReader`, and when trying to collect to `Vec<RecordBatch>` I still get the error that says `data type Json not supported in nested map for json writer`.
I looked it up further, and found the code originating from within `arrow-json/src/writer.rs:351`, where we compare the data type of keys with `Utf8` and subsequently return an error.
from datafusion.