Comments (12)

tustvold commented on May 23, 2024

I believe this issue can now be closed: as of apache/arrow-rs#2500, parquet has full support for arbitrarily nested types. Feel free to reopen if I have missed something.

from datafusion.

alamb commented on May 23, 2024

> Should the ability to read parquets with nested objects be implemented (only panicking if transformations utilizing that field)

This seems like a valuable addition to me (allowing queries on parquet files that had nested objects but were not read).

> , or would it be better to just work on this issue as a whole?

Well, of course, supporting queries on the data would be better than just not crashing/erroring when they weren't read :) I think the choice of approach is probably best determined by whoever implements this feature.

alamb commented on May 23, 2024

See also the recent blog posts on this topic:

https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/
https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/
https://arrow.apache.org/blog/2022/10/17/arrow-parquet-encoding-part-3/

alamb commented on May 23, 2024

Thank you for the report @ShraddhaKishan -- would it be possible to file a ticket in https://github.com/apache/arrow-rs with a reproducer (or at least the parquet file that can not be read)?

ShraddhaKishan commented on May 23, 2024

Sure thing.

alamb commented on May 23, 2024

Comment from Wes McKinney(wesm) @ 2019-03-14T22:28:39.984+0000:

This is a fairly tricky task (we still don't have this fully done in C++). I'm moving to 0.14 as I expect it to take a little time

Comment from Neville Dipale(nevi_me) @ 2020-11-28T12:41:50.557+0000:

[~andygrove] I'm going through old PRs and closing them. The writer will support nested types to our heart's content, would we need to do anything further to enable this in DataFusion, or can we close this?

Comment from Andy Grove(andygrove) @ 2020-11-28T17:42:28.962+0000:

Thanks [~nevi_me] I filed https://issues.apache.org/jira/browse/ARROW-10761 for the work we need to do in DataFusion

Comment from Andrew Lamb(alamb) @ 2021-04-26T11:23:22.697+0000:

Migrated to github: https://github.com/apache/arrow-rs/issues/39

Igosuki commented on May 23, 2024

Started hacking here: https://github.com/Igosuki/arrow-datafusion/tree/map_access
It works for arrays; I haven't gotten around to doing the physical plan for dictionaries because of generics.

lexi-sh commented on May 23, 2024

Is it expected that datafusion cannot currently read parquets with nested objects at all, even if we never utilize the column? While attempting to read a parquet that has a nested object, I get an error because arrow_reader::get_schema returns num_fields counting nested objects as a single field, but row_groups.columns flattens and has more columns than num_fields returns.
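
To illustrate the mismatch described above, here is a minimal sketch of why the two counts differ. The `Field` enum and the helper functions are hypothetical and simplified; they are not the arrow-rs or parquet types, and real Parquet leaf paths for lists contain extra intermediate levels. The point is only that a nested column counts once at the top level but contributes one entry per leaf when flattened.

```rust
// Hypothetical, simplified schema model (NOT the arrow-rs / parquet API).
enum Field {
    Primitive(&'static str),
    Struct(&'static str, Vec<Field>),
    List(&'static str, Box<Field>),
}

/// Number of top-level fields: a nested field counts once.
fn num_fields(schema: &[Field]) -> usize {
    schema.len()
}

/// Dotted paths of all leaf (primitive) columns, i.e. the flattened view.
fn leaf_paths(schema: &[Field]) -> Vec<String> {
    let mut out = Vec::new();
    for field in schema {
        collect_leaves(field, "", &mut out);
    }
    out
}

fn collect_leaves(field: &Field, prefix: &str, out: &mut Vec<String>) {
    // Join the parent path and the field name with a dot.
    let join = |name: &str| {
        if prefix.is_empty() {
            name.to_string()
        } else {
            format!("{}.{}", prefix, name)
        }
    };
    match field {
        Field::Primitive(name) => out.push(join(name)),
        Field::Struct(name, children) => {
            let path = join(name);
            for child in children {
                collect_leaves(child, &path, out);
            }
        }
        Field::List(name, item) => collect_leaves(item, &join(name), out),
    }
}

/// Illustrative schema: one flat column, one struct, one list.
fn example_schema() -> Vec<Field> {
    vec![
        Field::Primitive("id"),
        Field::Struct(
            "address",
            vec![Field::Primitive("city"), Field::Primitive("zip")],
        ),
        Field::List("tags", Box::new(Field::Primitive("item"))),
    ]
}

fn main() {
    let schema = example_schema();
    println!("top-level fields: {}", num_fields(&schema));
    println!("leaf columns:     {:?}", leaf_paths(&schema));
}
```

In this model the schema has three top-level fields but four leaf columns, so any code that indexes the flattened columns by top-level field position will go out of step as soon as a nested field appears.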

Should the ability to read parquets with nested objects be implemented (only panicking if transformations utilizing that field), or would it be better to just work on this issue as a whole?

houqp commented on May 23, 2024

Perhaps we could use #1383 to track the issue of handling stats for nested types in parquet. For queries on nested fields, didn't @Igosuki already add support for this? What is left to be done here?

houqp commented on May 23, 2024

Perhaps one of the remaining items would be supporting nested columns in physical_plan::Statistics.

Igosuki commented on May 23, 2024

The indexed map access code will work on the plan so the only thing the parquet reader has to do is simply deserialize nested structures recursively.
As long as there is a struct for string keys, or a list for int keys at the corresponding level, it will return the proper column.
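
As a rough sketch of the lookup rule described above (a struct for string keys, a list for integer keys, resolved level by level), the recursion could look like the following. The `Value` and `Key` types and `get_path` are hypothetical and operate on a single row-like value for readability; the actual DataFusion code works on columnar arrays and plan expressions, which this does not attempt to reproduce.

```rust
use std::collections::BTreeMap;

// Hypothetical row-oriented value model, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Text(String),
    Struct(BTreeMap<String, Value>),
    List(Vec<Value>),
}

// One step of an access path: a field name or a list position.
#[derive(Debug, Clone, Copy)]
enum Key<'a> {
    Name(&'a str),
    Index(usize),
}

/// Resolve `path` against `value` one level at a time.
/// A string key needs a struct at that level, an integer key needs a list;
/// any other combination (or a missing entry) yields `None`.
fn get_path<'v>(value: &'v Value, path: &[Key<'_>]) -> Option<&'v Value> {
    match path.split_first() {
        None => Some(value),
        Some((key, rest)) => {
            let child = match (value, key) {
                (Value::Struct(fields), Key::Name(name)) => fields.get(*name)?,
                (Value::List(items), Key::Index(i)) => items.get(*i)?,
                _ => return None,
            };
            get_path(child, rest)
        }
    }
}

/// Illustrative row: { address: { city: "Paris" }, scores: [10, 20] }
fn example_row() -> Value {
    let mut address = BTreeMap::new();
    address.insert("city".to_string(), Value::Text("Paris".to_string()));
    let mut row = BTreeMap::new();
    row.insert("address".to_string(), Value::Struct(address));
    row.insert(
        "scores".to_string(),
        Value::List(vec![Value::Int(10), Value::Int(20)]),
    );
    Value::Struct(row)
}

fn main() {
    let row = example_row();
    println!("{:?}", get_path(&row, &[Key::Name("address"), Key::Name("city")]));
    println!("{:?}", get_path(&row, &[Key::Name("scores"), Key::Index(1)]));
}
```

A key of the wrong kind for a level, such as a string key applied to a list, simply fails to resolve here; how a real implementation reports that case is a separate design decision.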

@houqp I see that support was added in parquet2 (jorgecarleitao/parquet2#64; I have not tested the arrow2 branch yet), so it's only a matter of adding it to the reader.
As for arrow-rs, it looks like it's still not done: https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/array_reader.rs#L803

ShraddhaKishan commented on May 23, 2024

I believe it still cannot process everything. I was reading a parquet file through ArrowRecordBatchReader and when trying to collect to Vec<RecordBatch> I still get the error that says data type Json not supported in nested map for json writer.

I looked it up further, and found the code originating from within arrow-json/src/writer.rs:351 where we compare the data type of keys with Utf8 and subsequently return an error.
