
Comments (5)

j-bennet commented on August 14, 2024

@yohplala

  • while you refer in a generic manner to Arrow data types, the PR you point to refers specifically to Arrow string, right?

Yes, strings being the most valuable use case, we wanted to make sure we can support them in dask as soon as possible. But eventually, we'll want to support other arrow-backed data types.

Thank you @yohplala, I don't know how I missed that one - probably searched for pyarrow vs arrow.

from fastparquet.

yohplala commented on August 14, 2024

Hi @j-bennet,
@martindurant will obviously be of more help, but meanwhile, I understand that:

  • while you refer in a generic manner to Arrow data types, the PR you point to refers specifically to Arrow string, right?
  • the Arrow string type, known to be lighter than Python strings, was already raised in issue #640

Best,


martindurant commented on August 14, 2024

It is true that we don't know our userbase very well, but I think it is fair to say that most people use fastparquet over arrow because they don't want to install both. There remain a few features that we have and they don't, but not many.

As @yohplala points out, generic (nested, variable-length) arrow support is possible but complex and I don't think on anyone's map - especially since the main use case is via pandas, which has nothing useful it can do with those types. If we were to do this, it would be more likely for awkward, which has very similar layout in memory, but significantly better API to deal with. (awkward also has some more flexibility, allowing start/stop indexing into a buffer rather than just offsets, for instance)
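
To make the offsets versus start/stop distinction concrete, here is a minimal pure-Python sketch. It is only an illustration of the two indexing schemes, not how Arrow or awkward are actually implemented; the buffer contents and the helper names (`strings_from_offsets`, `strings_from_starts_stops`) are made up for this example.

```python
# One shared byte buffer holding all the string data back to back.
data = b"applefigkiwi"

# Offsets layout (Arrow-style): n + 1 non-decreasing offsets.
# String i occupies data[offsets[i]:offsets[i + 1]].
offsets = [0, 5, 8, 12]


def strings_from_offsets(data, offsets):
    """Decode every string from consecutive pairs of offsets."""
    return [
        data[offsets[i]:offsets[i + 1]].decode("utf-8")
        for i in range(len(offsets) - 1)
    ]


# Start/stop layout: each item carries its own bounds, so items can be
# reordered, skipped, or overlapping without copying the buffer.
starts = [8, 0, 0]
stops = [12, 5, 3]


def strings_from_starts_stops(data, starts, stops):
    """Decode every string from its own (start, stop) pair."""
    return [data[a:b].decode("utf-8") for a, b in zip(starts, stops)]


print(strings_from_offsets(data, offsets))              # ['apple', 'fig', 'kiwi']
print(strings_from_starts_stops(data, starts, stops))   # ['kiwi', 'apple', 'app']
```

With plain offsets, the items must tile the buffer in order; with separate starts and stops, the same buffer can back a reordered or sliced view.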

As for strings, there is indeed a bigger possible improvement to be made, and pandas is already much better integrated with the arrow backend (for string->string and string->simple type operations). Nevertheless, I don't think we're going to work on this and a possible/optional arrow dependency without significant appetite from the community.


yohplala commented on August 14, 2024

Hello @martindurant
I hope you are well. It has been some time since I was last around here.
You mentioned previously:

It is true that we don't know our userbase very well, but I think it is fair to say that most people use fastparquet over arrow because they don't want to install both. There remain a few features that we have and they don't, but not many.

Martin, you may have spotted this comment about PyArrow in the latest pandas 2.2 release notes:

PyArrow will become a required dependency with pandas 3.0 to accommodate this change.

I am raising this because it could have an impact on the direction fastparquet follows, if future developments are intended. Unfortunately, I have no time to contribute at the moment. But I do appreciate that fastparquet is open to contributions.
For future evolutions/contributions possibly related to or interacting with arrow, maybe there is some "general direction" to be identified? For instance:

  • 1/ fastparquet could be favored over arrow because it is "more" open and implements some specificities not in PyArrow? But it could also rely on arrow, as pandas does (or on arrow through pandas)?
  • 2/ or fastparquet could keep its specificity of doing without arrow. But then, if pandas relies on arrow even more, should fastparquet try to favor full numpy use without pandas, and then offer numpy to pandas conversion only when pandas is installed?
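
As a rough illustration of what option 2 could look like, here is a sketch of a conversion that only goes through pandas when pandas is installed. This is not existing fastparquet code: `pandas_available` and `maybe_to_pandas` are hypothetical names, and plain lists stand in for numpy arrays so the sketch has no dependencies.

```python
import importlib.util


def pandas_available() -> bool:
    """Return True if pandas can be found in this environment."""
    return importlib.util.find_spec("pandas") is not None


def maybe_to_pandas(columns):
    """Hypothetical helper: wrap a dict of columns in a DataFrame if possible.

    `columns` maps column names to equal-length array-likes (numpy arrays in
    the scenario discussed here; plain lists also work). If pandas is not
    installed, the dict is returned unchanged.
    """
    if pandas_available():
        import pandas as pd  # imported lazily, so pandas stays optional

        return pd.DataFrame(columns)
    return columns


result = maybe_to_pandas({"id": [1, 2, 3], "name": ["a", "b", "c"]})
print(type(result).__name__)
```

The lazy import keeps pandas out of the import path entirely for users who only want the dict of arrays.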

No definitive answer is really requested; this can be thought through later, when the question arises in a real use case. But I think it may help to identify the options now.

I read this article about arrow in pandas, and it seems that arrow may "naturally" impose itself gradually. So option 1 possibly makes more sense.


martindurant commented on August 14, 2024

PyArrow will become a required dependency with pandas 3.0 to accommodate this change.

This could well mean the deprecation of this whole project, since there will be no one that doesn't have arrow, and we can't offer enough benefits. We'll see if there is any demand.

implements some specificities not in PyArrow

Not really, not any more. We can make metadata easier to manipulate, as you have done.

should fastparquet try to favor full numpy use without pandas

This is an interesting idea. The core.py functions actually do support a dict of numpy arrays, and variable bytes/UTF8 support is coming to numpy (https://numpy.org/neps/nep-0055-string_dtype.html). I think it would be a decent amount of effort to have writer and ParquetFile work like that, and of course existing users expecting pandas would be disappointed.

arrow may "naturally" impose itself gradually

We appear to be nearing the final stages of that.

