Comments (5)
- while you refer in a generic manner to Arrow data types, the PR you point to refers specifically to Arrow string, right?
Yes, strings being the most valuable use case, we wanted to make sure we can support them in dask
as soon as possible. But eventually, we'll want to support other arrow-backed data types.
- the struct Arrow string, known to be lighter than python string, is already spotted in issue arrow string type?? #640
Thank you @yohplala, I don't know how I missed that one - probably searched for pyarrow vs arrow.
from fastparquet.
Hi @j-bennet ,
@martindurant will obviously be of more help, but meanwhile, I understand that:
- while you refer in a generic manner to Arrow data types, the PR you point to refers specifically to Arrow string, right?
- the struct Arrow string, known to be lighter than python string, is already spotted in issue #640
Bests,
from fastparquet.
It is true that we don't know our userbase very well, but I think it is fair to say that most people use fastparquet over arrow because they don't want to install both. There remain a few features that we have and they don't, but not many.
As @yohplala points out, generic (nested, variable-length) arrow support is possible but complex, and I don't think it is on anyone's roadmap - especially since the main use case is via pandas, which has nothing useful it can do with those types. If we were to do this, it would more likely be for awkward, which has a very similar layout in memory but a significantly better API for dealing with it. (awkward also has some more flexibility, allowing start/stop indexing into a buffer rather than just offsets, for instance)
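To make the "offsets" versus "start/stop" distinction above concrete, here is a minimal, standard-library-only sketch. It is not how arrow, awkward, or fastparquet implement their buffers; the function names (`encode_offsets`, `get_by_offsets`, `get_by_start_stop`) are made up for illustration, and the 64-bit index type is just a choice for the sketch.

```python
from array import array


def encode_offsets(strings):
    """Pack strings into one UTF-8 buffer plus n+1 monotonically increasing offsets."""
    data = bytearray()
    offsets = array("q", [0])  # typecode "q" (64-bit) chosen for this sketch only
    for s in strings:
        data += s.encode("utf-8")
        offsets.append(len(data))  # end of this string == start of the next one
    return bytes(data), offsets


def get_by_offsets(data, offsets, i):
    """Element i lives between consecutive offsets; elements must be contiguous."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


def get_by_start_stop(data, starts, stops, i):
    """Element i has its own (start, stop) pair; order and gaps are unconstrained."""
    return data[starts[i]:stops[i]].decode("utf-8")


words = ["parquet", "arrow", "numpy"]
data, offsets = encode_offsets(words)

# With separate start/stop arrays we can reorder or drop elements
# without copying or rewriting the underlying byte buffer.
starts = array("q", [offsets[2], offsets[0]])
stops = array("q", [offsets[3], offsets[1]])
reordered = [get_by_start_stop(data, starts, stops, i) for i in range(len(starts))]
```

With a single offsets array, selecting "numpy" then "parquet" would require building a new byte buffer; with start/stop pairs, only the two small index arrays change.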
As for strings, there is indeed a bigger possible improvement to be made, and pandas is already much better integrated with the arrow backend (for string->string and string->simple type operations). Nevertheless, I don't think we're going to work on this and a possible/optional arrow dependency without significant appetite from the community.
from fastparquet.
Hello @martindurant
I hope you are well. It has been some time since I was last around here.
You mentioned previously:
It is true that we don't know our userbase very well, but I think it is fair to say that most people use fastparquet over arrow because they don't want to install both. There remain a few features that we have and they don't, but not many.
Martin, you may have spotted in the last pandas 2.2 release notes this comment about PyArrow.
PyArrow will become a required dependency with pandas 3.0 to accommodate this change.
I am raising this because it could have an impact on the direction fastparquet follows, if future developments are intended. Unfortunately, I have no time to contribute at the moment. But I do appreciate that fastparquet is open to contributions.
For future evolutions/contributions possibly related to or interacting with arrow, maybe there is some "general direction" to be identified? For instance:
- 1/ fastparquet could be favored over arrow because it is "more" open and implements some specificities not in PyArrow? But it could rely on arrow as well as pandas does? (or on arrow through pandas)
- 2/ or fastparquet could keep its specificity to do without arrow, but then if pandas is using it even more, should fastparquet try to favor full numpy use without pandas, and then offer numpy to pandas conversion only when pandas is installed?
No definitive answer is really requested; it can be thought through later on, when the question arises in a real use case. But I am thinking it may help to identify it.
I read this article about arrow in pandas, and it seems like arrow may "naturally" impose itself gradually. So possibly option 1 makes more sense.
from fastparquet.
PyArrow will become a required dependency with pandas 3.0 to accommodate this change.
This could well mean the deprecation of this whole project, since there will be no one that doesn't have arrow, and we can't offer enough benefits. We'll see if there is any demand.
implements some specificities not in PyArrow
Not really, not any more. We can make metadata easier to manipulate, as you have done.
should fastparquet try to favor full numpy use without pandas
This is an interesting idea. The core.py functions actually do support a dict of numpy arrays, and variable bytes/UTF8 support is coming to numpy (https://numpy.org/neps/nep-0055-string_dtype.html). I think it would be a decent amount of effort to have writer and ParquetFile work like that, and of course existing users expecting pandas would be disappointed.
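As a rough illustration of the "numpy first, pandas only when installed" idea, the sketch below shows one way an optional conversion could be gated. This is not fastparquet's API: `has_module` and `finalize` are hypothetical names, and plain lists stand in for the numpy arrays that a real reader would return.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def finalize(columns, prefer_pandas=True):
    """Return a DataFrame if pandas is available and wanted, else the raw dict.

    `columns` maps column name -> sequence of values (hypothetical stand-in
    for a dict of numpy arrays).
    """
    if prefer_pandas and has_module("pandas"):
        import pandas as pd  # imported lazily, so pandas stays optional

        return pd.DataFrame(columns)
    return columns


columns = {"id": [1, 2, 3], "name": ["a", "b", "c"]}
result = finalize(columns)  # DataFrame if pandas is installed, otherwise the dict
```

The trade-off noted above still applies: callers that expect a DataFrame unconditionally would have to handle the dict case when pandas is absent.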
arrow may "naturally" impose itself gradually
We appear to be nearing the final stages of that.
from fastparquet.