Comments (8)
The argument row_group_offsets gives you control over how big the row groups are. The default is geared towards a "tall and narrow" table layout of the sort parquet was designed for.
> another non-python language which doesn't support that structure
I'm surprised if there are frameworks that wouldn't be able to read this output.
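To illustrate the argument: an integer caps the number of rows per row group, while a list gives explicit starting row indices. A minimal sketch (the DataFrame and the row counts are placeholders):

import pandas as pd
import fastparquet

df = pd.DataFrame({"a": range(1_000_000), "b": range(1_000_000)})

# An integer caps the number of rows per row group...
fastparquet.write("out.parquet", df, row_group_offsets=100_000)

# ...or a list gives the starting row index of each group explicitly.
fastparquet.write("out.parquet", df, row_group_offsets=[0, 250_000, 500_000, 750_000])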
Correct, it is not possible to change the data types in the stored thrift object - at least not if you want the file to be readable by anything other than fastparquet. When using write(), you can specify how many rows go into each row group, so I suggest you use a smaller value. It might be reasonable for fastparquet to be able to guess a good number for this.
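To sketch what such a guess might look like: rows_per_group below is a hypothetical helper, not part of fastparquet's API, and the 1 GiB budget and 32-byte average string width are assumptions.

import pandas as pd

def rows_per_group(df: pd.DataFrame, target_bytes: int = 2**30,
                   assumed_str_width: int = 32) -> int:
    # Hypothetical helper (not fastparquet API): pick a per-group row count
    # so the widest column's chunk stays well below the int32 byte limit.
    def col_width(series: pd.Series) -> int:
        if series.dtype == object:
            return assumed_str_width   # assumed average width for strings
        return series.dtype.itemsize   # fixed-width numeric/datetime dtypes
    widest = max(col_width(df[col]) for col in df.columns)
    return max(1, target_bytes // widest)

# usage: fastparquet.write(fname, df, row_group_offsets=rows_per_group(df))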
@martindurant adjusting the row group size doesn't seem to make a difference; the number that ends up exceeding the int32 threshold seems linked to the total size of the parquet file, and not any of the individual write sizes.
The total footer size (not data size) is given by a 4-byte value, so it's not that. I could check the various byte offsets, but I think they are all 64-bit (or var-int). I suppose your guess at the page header is correct, in which case chopping the data into row-groups (each of which contains one page per column) should help. You could also try using "hive" style output, in which each row group becomes a separate file.
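For reference, a minimal sketch of the "hive"-style suggestion (the directory name and row-group size are placeholders):

import fastparquet

# file_scheme="hive" writes one part-file per row group under the given
# directory, plus a _metadata summary file; readers open the directory.
fastparquet.write("dataset_dir", df, file_scheme="hive",
                  row_group_offsets=100_000)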
thanks so much for the help @martindurant; I'm aware of 'hive' style output but am trying to avoid using it to facilitate ease-of-read in another non-python language which doesn't support that structure. I was surprised by this part of your answer:
> in which case chopping the data into row-groups (each of which contains one page per column) should help
...as I thought that's what was already happening under the hood when I call fastparquet.write. Am I wrong in thinking so? In general, the code I'm using to write to the file is as follows:
fastparquet.write(
    filename=fname,
    data=df,
    append=os.path.exists(fname),
)
Is there another API that I would have to use such that it will write row groups instead?
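For what it's worth, write() accepts row_group_offsets alongside append, so the same pattern with explicit row-group sizing might look like this (reusing fname and df from the snippet above; the 100,000 figure is a placeholder):

import os
import fastparquet

fastparquet.write(
    filename=fname,
    data=df,
    row_group_offsets=100_000,   # cap rows per row group on every write
    append=os.path.exists(fname),
)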
@martindurant I am relatively new to parquet and still getting the "lay of the land", so to speak, trying to discern what's part of the format specification and which parts are implementation-specific. Anyway, the other language is Go and the other library is parquet-go - afaik it doesn't seem to support any parquet file hierarchy structures such as hive or drill.
Again, thanks so much for the help; I've toyed around with row_group_offsets but it seems to make little difference; perhaps I just haven't found the right value yet(?). I also thought about instantiating a ParquetFile object and using the write_row_groups instance method, but it seemed to me that would be generally the same as what's going on internally when calling fastparquet.write.
"hive" style without partitioning amounts to splitting each row group into a file, but no encoding of information into the path structure. It is certainly worth a try, although I know nothing about parquet-go.
Yes, the writing method(s) on ParquetFile are conveniences for append and alter operations on existing datasets, which use the same code beneath.
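A quick sketch of reading such a "hive"-style dataset back with fastparquet itself ("dataset_dir" is a placeholder; whether parquet-go can do the same is the open question here):

import fastparquet

pf = fastparquet.ParquetFile("dataset_dir")   # point at the directory
print(pf.info)                                # row count, row groups, columns
df_back = pf.to_pandas()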
@martindurant thanks for all the support and prompt communication; closing this issue, as it is not an issue with the lib.
Related Issues (20)
- to_pandas(): cramjam.DecompressionError: snappy: output buffer (size = 262144) is smaller than required (size = 1048576) HOT 1
- BUG: dataframe.empty with non-nano pd.DatetimeTZDtype HOT 2
- a python-3.12 windows wheel HOT 13
- Some `fastparquet`-related tests are failing on Python 3.10 HOT 10
- Regression due to `_from_sequence` HOT 1
- attrs persistence for Pandas HOT 1
- Nullable types for 1 row vs multiple rows HOT 3
- update_file_custom_metadata error when file has no properties.
- schema evolution when writing the row groups does not work HOT 4
- Bug loading parquet files with timezone information HOT 6
- When changing to a larger dtype, its size must be a divisor of the total size in bytes of the last axis of the array HOT 6
- PyArrow will become a required dependency with pandas 3.0
- Option to not close() after write() when writing to buffer HOT 3
- Support zoneinfo.ZoneInfo timezones
- Loading List of List of Strings leads to nans HOT 6
- Upcoming pandas (>2.2.0) raises "read-only" errors HOT 3
- Categorical dtype not preserved with fastparquet-write, pyarrow-read HOT 2
- Numpy 2: OverflowError with int96 HOT 4
- Fastparquet raises on import with numpy 2.0 rc HOT 5
- New release? HOT 4