Comments (4)
Is the roundtrip successful for a simple index, or for a column? If so, that at least gives you a workaround.
from fastparquet.
Yes, there are workarounds.
I saw that there were a lot of issues reported in the past regarding MultiIndex. It seems there is a fundamental problem with it. Looking inside the .parquet file metadata I see this:
MultiIndex (as in the example):
{
"columns": [
{
"field_name": "time",
"metadata": { "num_categories": 2, "ordered": false },
"name": "time",
"numpy_type": "int8",
"pandas_type": "categorical"
},
{
"field_name": "seq",
"metadata": { "num_categories": 2, "ordered": false },
"name": "seq",
"numpy_type": "int8",
"pandas_type": "categorical"
}
],
"index_columns": ["time", "seq"]
}
Simple index:
{
"columns": [
{
"field_name": "time",
"metadata": { "timezone": "UTC" },
"name": "time",
"numpy_type": "datetime64[ns, UTC]",
"pandas_type": "datetimetz"
}
],
"index_columns": ["time"]
}
I guess the fact that the "time" column is a DatetimeIndex is stored somewhere else, because according to the metadata it is just an int8.
Interestingly, pyarrow restores the timezone when importing the MultiIndex exported by fastparquet. Or maybe they always set the timezone to UTC:
pd.read_parquet("data.parquet", engine="pyarrow")
pq.read_table("data.parquet").to_pandas()
Or maybe they always set the timezone to UTC
The parquet standard changed its mind on this one, so it's possible. Parquet only supports UTC or no timezone (~local).
I guess the fact that the "time" column is a DatetimeIndex is stored somewhere else, because according to the metadata it is just an int8
Right, this is the actual type that pandas uses. Multi-indexes are implemented almost exactly like categoricals.
In [102]: print(pf.schema)
- schema:
| - val: INT64, OPTIONAL
| - time: INT64, TIMESTAMP[NANOS], OPTIONAL
| - seq: INT64, OPTIONAL
You see that the final dtype is here. But you are quite right: the timezone should appear in the pandas metadata section, I'm just not sure where!
If you write with arrow, you get
{'index_columns': ['time', 'seq'],
'column_indexes': [{'name': None,
'field_name': None,
'pandas_type': 'unicode',
'numpy_type': 'object',
'metadata': {'encoding': 'UTF-8'}}],
'columns': [{'name': 'val',
'field_name': 'val',
'pandas_type': 'int64',
'numpy_type': 'int64',
'metadata': None},
{'name': 'time',
'field_name': 'time',
'pandas_type': 'datetimetz',
'numpy_type': 'datetime64[ns]',
'metadata': {'timezone': 'UTC'}},
{'name': 'seq',
'field_name': 'seq',
'pandas_type': 'int64',
'numpy_type': 'int64',
'metadata': None}]}
But keeping hold of the category labels and the int dtype is actually very useful (the fastparquet version loads faster than arrow's because of this).
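The labels-plus-codes layout mentioned above can be seen in pure pandas; a sketch with a toy index (the index contents are illustrative only):

```python
import pandas as pd

# A MultiIndex keeps, per level, the distinct labels plus integer codes
# pointing into them -- the same layout a Categorical uses, which is why
# the metadata above shows int8 codes per level. Toy index only.
mi = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["time", "seq"])
print(list(mi.levels[0]))  # ['a', 'b'] -- the distinct labels
print(mi.codes[0].dtype)   # int8 -- compact codes into the labels
```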