Comments (4)
Is the roundtrip successful for a simple index, or for a column? If so, that at least gives you a workaround.
from fastparquet.
Yes, there are workarounds.
I saw that there were a lot of issues reported in the past regarding MultiIndex. It seems there is a fundamental problem with it. Looking inside the .parquet file metadata I see this:
MultiIndex (as in the example):
{
"columns": [
{
"field_name": "time",
"metadata": { "num_categories": 2, "ordered": false },
"name": "time",
"numpy_type": "int8",
"pandas_type": "categorical"
},
{
"field_name": "seq",
"metadata": { "num_categories": 2, "ordered": false },
"name": "seq",
"numpy_type": "int8",
"pandas_type": "categorical"
}
],
"index_columns": ["time", "seq"]
}
Simple index:
{
"columns": [
{
"field_name": "time",
"metadata": { "timezone": "UTC" },
"name": "time",
"numpy_type": "datetime64[ns, UTC]",
"pandas_type": "datetimetz"
}
],
"index_columns": ["time"]
}
I guess the fact that the "time" column is a DatetimeIndex is stored somewhere else, because according to the metadata it is just an int8.
Interestingly, pyarrow restores the timezone when importing the MultiIndex exported by fastparquet. Or maybe they always set the timezone to UTC:
pd.read_parquet("data.parquet", engine="pyarrow")
pq.read_table("data.parquet").to_pandas()
Or maybe they always set the timezone to UTC
The parquet standard changed its mind on this one, so it's possible. Parquet only supports UTC or no timezone (~local).
I guess the fact that the "time" column is a DatetimeIndex is stored somewhere else, because according to the metadata it is just an int8
Right, this is the actual type that pandas uses. Multi-indexes are implemented almost exactly like categoricals.
In [102]: print(pf.schema)
- schema:
| - val: INT64, OPTIONAL
| - time: INT64, TIMESTAMP[NANOS], OPTIONAL
| - seq: INT64, OPTIONAL
You see that the final dtype is here. But you are quite right: the timezone should appear in the pandas metadata section, I'm just not sure where!
If you write with arrow, you get
{'index_columns': ['time', 'seq'],
'column_indexes': [{'name': None,
'field_name': None,
'pandas_type': 'unicode',
'numpy_type': 'object',
'metadata': {'encoding': 'UTF-8'}}],
'columns': [{'name': 'val',
'field_name': 'val',
'pandas_type': 'int64',
'numpy_type': 'int64',
'metadata': None},
{'name': 'time',
'field_name': 'time',
'pandas_type': 'datetimetz',
'numpy_type': 'datetime64[ns]',
'metadata': {'timezone': 'UTC'}},
{'name': 'seq',
'field_name': 'seq',
'pandas_type': 'int64',
'numpy_type': 'int64',
'metadata': None}]}
But keeping hold of the category labels and the int dtype is actually very useful (the fastparquet version loads faster than arrow's because of this).
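The labels-plus-codes layout mentioned above can be seen in pure pandas; a sketch with a toy index (the index contents are illustrative only):

```python
import pandas as pd

# A MultiIndex keeps, per level, the distinct labels plus integer codes
# pointing into them -- the same layout a Categorical uses, which is why
# the metadata above shows int8 codes per level. Toy index only.
mi = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["time", "seq"])
print(list(mi.levels[0]))  # ['a', 'b'] -- the distinct labels
print(mi.codes[0].dtype)   # int8 -- compact codes into the labels
```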