
Comments (4)

martindurant commented on August 14, 2024

Is the roundtrip successful for a simple index, or for a column? If so, that at least gives you a workaround.

from fastparquet.

2-5 commented on August 14, 2024

Yes, there are workarounds.

I saw that there were a lot of issues reported in the past regarding MultiIndex. It seems there is a fundamental problem with it. Looking inside the .parquet file metadata I see this:

MultiIndex, as in the example

{
  "columns": [
    {
      "field_name": "time",
      "metadata": { "num_categories": 2, "ordered": false },
      "name": "time",
      "numpy_type": "int8",
      "pandas_type": "categorical"
    },
    {
      "field_name": "seq",
      "metadata": { "num_categories": 2, "ordered": false },
      "name": "seq",
      "numpy_type": "int8",
      "pandas_type": "categorical"
    }
  ],
  "index_columns": ["time", "seq"]
}

Simple index

{
  "columns": [
    {
      "field_name": "time",
      "metadata": { "timezone": "UTC" },
      "name": "time",
      "numpy_type": "datetime64[ns, UTC]",
      "pandas_type": "datetimetz"
    }
  ],
  "index_columns": ["time"]
}

I guess the fact that the "time" column is a DatetimeIndex is stored somewhere else, because according to the metadata it is just an int8.
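A small sketch for comparing the two dumps above side by side. It uses only the standard library; `describe_index_columns` is a hypothetical helper name, and the two dicts are copied from the metadata shown in this comment. It assumes every entry in `index_columns` is a plain column name, which may not hold for every file (pandas can also record a RangeIndex there as a dict).

```python
# Metadata copied from the two dumps in this comment.
MULTI_INDEX_META = {
    "columns": [
        {"field_name": "time", "metadata": {"num_categories": 2, "ordered": False},
         "name": "time", "numpy_type": "int8", "pandas_type": "categorical"},
        {"field_name": "seq", "metadata": {"num_categories": 2, "ordered": False},
         "name": "seq", "numpy_type": "int8", "pandas_type": "categorical"},
    ],
    "index_columns": ["time", "seq"],
}

SIMPLE_INDEX_META = {
    "columns": [
        {"field_name": "time", "metadata": {"timezone": "UTC"},
         "name": "time", "numpy_type": "datetime64[ns, UTC]",
         "pandas_type": "datetimetz"},
    ],
    "index_columns": ["time"],
}


def describe_index_columns(pandas_meta):
    """Return (name, pandas_type, numpy_type, timezone) per index column.

    Hypothetical helper: entries of index_columns that are not plain
    strings (for example a RangeIndex descriptor) are skipped.
    """
    by_name = {col["name"]: col for col in pandas_meta["columns"]}
    rows = []
    for name in pandas_meta["index_columns"]:
        if not isinstance(name, str) or name not in by_name:
            continue
        col = by_name[name]
        # "metadata" can be None, so fall back to an empty dict.
        tz = (col.get("metadata") or {}).get("timezone")
        rows.append((name, col["pandas_type"], col["numpy_type"], tz))
    return rows


for label, meta in [("multi", MULTI_INDEX_META), ("simple", SIMPLE_INDEX_META)]:
    for row in describe_index_columns(meta):
        print(label, row)
```

For the MultiIndex dump, no index column carries a `timezone` entry; for the simple index, `time` does.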


2-5 commented on August 14, 2024

Interestingly, pyarrow restores the timezone when importing the MultiIndex exported by fastparquet. Or maybe they always set the timezone to UTC:

import pandas as pd
import pyarrow.parquet as pq

pd.read_parquet("data.parquet", engine="pyarrow")
pq.read_table("data.parquet").to_pandas()


martindurant commented on August 14, 2024

> Or maybe they always set the timezone to UTC

The parquet standard changed its mind on this one, so it's possible. Parquet only supports UTC or no timezone (~local).
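To illustrate why the original zone cannot be recovered from a UTC-adjusted timestamp alone, here is a minimal standard-library sketch. `to_utc_nanos` is a hypothetical helper, not part of fastparquet or pyarrow, and the dates and fixed-offset zone are made up for the example. Two datetimes that name the same instant in different zones collapse to the same integer, so the zone name has to travel separately (for example in the pandas metadata).

```python
from datetime import datetime, timedelta, timezone


def to_utc_nanos(ts):
    """Nanoseconds since the Unix epoch for a timezone-aware datetime."""
    if ts.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    delta = ts - epoch  # aware subtraction compares instants, not wall clocks
    # Integer arithmetic avoids float rounding at nanosecond scale.
    seconds = delta.days * 86_400 + delta.seconds
    return seconds * 1_000_000_000 + delta.microseconds * 1_000


# Hypothetical fixed-offset zone, one hour ahead of UTC.
plus_one = timezone(timedelta(hours=1), "UTC+1")

local = datetime(2024, 1, 1, 13, 0, tzinfo=plus_one)
utc = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# Same instant, same stored integer: the zone itself is not encoded.
print(to_utc_nanos(local) == to_utc_nanos(utc))
```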

> I guess the fact that the "time" column is a DatetimeIndex is stored somewhere else, because according to the metadata it is just an int8

Right, this is the actual type that pandas uses. Multi-indexes are implemented almost exactly like categoricals.

In [102]: print(pf.schema)
- schema:
| - val: INT64, OPTIONAL
| - time: INT64, TIMESTAMP[NANOS], OPTIONAL
  - seq: INT64, OPTIONAL

You can see that the final dtype is here. But you are quite right, the timezone should appear in the pandas metadata section, I'm just not sure where!

If you write with arrow, you get

{'index_columns': ['time', 'seq'],
 'column_indexes': [{'name': None,
   'field_name': None,
   'pandas_type': 'unicode',
   'numpy_type': 'object',
   'metadata': {'encoding': 'UTF-8'}}],
 'columns': [{'name': 'val',
   'field_name': 'val',
   'pandas_type': 'int64',
   'numpy_type': 'int64',
   'metadata': None},
  {'name': 'time',
   'field_name': 'time',
   'pandas_type': 'datetimetz',
   'numpy_type': 'datetime64[ns]',
   'metadata': {'timezone': 'UTC'}},
  {'name': 'seq',
   'field_name': 'seq',
   'pandas_type': 'int64',
   'numpy_type': 'int64',
   'metadata': None}],

but keeping hold of the category labels and int dtype is actually very useful (the fastparquet version loads faster than arrow's because of this)
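As a rough picture of the "labels plus integer codes" layout discussed above, here is a pure-Python sketch. It is an illustration only, not how pandas or fastparquet implement it: `factorize` is a hypothetical helper, the sample values are invented, and labels are kept in first-seen order. In pandas itself, the corresponding pieces of a MultiIndex are exposed as `MultiIndex.levels` and `MultiIndex.codes`.

```python
def factorize(values):
    """Split values into (codes, labels).

    labels: distinct values in first-seen order.
    codes:  for each input value, its integer position in labels.
    """
    labels = []
    position = {}
    codes = []
    for value in values:
        if value not in position:
            position[value] = len(labels)
            labels.append(value)
        codes.append(position[value])
    return codes, labels


# Invented sample data for one index level with two distinct values.
times = [
    "2024-01-01T00:00Z",
    "2024-01-01T00:00Z",
    "2024-01-02T00:00Z",
    "2024-01-02T00:00Z",
]

codes, labels = factorize(times)
print(codes)        # small integers, one per row
print(len(labels))  # number of distinct labels

# The original values can be rebuilt from the two pieces.
restored = [labels[c] for c in codes]
```

With only two distinct labels, the codes fit comfortably in an 8-bit integer, which matches the `"numpy_type": "int8"` and `"num_categories": 2` entries in the fastparquet metadata earlier in the thread.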
