For Pandas Series using the StringDtype extension ty

Memory spike from converting Pandas StringDtype to Numpy unicode array (fastparquet issue, closed)

mshober commented on August 14, 2024

Memory spike from converting Pandas StringDtype to Numpy unicode array.


Comments (3)

martindurant commented on August 14, 2024

I think you are right, and your use case of a few very long strings hasn't typically come up. I'm not sure parquet is a particularly optimised storage format for that. In any case, it should not cause a copy and memory spike.
However, I don't think that .astype(str) should make a fixed-length array, or indeed any copy at all in normal circumstances. Is there something else special about your data? It may also depend on pandas version.

All that being said, I don't see any downside to using .str.len(), so I would welcome a PR with this change.
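To make the trade-off concrete, here is a minimal sketch using made-up data: a thousand one-character strings plus one 10,000-character string. It builds the fixed-width NumPy unicode array explicitly, rather than relying on what `.astype(str)` does internally (which seems to vary by pandas version), and compares that with `.str.len()`. The variable names are illustrative and not taken from fastparquet.

```python
import numpy as np
import pandas as pd

# Hypothetical data: many short strings and a single very long one.
s = pd.Series(["a"] * 1_000 + ["x" * 10_000], dtype="string")

# A fixed-width unicode array pads every element to the longest string,
# at 4 bytes per character.
fixed = np.array(s.tolist(), dtype=str)
fixed_bytes = fixed.nbytes

# .str.len() computes per-element lengths without building the padded array.
max_len = int(s.str.len().max())

print(fixed.dtype, fixed_bytes, max_len)
```

With this data the padded array needs 1,001 × 10,000 × 4 bytes, about 40 MB, even though the strings themselves hold only around 11,000 characters.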


martindurant commented on August 14, 2024

Actually, I think I stand corrected, and pandas will tend to copy unless one were to explicitly opt against it, and the dtype matches exactly (and str is not object). Only this is fast:

x.astype(x.dtype, copy=False)

(as it doesn't really do anything) and everything else requires copies.
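One way to check this on a given installation is sketched below. It assumes an object-dtype Series for simplicity, since `to_numpy()` then exposes the underlying buffer directly; the result may differ between pandas versions, and newer releases may warn about the `copy` keyword.

```python
import numpy as np
import pandas as pd

# Assumption: object dtype, chosen so the underlying buffer is easy to inspect.
x = pd.Series(["alpha", "beta", "gamma"], dtype=object)

# Same dtype with copy=False: the case described above as the cheap one.
same = x.astype(x.dtype, copy=False)

# Report whether the two Series share a buffer on this pandas version.
shared = np.shares_memory(x.to_numpy(), same.to_numpy())
print(pd.__version__, same.dtype, shared)
```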


mshober commented on August 14, 2024

While the result of pd.Series(["value"], dtype="string").astype(str) is a Series of dtype object, under the hood this operation creates a numpy unicode array at some point. Not sure why Pandas does that; ~~I may open up a ticket with Pandas too just to make sure that's not a bug~~. Looks like this issue does not occur with Pandas 2.0.
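Since the behaviour appears to be version-dependent, a rough way to check a given installation is to compare peak allocations with the standard library's `tracemalloc`. This is only a sketch: `peak_bytes` is a hypothetical helper, the data is made up, and the numbers will vary by pandas and NumPy version.

```python
import tracemalloc

import pandas as pd


def peak_bytes(fn):
    """Run fn() and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


# Hypothetical data: many short strings and a single very long one.
s = pd.Series(["a"] * 1_000 + ["x" * 10_000], dtype="string")

print("pandas", pd.__version__)
print("result dtype:", s.astype(str).dtype)
print("astype(str) peak:", peak_bytes(lambda: s.astype(str)))
print("str.len() peak:  ", peak_bytes(lambda: s.str.len()))
```

A peak for `astype(str)` close to the row count times the longest string length times 4 bytes would suggest that a fixed-width unicode array is being created along the way.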
