<a href="https://www.pjm.com/-/media/markets-ops/ftr/annual-ftr-auction/2017-2018/2017

There are 3 issues: 1 issue with milliseconds that has been fi

<a href="https://github.com/ToucanToco/fastexcel/files/14073055/example-skip-rows.xlsx

parameter for dtype override option AND/OR better inference about fastexcel HOT 7 CLOSED

toucantoco commented on September 27, 2024 1

parameter for dtype override option AND/OR better inference

from fastexcel.

Comments (7)

ldacey commented on September 27, 2024 1

You can add those files - the data is all fake but the structure and columns match some real sources we get from some ancient reporting tools.

from fastexcel.

PrettyWood commented on September 27, 2024 1

There are 3 issues:

1 issue with milliseconds that has been fixed in #159
1 issue with dtype guessing tracked by #160
1 feature request: to be able to enforce dtypes tracked by #173

from fastexcel.

ldacey commented on September 27, 2024 1

Looks like all of the issues I ran into have been addressed in 0.10.

The duration, timestamp, and date columns are parsed correctly (had some issues with openpyxl with the duration columns, and issues with the blank rows with fastexcel). The employee ID column also retains the IDs which look like integers instead of replacing them with nulls.

Very cool, thanks for the work on this.

from fastexcel.

ldacey commented on September 27, 2024

example-skip-rows.xlsx

Added a file which represents a related issue. The columns representing hours, minutes, seconds spent in a state are returning as nulls for some reason.

excel_reader = fastexcel.read_excel(local_path)
pl_df = excel_reader.load_sheet(idx_or_name=0, header_row=9).to_polars()

If I read the same file with pandas the values are returned (pd_df = pd.read_excel(local_path, engine="openpyxl", skiprows=10)

And one more example:

test_excel_engine.xlsx
Another issue is that mixed columns which have values that look like numbers are being replaced with nulls.

You can see anyone that has an ID with text values has been replaced by nulls and the column is interpreted as a float

from fastexcel.

lukapeschke commented on September 27, 2024

Thank you for the clean bug report, and especially for the files, they make debugging much easier 🙏

I do indeed reproduce the three issues you're describing. It seems to me that they a re caused by two distinct issues:

For durations, we're we using calamine's as_time API to convert durations to milliseconds, as the as_duration API was not available yet. However, this does not work anymore, as as_time will return None for Duration values. The fix is very easy, and will be out soon
Type coercion is currently done on the first cell of the column only. It should be done on the entire column. The fix for that will be longer, as we need to figure out a way to do it without impacting performance too much.

I'll create two child issues for this.

@deanm0000

There should be a way to override type inference

Could you please create a separate issue for that ? It definitely seems like a reasonable feature to ask for, but it will need some discussion so we can agree on an API.

@ldacey Would it be OK with you if I added your files as test cases to the repo ? Or should I modify the data in them ?

from fastexcel.

ldacey commented on September 27, 2024

Hi - 0.9.0 works great for most of the files I needed to process but there is something odd going on if there are blank rows.

example-skip-rows.xlsx

The first sheet works fine, but the second sheet fails (that reflects how actual data comes in from the source system):

File line 83, in
pl_df = excel_reader.load_sheet(idx_or_name=1, header_row=9).to_polars()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File line 64, in to_polars
df = pl.from_arrow(data=self.to_arrow())
^^^^^^^^^^^^^^^
File line 47, in to_arrow
return self._sheet.to_arrow()
^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Could not create RecordBatch from sheet Sheet2

Caused by:
0: Could not build schema for sheet Sheet2
1: could not figure out column type for following type combination: {Timestamp(Millisecond, None), Utf8}

I could not figure out the precise cause. There is one row per employee ID per date even if they did not log into the system but fastexcel does not seem to like it if the first rows are null for the value columns.

Same issue if I try using to_pandas:

File line 55, in to_pandas
return self.to_arrow().to_pandas()
^^^^^^^^^^^^^^^
File line 47, in to_arrow
return self._sheet.to_arrow()

from fastexcel.

lukapeschke commented on September 27, 2024

Glad to know it works for you!

from fastexcel.

parameter for dtype override option AND/OR better inference about fastexcel HOT 7 CLOSED

Comments (7)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent