Git Product home page Git Product logo

Comments (7)

ldacey avatar ldacey commented on September 27, 2024 1

You can add those files - the data is all fake but the structure and columns match some real sources we get from some ancient reporting tools.

from fastexcel.

PrettyWood avatar PrettyWood commented on September 27, 2024 1

There are 3 issues:

  • 1 issue with milliseconds that has been fixed in #159
  • 1 issue with dtype guessing tracked by #160
  • 1 feature request: to be able to enforce dtypes tracked by #173

from fastexcel.

ldacey avatar ldacey commented on September 27, 2024 1

Looks like all of the issues I ran into have been addressed in 0.10.

The duration, timestamp, and date columns are parsed correctly (had some issues with openpyxl with the duration columns, and issues with the blank rows with fastexcel). The employee ID column also retains the IDs which look like integers instead of replacing them with nulls.

Very cool, thanks for the work on this.

from fastexcel.

ldacey avatar ldacey commented on September 27, 2024

example-skip-rows.xlsx

Added a file which represents a related issue. The columns representing hours, minutes, seconds spent in a state are returning as nulls for some reason.

excel_reader = fastexcel.read_excel(local_path)
pl_df = excel_reader.load_sheet(idx_or_name=0, header_row=9).to_polars()

If I read the same file with pandas the values are returned (pd_df = pd.read_excel(local_path, engine="openpyxl", skiprows=10)

image

And one more example:

test_excel_engine.xlsx
Another issue is that mixed columns which have values that look like numbers are being replaced with nulls.

image

You can see anyone that has an ID with text values has been replaced by nulls and the column is interpreted as a float

image

from fastexcel.

lukapeschke avatar lukapeschke commented on September 27, 2024

Thank you for the clean bug report, and especially for the files, they make debugging much easier 🙏

I do indeed reproduce the three issues you're describing. It seems to me that they a re caused by two distinct issues:

  • For durations, we're we using calamine's as_time API to convert durations to milliseconds, as the as_duration API was not available yet. However, this does not work anymore, as as_time will return None for Duration values. The fix is very easy, and will be out soon
  • Type coercion is currently done on the first cell of the column only. It should be done on the entire column. The fix for that will be longer, as we need to figure out a way to do it without impacting performance too much.

I'll create two child issues for this.

@deanm0000

There should be a way to override type inference

Could you please create a separate issue for that ? It definitely seems like a reasonable feature to ask for, but it will need some discussion so we can agree on an API.

@ldacey Would it be OK with you if I added your files as test cases to the repo ? Or should I modify the data in them ?

from fastexcel.

ldacey avatar ldacey commented on September 27, 2024

Hi - 0.9.0 works great for most of the files I needed to process but there is something odd going on if there are blank rows.

example-skip-rows.xlsx

The first sheet works fine, but the second sheet fails (that reflects how actual data comes in from the source system):

File line 83, in
pl_df = excel_reader.load_sheet(idx_or_name=1, header_row=9).to_polars()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File line 64, in to_polars
df = pl.from_arrow(data=self.to_arrow())
^^^^^^^^^^^^^^^
File line 47, in to_arrow
return self._sheet.to_arrow()
^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Could not create RecordBatch from sheet Sheet2

Caused by:
0: Could not build schema for sheet Sheet2
1: could not figure out column type for following type combination: {Timestamp(Millisecond, None), Utf8}

image

I could not figure out the precise cause. There is one row per employee ID per date even if they did not log into the system but fastexcel does not seem to like it if the first rows are null for the value columns.

Same issue if I try using to_pandas:

File line 55, in to_pandas
return self.to_arrow().to_pandas()
^^^^^^^^^^^^^^^
File line 47, in to_arrow
return self._sheet.to_arrow()

from fastexcel.

lukapeschke avatar lukapeschke commented on September 27, 2024

Glad to know it works for you!

from fastexcel.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.