Comments (7)
You can add those files - the data is all fake but the structure and columns match some real sources we get from some ancient reporting tools.
from fastexcel.
There are 3 issues:
- 1 issue with milliseconds that has been fixed in #159
- 1 issue with dtype guessing tracked by #160
- 1 feature request: to be able to enforce dtypes tracked by #173
from fastexcel.
Looks like all of the issues I ran into have been addressed in 0.10.
The duration, timestamp, and date columns are parsed correctly (had some issues with openpyxl with the duration columns, and issues with the blank rows with fastexcel). The employee ID column also retains the IDs which look like integers instead of replacing them with nulls.
Very cool, thanks for the work on this.
from fastexcel.
Added a file which represents a related issue. The columns representing hours, minutes, seconds spent in a state are returning as nulls for some reason.
excel_reader = fastexcel.read_excel(local_path)
pl_df = excel_reader.load_sheet(idx_or_name=0, header_row=9).to_polars()
If I read the same file with pandas the values are returned (pd_df = pd.read_excel(local_path, engine="openpyxl", skiprows=10)
And one more example:
test_excel_engine.xlsx
Another issue is that mixed columns which have values that look like numbers are being replaced with nulls.
You can see anyone that has an ID with text values has been replaced by nulls and the column is interpreted as a float
from fastexcel.
Thank you for the clean bug report, and especially for the files, they make debugging much easier 🙏
I do indeed reproduce the three issues you're describing. It seems to me that they a re caused by two distinct issues:
- For durations, we're we using calamine's
as_time
API to convert durations to milliseconds, as theas_duration
API was not available yet. However, this does not work anymore, asas_time
will returnNone
forDuration
values. The fix is very easy, and will be out soon - Type coercion is currently done on the first cell of the column only. It should be done on the entire column. The fix for that will be longer, as we need to figure out a way to do it without impacting performance too much.
I'll create two child issues for this.
There should be a way to override type inference
Could you please create a separate issue for that ? It definitely seems like a reasonable feature to ask for, but it will need some discussion so we can agree on an API.
@ldacey Would it be OK with you if I added your files as test cases to the repo ? Or should I modify the data in them ?
from fastexcel.
Hi - 0.9.0 works great for most of the files I needed to process but there is something odd going on if there are blank rows.
The first sheet works fine, but the second sheet fails (that reflects how actual data comes in from the source system):
File line 83, in
pl_df = excel_reader.load_sheet(idx_or_name=1, header_row=9).to_polars()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File line 64, in to_polars
df = pl.from_arrow(data=self.to_arrow())
^^^^^^^^^^^^^^^
File line 47, in to_arrow
return self._sheet.to_arrow()
^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Could not create RecordBatch from sheet Sheet2Caused by:
0: Could not build schema for sheet Sheet2
1: could not figure out column type for following type combination: {Timestamp(Millisecond, None), Utf8}
I could not figure out the precise cause. There is one row per employee ID per date even if they did not log into the system but fastexcel does not seem to like it if the first rows are null for the value columns.
Same issue if I try using to_pandas:
File line 55, in to_pandas
return self.to_arrow().to_pandas()
^^^^^^^^^^^^^^^
File line 47, in to_arrow
return self._sheet.to_arrow()
from fastexcel.
Glad to know it works for you!
from fastexcel.
Related Issues (20)
- Add `typing-extensions` to dependencies HOT 2
- Option to retain special symbols/ errors in excel as a string HOT 1
- usecols - don't require full range
- header_row does not work as expected if there are blank rows HOT 6
- Xlsx Chart Sheet Crash HOT 4
- schema_sample_rows=0 results in a table filled with null values HOT 6
- feat: support callable in use_columns parameter
- Add an option to disable automatic type coercion
- When coercing columns to strings, boolean cells turn into null HOT 5
- Coercing dates to strings adds time of the form '00:00:00' HOT 2
- Add a note to the `skip_rows` and `header_row` parameter description that rows before header_row are automatically skipped HOT 1
- Let `dtypes` take single value that applies to all columns AND/OR cast DateTime to String instead of raising UnsupportedColumnTypeCombinationError HOT 2
- Make new `dtype_coercion`='stringify' option
- use_columns is ignored when sheets are loaded eagerly
- Implement Arrow PyCapsule Interface & make pyarrow optional dependency HOT 1
- Unable to import fastexcel - symbol not found in flat namespace '_PyPyBaseObject_Type' HOT 9
- msys2 install failed: ERROR: Could not find a version that satisfies the requirement fastexcel (from versions: none) HOT 3
- Strange failure to determine `string` dtype column in ODS sheet/file HOT 2
- Silent deletion of data should be construed as a bug, not the default behavior
- Sheet `to_arrow()` produces `null not null` columns HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from fastexcel.