
fastexcel's People

Contributors

alexander-beedie, dependabot[bot], jgundermann, jlondonobo, lukapeschke, prettywood


fastexcel's Issues

Add reproducible benchmarks to the README

Add benchmarks to the README (speed & memory), with scripts allowing to reproduce them. Add multiple scenarios. Some ideas could be:

  • Single sheet, not chunked
  • Single sheet, chunked
  • Excel to parquet, [not] chunked
  • Excel to polars (once available)
  • ...

Refactoring the creation of Arrow arrays

fn create_boolean_array(data: &Range<CalDataType>, col: usize, height: usize) -> Arc<dyn Array> {
    Arc::new(BooleanArray::from_iter((1..height).map(|row| {
        data.get((row, col)).and_then(|cell| cell.get_bool())
    })))
}

fn create_int_array(data: &Range<CalDataType>, col: usize, height: usize) -> Arc<dyn Array> {
    Arc::new(Int64Array::from_iter(
        (1..height).map(|row| data.get((row, col)).and_then(|cell| cell.get_int())),
    ))
}

fn create_float_array(data: &Range<CalDataType>, col: usize, height: usize) -> Arc<dyn Array> {
    Arc::new(Float64Array::from_iter((1..height).map(|row| {
        data.get((row, col)).and_then(|cell| cell.get_float())
    })))
}

fn create_string_array(data: &Range<CalDataType>, col: usize, height: usize) -> Arc<dyn Array> {
    Arc::new(StringArray::from_iter((1..height).map(|row| {
        data.get((row, col)).and_then(|cell| cell.get_string())
    })))
}

fn create_date_array(data: &Range<CalDataType>, col: usize, height: usize) -> Arc<dyn Array> {
    Arc::new(TimestampMillisecondArray::from_iter((1..height).map(|row| {
        data.get((row, col))
            .and_then(|cell| cell.as_datetime())
            .map(|dt| dt.timestamp_millis())
    })))
}

Maybe we could change this:

  • Maybe add a trait
  • Maybe a closure
  • Maybe something else...

Fastexcel fails to load an Excel file with polars

dfs = pl.read_excel('input.xlsx', engine='calamine', sheet_id=0)
for key in dfs.keys():
    print(dfs[key].head())

Error:

  Traceback (most recent call last):
  File "/Users/Desktop/workspace/pocs/demo.py", line 27, in <module>
    dfs = pl.read_excel('compass_input.xlsx', engine='calamine', sheet_id=0)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Desktop/workspace/pocs/.venv/lib/python3.11/site-packages/polars/utils/deprecation.py", line 136, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Desktop/workspace/pocs/.venv/lib/python3.11/site-packages/polars/utils/deprecation.py", line 136, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Desktop/workspace/pocs/.venv/lib/python3.11/site-packages/polars/io/spreadsheet/functions.py", line 259, in read_excel
    return _read_spreadsheet(
           ^^^^^^^^^^^^^^^^^^
  File "/Users/Desktop/workspace/pocs/.venv/lib/python3.11/site-packages/polars/io/spreadsheet/functions.py", line 487, in _read_spreadsheet
    parsed_sheets = {
                    ^
  File "/Users/Desktop/workspace/pocs/.venv/lib/python3.11/site-packages/polars/io/spreadsheet/functions.py", line 488, in <dictcomp>
    name: reader_fn(
          ^^^^^^^^^^
  File "/Users/Desktop/workspace/pocs/.venv/lib/python3.11/site-packages/polars/io/spreadsheet/functions.py", line 834, in _read_spreadsheet_calamine
    df = ws.to_polars()
         ^^^^^^^^^^^^^^
  File "/Users/Desktop/workspace/pocs/.venv/lib/python3.11/site-packages/fastexcel/__init__.py", line 64, in to_polars
    df = pl.from_arrow(data=self.to_arrow())
                            ^^^^^^^^^^^^^^^
  File "/Users/Desktop/workspace/pocs/.venv/lib/python3.11/site-packages/fastexcel/__init__.py", line 47, in to_arrow
    return self._sheet.to_arrow()
           ^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Could not create RecordBatch from sheet Hybrid_Productivity

Caused by:
    0: Could not build schema for sheet Hybrid_Productivity
    1: Error in calamine cell: NA

Installed versions:

--------Version info---------
Polars:               0.20.7
Index type:           UInt32
Platform:             macOS-12.7.3-arm64-arm-64bit
Python:               3.11.2 (v3.11.2:878ead1ac1, Feb  7 2023, 10:02:41) [Clang 13.0.0 (clang-1300.0.29.30)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.0
pyarrow:              15.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
None


feat: provide a `to_python` method

Provide a `to_python` method that would convert data to a `list[list[int | float | str | datetime | date | timedelta | None]]`.

Add a parameter to the `to_pandas` and `to_polars` methods to allow falling back on Python object creation in case `to_arrow` fails.
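The fallback behaviour being proposed could be sketched like this (everything here is hypothetical: neither `to_python` nor the fallback parameter exists in fastexcel, and the two inner functions are stand-ins):

```python
def convert_with_fallback(to_arrow, to_python):
    """Try the fast Arrow path first; fall back to Python objects on failure."""
    try:
        return to_arrow()
    except RuntimeError:
        return to_python()

def failing_arrow():
    # stand-in for a to_arrow() call that fails on an odd cell
    raise RuntimeError("Could not create RecordBatch")

def python_objects():
    # stand-in for the proposed to_python() conversion
    return [[1, 2.5, "a", None]]

print(convert_with_fallback(failing_arrow, python_objects))  # [[1, 2.5, 'a', None]]
```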

Code license

Are you going to specify a license for this code?
Since it uses Apache Arrow, maybe you could use the Apache license?
I'd like to use it at work, but we can't use code without an open source license.

ARM Mac fails to install package

A colleague of mine tried to build my Dockerfile, which installs fastexcel, but it failed to find a package. The only difference in our setups is that he is using a Mac with ARM architecture.

It seems that there is no wheel for that arch in the build process (I guess the macOS build images are x86).

I think what could help is to have an sdist package, so in the absence of a wheel it can be downloaded and built locally: https://github.com/PyO3/maturin#python-packaging-basics
At least I think this is what some other projects do: https://github.com/pydantic/pydantic-core/blob/3f7df809123b643f84b9b5e39c008d9300692462/.github/workflows/ci.yml#L350-L363

parameter for dtype override option AND/OR better inference

Here's a file that when I open it with:

filename = '2017-2018-annual-auction-round-2-results.xls'
sheetname = 'Annual 2017-18 Results RD 2'
df = fastexcel.read_excel(filename).load_sheet_by_name(sheetname).to_polars()

then the Source PNODEID column comes through as float64 with a bunch of nulls. I'd prefer it to be an int, but it seems it should at least return a String rather than a Float with nulls.

My workaround is to use python-calamine like this; not sure if there's a better way:

cal = CalamineWorkbook.from_path(filename)
df2 = pl.from_records(cal.get_sheet_by_name(sheetname).to_python(), orient='row')
df2.columns = df2.rows()[0]
df2.with_columns(pl.col('Source PNODEID').cast(pl.Int64))
# I have to cast all the non-string columns, but this works; no value raises

To sum up:

  1. The type inference has an issue here.
  2. There should be a way to override type inference.
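The inference concern in point 1 can be illustrated with a toy inferrer (a hypothetical helper, not fastexcel's actual logic): an integer column with missing values need not be coerced to float, since Arrow integer arrays are nullable.

```python
def infer_dtype(values):
    """Toy column dtype inference: nulls are ignored when picking the type."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "null"
    if all(isinstance(v, int) for v in non_null):
        return "int64"  # nulls need not force a float dtype
    if all(isinstance(v, (int, float)) for v in non_null):
        return "float64"
    return "string"

print(infer_dtype([1, None, 3]))        # int64
print(infer_dtype([1, 2.5, None]))      # float64
print(infer_dtype(["A1", None, "B2"]))  # string
```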

[docs] Provide a section about type inference

It'd be great to have a section explaining a bit how type inference works in the docs. Specifically:

  • How dtypes are inferred
  • Which type combinations get coerced to what
  • How specifying dtypes explicitly influences that

request: arm64 linux wheels

Hey, any chance of publishing wheels for linux/arm64? The lack of an sdist means we're kind of stuck without it.
Is this on the roadmap already, or would you appreciate a PR adding it to your release action?

Thanks for maintaining this!

Improve schema inference - wrongly assigns null dtype and misses reading valid data at end of large file

Team,

I'm trying to read a large xlsx file (around 500k rows with 8 columns). I was using pandas with the calamine engine before, all good. Then I got to know about polars with fastexcel [calamine], tried to adopt it, and found this bug.

Observation:

While reading the above large xlsx file, whose 1st column has its first valid value only at the 143,407th row out of 500k records, I observe two issues:

  1. The 1st column is inferred with a dtype of "null" (see the attached screenshot).

  2. The 1st column is left empty in the csv file, even though valid values are present from the 143,407th row onwards in the xlsx file. I validated this by writing the read dataframe out to csv.

Code details:

  1. Polars code

import polars as pl

df = pl.read_excel(source=<<path of test.xlsx (attached)>>, sheet_id=1, engine="calamine", raise_if_empty=False)
print(df.sample(10))

  1. Pandas code

import pandas as pd

df = pd.read_excel(path, sheet_name=0, engine="calamine", dtype=str, na_filter=False)

Step to reproduce the issue:

Take the attached test.xlsx file, with its 1st column all empty/null, and run the above polars code. It prints the 1st column's inferred data type as null.

Please refer to the pandas code above; that works for me without any data loss.

Technical feature requests to resolve the issue:

  1. An ExcelReader option to read xlsx fields "as-is" / as strings (without schema inference).
  2. An ExcelReader option to control the schema inference row limit (hopefully useful for other use cases with fewer records, though not in mine).

System/lib details:
Windows OS, Python 3.11.5, polars==0.20.16, fastexcel==0.9.1, pandas==2.2.1, python-calamine==0.2.0

test.xlsx

Add skip_rows and n_rows

The idea of skip_rows and n_rows is to be able to load only part of the data:

  • skip_rows: the number of rows to skip after the header
  • n_rows: the number of rows to retrieve
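The proposed semantics could look like this pure-Python sketch (an assumption about the intended behaviour, not an implemented API):

```python
def select_rows(rows, skip_rows=0, n_rows=None):
    """Drop `skip_rows` rows after the header, then keep at most `n_rows`."""
    header, *data = rows
    data = data[skip_rows:]
    if n_rows is not None:
        data = data[:n_rows]
    return [header] + data

rows = [["col"], [1], [2], [3], [4]]
print(select_rows(rows, skip_rows=1, n_rows=2))  # [['col'], [2], [3]]
```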

"use_columns" keyword argument not recognized by `load_sheet`.

According to https://fastexcel.toucantoco.dev/fastexcel.html#ExcelReader.load_sheet, there is a "use_columns" keyword
argument, but in fact the keyword is not recognized.

>>> import fastexcel
>>> te = fastexcel.read_excel("1.xlsx")
>>> te
ExcelReader<1.xlsx>
>>> te.load_sheet(0, use_columns="A:E")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ExcelReader.load_sheet() got an unexpected keyword argument 'use_columns'
>>>

Columns containing missing values will render entire column as `NaN`

(
    fastexcel.read_excel(file)
    .load_sheet(idx_or_name=0, header_row=4)
    .to_pandas()
    .iloc[:, 1:]
)

When reading an Excel file, if a column has missing values, it will render the entire column as NaN (even if it is not missing).


Upon running .info():

(
    fastexcel.read_excel(file)
    .load_sheet(idx_or_name=0, header_row=4)
    .to_pandas()
    .iloc[:, 1:]
    .info()
)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9438 entries, 0 to 9437
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Date              9435 non-null   object 
 1   Transaction Type  9435 non-null   object 
 2   Num               9435 non-null   float64
 3   Name              9435 non-null   object 
 4   Memo/Description  0 non-null      object 
 5   Due Date          9435 non-null   object 
 6   Amount            9435 non-null   float64
 7   Open Balance      9435 non-null   float64
 8   Jurisdiction      0 non-null      object 
 9   Project Manager   0 non-null      object 
 10  A/R Paid          9435 non-null   object 
 11  Project Address   0 non-null      object 
 12  Permit No         0 non-null      object 
 13  Sales Rep         0 non-null      object 
dtypes: float64(3), object(11)
memory usage: 1.0+ MB

[CI] Fix docs jobs

The docs job fails on the "commit" step when no changes have happened to the docs. Make the script a bit smarter and exit early if there is nothing to be committed.

ExcelReader parses "NULL" as string values, instead of empty/null values

Hello, I'm following up on pola-rs/polars#14495 as you requested. After a bit more digging, I figured out where part of the issue is.

Attached are three short example files, containing:

  • TEST_FASTEXCEL_WITH_NULLS.xlsx
    • 6 columns with 10 records each and a header, each column formatted as the data type is supposed to represent (integer, date, timestamp, float), but including "NULL" as values, representing empty values.
  • TEST_FASTEXCEL_WITHOUT_NULLS.xlsx
    • Same as above, but removing "NULL" values and leaving those cells empty instead
  • TEST_FASTEXCEL_MIXED_DATA.xlsx
    • Same as the prior one, but introducing two additional columns with mixed data types (integer plus string, date plus string, no "NULL" string values this time)

As it stands, Fastexcel (via polars) is not able to infer mixed types in the first and third examples/files, but can and does load the second file whilst inferring the data types correctly. This indicates that cells containing "NULL" are read as strings, instead of empty values.

I hope this helps. The string fallback conversion would be good, but given that "NULL" values are commonplace, especially in the context of massive CSV created from SQL dumps, I think addressing this first would fix a lot of loading issues.

Thank you for working on this wrapper!
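For illustration, the pre-processing suggested above could be sketched like this (a hypothetical helper; where such a hook would live in fastexcel is an open question):

```python
def normalize_nulls(rows, null_strings=frozenset({"NULL"})):
    """Replace literal "NULL" strings with real None values before
    type inference runs, so mixed columns can be inferred cleanly."""
    return [
        [None if cell in null_strings else cell for cell in row]
        for row in rows
    ]

print(normalize_nulls([[1, "NULL"], ["NULL", 2.5]]))  # [[1, None], [None, 2.5]]
```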

Cannot handle special symbols

Hi team,
I am facing an issue reading an Excel sheet through polars. It says `calamine cell error: #VALUE!`.
The sheet of interest does not get read through the standard API.

How to handle such special symbols through fastexcel?

Thank you.

I am using fastexcel==0.10.2.

Empty sheet crash

When we load an empty sheet, fastexcel crashes. We may want a more graceful approach.
We could return an empty RecordBatch?

When the bug is fixed, you may also want to update the benchmark test file.

`column_names` are taken partially when `use_columns` does not include all columns

Given this worksheet data (without any empty area):

A B C
21 22 23
31 32 33
41 42 43

The following code:

import fastexcel as fe


file = r'<path to file>'
params = {'idx_or_name': 0, 'header_row': None, 'skip_rows': 1, 'use_columns': [1, 2], 'column_names': ['Col B', 'Col C']}

print(fe.read_excel(file).load_sheet(**params).to_polars())

Outputs:

shape: (3, 2)
┌───────┬──────────────┐
│ Col C ┆ __UNNAMED__2 │
│ ---   ┆ ---          │
│ f64   ┆ f64          │
╞═══════╪══════════════╡
│ 22.0  ┆ 23.0         │
│ 32.0  ┆ 33.0         │
│ 42.0  ┆ 43.0         │
└───────┴──────────────┘

Expected output:

shape: (3, 2)
┌───────┬───────┐
│ Col B ┆ Col C │
│ ---   ┆ ---   │
│ f64   ┆ f64   │
╞═══════╪═══════╡
│ 22.0  ┆ 23.0  │
│ 32.0  ┆ 33.0  │
│ 42.0  ┆ 43.0  │
└───────┴───────┘
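The expected mapping can be stated as a one-liner (illustration only, not fastexcel code): each entry of `column_names` should name the corresponding *selected* column, not the column at the same sheet index.

```python
def expected_names(use_columns, column_names):
    """Pair the i-th selected column index with the i-th provided name."""
    return dict(zip(use_columns, column_names))

print(expected_names([1, 2], ["Col B", "Col C"]))  # {1: 'Col B', 2: 'Col C'}
```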

improve README + doc

  • add getting started in README and doc
  • add badges
  • add CONTRIBUTING
  • add PULL REQUEST template

Straightforward but important for repo quality

Implement multiple header_rows

I find that people often have Excel sheets where they use multiple header rows to make up their column names.

Here's a snippet of how I deal with that, coming from a df generated with python_calamine:

header_merge_char = "_"  # for example
header_rows = 3  # again, for example
df.columns = [
    header_merge_char.join([y for y in x if y != ""])
    for x in zip(*[df.rows()[x] for x in range(header_rows)])
]

this is just a snippet and doesn't handle duplicate column names but that's a separate issue.

A more advanced version of this might infer header_rows automatically: skip down (say) 10 rows and look for types starting there, choose a column which isn't a string, go back to the true row 0, and see how many rows down it needs to go before it no longer sees strings. That count is the inferred header_rows.
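A self-contained variant of the merging snippet above, operating on plain rows rather than a dataframe (the example data is made up):

```python
def merge_header_rows(rows, header_rows, sep="_"):
    """Join the first `header_rows` rows cell-wise, skipping empty cells,
    to build merged column names."""
    headers = rows[:header_rows]
    return [
        sep.join(cell for cell in cells if cell != "")
        for cells in zip(*headers)
    ]

rows = [
    ["Region", "", ""],
    ["", "Sales", "Sales"],
    ["", "2022", "2023"],
]
print(merge_header_rows(rows, 3))  # ['Region', 'Sales_2022', 'Sales_2023']
```

Like the original snippet, this doesn't handle duplicate column names.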

Move Arrow Schema out of ExcelSheet struct

The schema should not be part of the ExcelSheet struct.
Currently it is needed only in the TryFrom<&ExcelSheet> for RecordBatch implementation, so we should be able to create it on demand.

Allow to disable skipping empty rows/columns at the beginning of the worksheet

There is a similar issue, which was closed without, as it appears, any fix (with v0.9.1 being current).

Expected behaviour is outlined in examples of python-calamine for an option skip_empty_area:

from python_calamine import CalamineWorkbook

workbook = CalamineWorkbook.from_path("file.xlsx").get_sheet_by_name("Sheet1").to_python(skip_empty_area=False)
# [
#   ["", "", "", "", "", "", ""],
#   ["1", "2", "3", "4", "5", "6", "7"],
#   ["1", "2", "3", "4", "5", "6", "7"],
#   ["1", "2", "3", "4", "5", "6", "7"],
# ]

This automatic behavior is kind of surprising when dealing with files with empty parts at the beginning, as the row calculations get confusing: for header_row you have to count rows as they appear (i.e. including empty rows), but for skip_rows you have to count as if there were no empty rows, which is clearly not the most user-friendly approach.

Please consider adding an option/parameter to disable the default behavior.
Thank you!

Support for Reading Excel Tables

In my experience it's usually much safer to load data from an Excel table than from a sheet. It would be nice if one could get the table names per sheet and get the table data as arrow/pandas, like with the sheets.

Provide `abi3` wheels

abi3 wheels would allow us to build for our lowest supported Python version only (currently 3.8), while still being compatible with higher versions. This would allow producing a single artifact per target platform/arch rather than a wheel per platform/arch/version: https://pyo3.rs/v0.14.5/building_and_distribution#py_limited_apiabi3

It is also what polars does currently

Pros:

  • Fewer artifacts to build
  • Faster CI (much smaller wheel build check matrix)

A potential downside could be a performance cost. To quote PyO3's docs:

The downside of this is that PyO3 can't use optimizations which rely on being compiled against a known exact Python version. It's up to you to decide whether this matters for your extension module.

We'd have to run a benchmark with both types of wheels in Python 3.12 to check the impact of this, but since our API is quite simple, we shouldn't see a difference
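For reference, a minimal sketch of the Cargo change (the version number is illustrative; `abi3-py38` is the PyO3 feature flag described in the linked docs):

```toml
# Cargo.toml -- opt in to the stable ABI, with CPython 3.8 as the floor.
[dependencies.pyo3]
version = "0.20"  # illustrative version
features = ["extension-module", "abi3-py38"]
```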
