
fastexcel's People

Contributors

alexander-beedie, dependabot[bot], jgundermann, jlondonobo, lukapeschke, prettywood


fastexcel's Issues

Add reproducible benchmarks to the README

Add benchmarks to the README (speed & memory), with scripts allowing to reproduce them. Add multiple scenarios. Some ideas could be:

  • Single sheet, not chunked
  • Single sheet, chunked
  • Excel to parquet, [not] chunked
  • Excel to polars (once available)
  • ...

Refactoring the creation of Arrow arrays

fn create_boolean_array(data: &Range<CalDataType>, col: usize, height: usize) -> Arc<dyn Array> {
    Arc::new(BooleanArray::from_iter((1..height).map(|row| {
        data.get((row, col)).and_then(|cell| cell.get_bool())
    })))
}

fn create_int_array(data: &Range<CalDataType>, col: usize, height: usize) -> Arc<dyn Array> {
    Arc::new(Int64Array::from_iter(
        (1..height).map(|row| data.get((row, col)).and_then(|cell| cell.get_int())),
    ))
}

fn create_float_array(data: &Range<CalDataType>, col: usize, height: usize) -> Arc<dyn Array> {
    Arc::new(Float64Array::from_iter((1..height).map(|row| {
        data.get((row, col)).and_then(|cell| cell.get_float())
    })))
}

fn create_string_array(data: &Range<CalDataType>, col: usize, height: usize) -> Arc<dyn Array> {
    Arc::new(StringArray::from_iter((1..height).map(|row| {
        data.get((row, col)).and_then(|cell| cell.get_string())
    })))
}

fn create_date_array(data: &Range<CalDataType>, col: usize, height: usize) -> Arc<dyn Array> {
    Arc::new(TimestampMillisecondArray::from_iter((1..height).map(|row| {
        data.get((row, col))
            .and_then(|cell| cell.as_datetime())
            .map(|dt| dt.timestamp_millis())
    })))
}

Maybe we could change this:

  • Maybe add a trait
  • Maybe a closure
  • Maybe something else...

Fastexcel fails to load an Excel file with polars

dfs = pl.read_excel('input.xlsx', engine='calamine', sheet_id=0)
for key in dfs.keys():
    print(dfs[key].head())

Error:

  Traceback (most recent call last):
  File "/Users/Desktop/workspace/pocs/demo.py", line 27, in <module>
    dfs = pl.read_excel('compass_input.xlsx', engine='calamine', sheet_id=0)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Desktop/workspace/pocs/.venv/lib/python3.11/site-packages/polars/utils/deprecation.py", line 136, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Desktop/workspace/pocs/.venv/lib/python3.11/site-packages/polars/utils/deprecation.py", line 136, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Desktop/workspace/pocs/.venv/lib/python3.11/site-packages/polars/io/spreadsheet/functions.py", line 259, in read_excel
    return _read_spreadsheet(
           ^^^^^^^^^^^^^^^^^^
  File "/Users/Desktop/workspace/pocs/.venv/lib/python3.11/site-packages/polars/io/spreadsheet/functions.py", line 487, in _read_spreadsheet
    parsed_sheets = {
                    ^
  File "/Users/Desktop/workspace/pocs/.venv/lib/python3.11/site-packages/polars/io/spreadsheet/functions.py", line 488, in <dictcomp>
    name: reader_fn(
          ^^^^^^^^^^
  File "/Users/Desktop/workspace/pocs/.venv/lib/python3.11/site-packages/polars/io/spreadsheet/functions.py", line 834, in _read_spreadsheet_calamine
    df = ws.to_polars()
         ^^^^^^^^^^^^^^
  File "/Users/Desktop/workspace/pocs/.venv/lib/python3.11/site-packages/fastexcel/__init__.py", line 64, in to_polars
    df = pl.from_arrow(data=self.to_arrow())
                            ^^^^^^^^^^^^^^^
  File "/Users/Desktop/workspace/pocs/.venv/lib/python3.11/site-packages/fastexcel/__init__.py", line 47, in to_arrow
    return self._sheet.to_arrow()
           ^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Could not create RecordBatch from sheet Hybrid_Productivity

Caused by:
    0: Could not build schema for sheet Hybrid_Productivity
    1: Error in calamine cell: NA

Installed versions:

--------Version info---------
Polars:               0.20.7
Index type:           UInt32
Platform:             macOS-12.7.3-arm64-arm-64bit
Python:               3.11.2 (v3.11.2:878ead1ac1, Feb  7 2023, 10:02:41) [Clang 13.0.0 (clang-1300.0.29.30)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.0
pyarrow:              15.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
None


feat: provide a `to_python` method

Provide a `to_python` method that would convert data to a `list[list[int | float | str | datetime | date | timedelta | None]]`.

Add a parameter to the `to_pandas` and `to_polars` methods to allow falling back on Python object creation in case `to_arrow` fails.
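The fallback behaviour being proposed could be sketched like this (everything here is hypothetical: neither `to_python` nor the fallback parameter exists in fastexcel, and the two inner functions are stand-ins):

```python
def convert_with_fallback(to_arrow, to_python):
    """Try the fast Arrow path first; fall back to Python objects on failure."""
    try:
        return to_arrow()
    except RuntimeError:
        return to_python()

def failing_arrow():
    # stand-in for a to_arrow() call that fails on an odd cell
    raise RuntimeError("Could not create RecordBatch")

def python_objects():
    # stand-in for the proposed to_python() conversion
    return [[1, 2.5, "a", None]]

print(convert_with_fallback(failing_arrow, python_objects))  # [[1, 2.5, 'a', None]]
```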

Code license

Are you going to specify a license for this code?
Since it uses Apache Arrow, maybe you could use the Apache license?
I'd like to use it at work, but we can't use code without an open source license.

ARM Mac fails to install package

A colleague of mine tried to build my Dockerfile, which installs fastexcel, but it failed to find a package. The only difference in our setups is that he is using a Mac with ARM architecture.

It seems that there is no wheel for that arch in the build process (I guess the macOS build images are x86).

I think what could help is to have an sdist package, so in the absence of a wheel it can be downloaded and built locally: https://github.com/PyO3/maturin#python-packaging-basics
At least I think this is what some other projects do: https://github.com/pydantic/pydantic-core/blob/3f7df809123b643f84b9b5e39c008d9300692462/.github/workflows/ci.yml#L350-L363

parameter for dtype override option AND/OR better inference

Here's a file that when I open it with:

filename = '2017-2018-annual-auction-round-2-results.xls'
sheetname = 'Annual 2017-18 Results RD 2'
df = fastexcel.read_excel(filename).load_sheet_by_name(sheetname).to_polars()

then the Source PNODEID column comes through as float64 with a bunch of nulls. I'd prefer it to be an int, but it seems it should at least return a String rather than a Float with nulls.

My workaround is to use python-calamine like this; not sure if there's a better way:

cal = CalamineWorkbook.from_path(filename)
df2 = pl.from_records(cal.get_sheet_by_name(sheetname).to_python(), orient='row')
df2.columns = df2.rows()[0]
df2.with_columns(pl.col('Source PNODEID').cast(pl.Int64))
# I have to cast all the non-string columns, but this works; no value raises

To sum up:

  1. The type inference has an issue here.
  2. There should be a way to override type inference.
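The inference concern in point 1 can be illustrated with a toy inferrer (a hypothetical helper, not fastexcel's actual logic): an integer column with missing values need not be coerced to float, since Arrow integer arrays are nullable.

```python
def infer_dtype(values):
    """Toy column dtype inference: nulls are ignored when picking the type."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "null"
    if all(isinstance(v, int) for v in non_null):
        return "int64"  # nulls need not force a float dtype
    if all(isinstance(v, (int, float)) for v in non_null):
        return "float64"
    return "string"

print(infer_dtype([1, None, 3]))        # int64
print(infer_dtype([1, 2.5, None]))      # float64
print(infer_dtype(["A1", None, "B2"]))  # string
```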

[docs] Provide a section about type inference

It'd be great to have a section explaining a bit how type inference works in the docs. Specifically:

  • How dtypes are inferred
  • Which type combinations get coerced to what
  • How specifying dtypes explicitly influences that

request: arm64 linux wheels

Hey, any chance of publishing wheels for linux/arm64? The lack of an sdist means we're kind of stuck without it.
Is this on the roadmap already, or would you appreciate a PR adding it to your release action?

Thanks for maintaining this!

Improve schema inference - wrongly assigns null dtype and misses reading valid data at end of large file

Team,

I'm trying to read a large xlsx file (around 500k rows with 8 columns). I was using pandas with the calamine engine before, all good. Then I got to know about polars with fastexcel [calamine], tried to adopt it, and found this bug.

Observation:

While reading the above large xlsx file, whose 1st column has its first valid value only at the 143,407th row out of 500k records, I observe two issues:

  1. The 1st column is inferred with a dtype of "null" (see the attached screenshot).

  2. The 1st column is left empty in the csv file, even though valid values are present from the 143,407th row onwards in the xlsx file. I validated this by writing the read dataframe out to csv.

Code details:

  1. Polars code

import polars as pl

df = pl.read_excel(source=<<path of test.xlsx (attached)>>, sheet_id=1, engine="calamine", raise_if_empty=False)
print(df.sample(10))

  1. Pandas code

import pandas as pd

df = pd.read_excel(path, sheet_name=0, engine="calamine", dtype=str, na_filter=False)

Step to reproduce the issue:

Take the attached test.xlsx file, with its 1st column all empty/null, and run the above polars code. It prints the 1st column's inferred data type as null.

Please refer to the pandas code above; that works for me without any data loss.

Technical feature requests to resolve the issue:

  1. An ExcelReader option to read xlsx fields "as-is" / as strings (without schema inference).
  2. An ExcelReader option to control the schema inference row limit (hopefully useful for other use cases with fewer records, though not in mine).

System/lib details:
Windows OS, Python 3.11.5, polars==0.20.16, fastexcel==0.9.1, pandas==2.2.1, python-calamine==0.2.0

test.xlsx

Add skip_rows and n_rows

The idea of skip_rows and n_rows is to be able to load only part of the data:

  • skip_rows: the number of rows to skip after the header
  • n_rows: the number of rows to retrieve
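The proposed semantics could look like this pure-Python sketch (an assumption about the intended behaviour, not an implemented API):

```python
def select_rows(rows, skip_rows=0, n_rows=None):
    """Drop `skip_rows` rows after the header, then keep at most `n_rows`."""
    header, *data = rows
    data = data[skip_rows:]
    if n_rows is not None:
        data = data[:n_rows]
    return [header] + data

rows = [["col"], [1], [2], [3], [4]]
print(select_rows(rows, skip_rows=1, n_rows=2))  # [['col'], [2], [3]]
```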

"use_columns" keyword argument not recognized by `load_sheet`.

According to https://fastexcel.toucantoco.dev/fastexcel.html#ExcelReader.load_sheet, there is a "use_columns" keyword
argument, but in fact the keyword is not recognized.

>>> import fastexcel
>>> te = fastexcel.read_excel("1.xlsx")
>>> te
ExcelReader<1.xlsx>
>>> te.load_sheet(0, use_columns="A:E")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ExcelReader.load_sheet() got an unexpected keyword argument 'use_columns'
>>>

Columns containing missing values will render entire column as `NaN`

(
    fastexcel.read_excel(file)
    .load_sheet(idx_or_name=0, header_row=4)
    .to_pandas()
    .iloc[:, 1:]
)

When reading an Excel file, if a column has missing values, it will render the entire column as NaN (even if it is not missing).


Upon running .info():

(
    fastexcel.read_excel(file)
    .load_sheet(idx_or_name=0, header_row=4)
    .to_pandas()
    .iloc[:, 1:]
    .info()
)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9438 entries, 0 to 9437
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Date              9435 non-null   object 
 1   Transaction Type  9435 non-null   object 
 2   Num               9435 non-null   float64
 3   Name              9435 non-null   object 
 4   Memo/Description  0 non-null      object 
 5   Due Date          9435 non-null   object 
 6   Amount            9435 non-null   float64
 7   Open Balance      9435 non-null   float64
 8   Jurisdiction      0 non-null      object 
 9   Project Manager   0 non-null      object 
 10  A/R Paid          9435 non-null   object 
 11  Project Address   0 non-null      object 
 12  Permit No         0 non-null      object 
 13  Sales Rep         0 non-null      object 
dtypes: float64(3), object(11)
memory usage: 1.0+ MB

[CI] Fix docs jobs

The docs job fails on the "commit" step when no changes have happened to the docs. Make the script a bit smarter and exit early if there is nothing to be committed.

ExcelReader parses "NULL" as string values, instead of empty/null values

Hello, I'm following up on pola-rs/polars#14495 as you requested. After a bit more digging, I figured out where part of the issue is.

Attached are three short example files, containing:

  • TEST_FASTEXCEL_WITH_NULLS.xlsx
    • 6 columns with 10 records each and a header, each column formatted as the data type is supposed to represent (integer, date, timestamp, float), but including "NULL" as values, representing empty values.
  • TEST_FASTEXCEL_WITHOUT_NULLS.xlsx
    • Same as above, but removing "NULL" values and leaving those cells empty instead
  • TEST_FASTEXCEL_MIXED_DATA.xlsx
    • Same as the prior one, but introducing two additional columns with mixed data types (integer plus string, date plus string, no "NULL" string values this time)

As it stands, Fastexcel (via polars) is not able to infer mixed types in the first and third examples/files, but can and does load the second file whilst inferring the data types correctly. This indicates that cells containing "NULL" are read as strings, instead of empty values.

I hope this helps. The string fallback conversion would be good, but given that "NULL" values are commonplace, especially in the context of massive CSV created from SQL dumps, I think addressing this first would fix a lot of loading issues.

Thank you for working on this wrapper!
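For illustration, the pre-processing suggested above could be sketched like this (a hypothetical helper; where such a hook would live in fastexcel is an open question):

```python
def normalize_nulls(rows, null_strings=frozenset({"NULL"})):
    """Replace literal "NULL" strings with real None values before
    type inference runs, so mixed columns can be inferred cleanly."""
    return [
        [None if cell in null_strings else cell for cell in row]
        for row in rows
    ]

print(normalize_nulls([[1, "NULL"], ["NULL", 2.5]]))  # [[1, None], [None, 2.5]]
```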

Cannot handle special symbols

Hi team,
I am facing an issue reading an Excel sheet through polars. It says `calamine cell error: #VALUE!`.
The sheet of interest does not get read through the standard API.

How to handle such special symbols through fastexcel?

Thank you.

I am using fastexcel==0.10.2.

Empty sheet crash

When we load an empty sheet, fastexcel crashes. We may want a more graceful approach.
We could return an empty RecordBatch?

When the bug is fixed, you may also want to update the benchmark test file.

`column_names` are taken partially when `use_columns` does not include all columns

Given this worksheet data (without any empty area):

A B C
21 22 23
31 32 33
41 42 43

The following code:

import fastexcel as fe


file = r'<path to file>'
params = {'idx_or_name': 0, 'header_row': None, 'skip_rows': 1, 'use_columns': [1, 2], 'column_names': ['Col B', 'Col C']}

print(fe.read_excel(file).load_sheet(**params).to_polars())

Outputs:

shape: (3, 2)
┌───────┬──────────────┐
│ Col C ┆ __UNNAMED__2 │
│ ---   ┆ ---          │
│ f64   ┆ f64          │
╞═══════╪══════════════╡
│ 22.0  ┆ 23.0         │
│ 32.0  ┆ 33.0         │
│ 42.0  ┆ 43.0         │
└───────┴──────────────┘

Expected output:

shape: (3, 2)
┌───────┬───────┐
│ Col B ┆ Col C │
│ ---   ┆ ---   │
│ f64   ┆ f64   │
╞═══════╪═══════╡
│ 22.0  ┆ 23.0  │
│ 32.0  ┆ 33.0  │
│ 42.0  ┆ 43.0  │
└───────┴───────┘
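The expected mapping can be stated as a one-liner (illustration only, not fastexcel code): each entry of `column_names` should name the corresponding *selected* column, not the column at the same sheet index.

```python
def expected_names(use_columns, column_names):
    """Pair the i-th selected column index with the i-th provided name."""
    return dict(zip(use_columns, column_names))

print(expected_names([1, 2], ["Col B", "Col C"]))  # {1: 'Col B', 2: 'Col C'}
```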

improve README + doc

  • add getting started in README and doc
  • add badges
  • add CONTRIBUTING
  • add PULL REQUEST template

Straightforward but important for repo quality

Implement multiple header_rows

I find that people often have Excel sheets where they use multiple header rows to make up their column names.

Here's a snippet of how I deal with that, coming from a df generated with python_calamine:

header_merge_char = "_"  # for example
header_rows = 3  # again, for example
df.columns = [
    header_merge_char.join([y for y in x if y != ""])
    for x in zip(*[df.rows()[x] for x in range(header_rows)])
]

this is just a snippet and doesn't handle duplicate column names but that's a separate issue.

A more advanced version of this might infer header_rows automatically: skip down (say) 10 rows and look for types starting there, choose a column which isn't a string, go back to the true row 0, and see how many rows down it needs to go before it no longer sees strings. That count is the inferred header_rows.
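A self-contained variant of the merging snippet above, operating on plain rows rather than a dataframe (the example data is made up):

```python
def merge_header_rows(rows, header_rows, sep="_"):
    """Join the first `header_rows` rows cell-wise, skipping empty cells,
    to build merged column names."""
    headers = rows[:header_rows]
    return [
        sep.join(cell for cell in cells if cell != "")
        for cells in zip(*headers)
    ]

rows = [
    ["Region", "", ""],
    ["", "Sales", "Sales"],
    ["", "2022", "2023"],
]
print(merge_header_rows(rows, 3))  # ['Region', 'Sales_2022', 'Sales_2023']
```

Like the original snippet, this doesn't handle duplicate column names.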

Move Arrow Schema out of ExcelSheet struct

The schema should not be part of the ExcelSheet struct.
Currently it is needed only in the TryFrom<&ExcelSheet> for RecordBatch implementation, so we should be able to create it on demand.

Allow to disable skipping empty rows/columns at the beginning of the worksheet

There is a similar issue, which was closed without, as it appears, any fix (with v0.9.1 being current).

Expected behaviour is outlined in examples of python-calamine for an option skip_empty_area:

from python_calamine import CalamineWorkbook

workbook = CalamineWorkbook.from_path("file.xlsx").get_sheet_by_name("Sheet1").to_python(skip_empty_area=False)
# [
#   ["", "", "", "", "", "", ""],
#   ["1", "2", "3", "4", "5", "6", "7"],
#   ["1", "2", "3", "4", "5", "6", "7"],
#   ["1", "2", "3", "4", "5", "6", "7"],
# ]

This automatic behavior is kind of surprising when dealing with files with empty parts at the beginning, as the row calculations get confusing: for header_row you have to count rows as they appear (i.e. including empty rows), but for skip_rows you have to count as if there were no empty rows, which is clearly not the most user-friendly approach.

Please consider adding an option/parameter to disable the default behavior.
Thank you!

Support for Reading Excel Tables

In my experience it's usually much safer to load data from an Excel table than from a sheet. It would be nice if one could get the table names per sheet and get the table data as arrow/pandas, like with the sheets.

Provide `abi3` wheels

abi3 wheels would allow us to build for our lowest supported Python version only (currently 3.8), while still being compatible with higher versions. This would allow producing a single artifact per target platform/arch rather than a wheel per platform/arch/version: https://pyo3.rs/v0.14.5/building_and_distribution#py_limited_apiabi3

It is also what polars does currently

Pros:

  • Fewer artifacts to build
  • Faster CI (much smaller wheel build check matrix)

A potential downside could be a performance cost. To quote PyO3's docs:

The downside of this is that PyO3 can't use optimizations which rely on being compiled against a known exact Python version. It's up to you to decide whether this matters for your extension module.

We'd have to run a benchmark with both types of wheels in Python 3.12 to check the impact of this, but since our API is quite simple, we shouldn't see a difference
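For reference, a minimal sketch of the Cargo change (the version number is illustrative; `abi3-py38` is the PyO3 feature flag described in the linked docs):

```toml
# Cargo.toml -- opt in to the stable ABI, with CPython 3.8 as the floor.
[dependencies.pyo3]
version = "0.20"  # illustrative version
features = ["extension-module", "abi3-py38"]
```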
