Comments (5)
The solution that I have used with polars is by creating the type_map
for all 16384 columns as this is the limit of an excel file. Of course I am not creating it for the all columns but for my purpose with polars I have set it for the first 100 columns. This returns the correct dataframe. The extra columns in the type_map
are not created at all, so therefore you are able to create a type_map
without knowing the width of the worksheet in the first place.
from fastexcel.
I don't really understand.
schema_sample_rows
sets the number of rows used to infer the type of the columns.
If you set it to 0, we can't infer anything because no data is read hence the null
type being generated
from fastexcel.
I understand that the type cannot be inferred, but why would you replace all the values with null then?
I have a table filled with values, but I do not want to have any types inferred as I want to assign them by myself.
However, the columns are all filled with null values instead of the values that are present in the column.
I would be fine with the fact that the column dtype is null, but then all the values are kept as strings for example.
In that way I would be able to cast the columns by myself, to the values that I want them to be and prevent the extra overhead of inferring the dtype.
from fastexcel.
For example the following excel on the left with schema_sample_rows = 0 results in the table on the right after reading it with all dtypes being null.
Another example is the following excel on the left with schema_sample_rows = 1 results in the table on the right after reading it with the dtypes of the first three columns being string and the last column being of dtype null.
I do not see why this is wanted behavior to remove all data from the column if the dtype cannot be inferred.
Could you please elaborate why this is the default behavior and how to prevent the replacement in way that is not casting all 16384 columns with the dtypes parameter to type string.
from fastexcel.
I ran into this as well. I have very inconsistent excel files and the requirement is to load all data as strings. Since I don't know how many columns there are, I am currently doing a 3-step process: load the first row, build a dtype map based on the width of the row, then load the sheet using the map. The time hit for the double load is noticeable for larger worksheets. I am certainly open to better solutions. I have tried some combinations of Polars
and fastexcel
and always run into trouble... dropped columns/null data at one end and unsupported column type combination: {DateTime, String}
at the other.
I was also going to open a feature request to pass a single dtype
to apply to all columns.
ws = wb.load_sheet_by_name(name=sheet_name, header_row=None, n_rows=1, schema_sample_rows=0)
type_map = {i: "string" for i in range(ws.width + 1)}
ws = wb.load_sheet_by_name(name=sheet_name, header_row=None, dtypes=type_map)
from fastexcel.
Related Issues (20)
- feat: provide a `to_python` method
- refactor: support mixed indices and column names for dtypes and column selection
- When using fastexcel to read Excel files, empty rows are automatically ignored by default HOT 1
- Support for Reading Excel Tables HOT 6
- "use_columns" keyword argument not recognized by `load_sheet`. HOT 2
- Improve schema inference - wrongly assigns null dtype and misses reading valid data at end of large file HOT 5
- Allow to disable skipping empty rows/columns at the beginning of the worksheet HOT 2
- `column_names` are taken partially when `use_columns` does not include all columns HOT 4
- [docs] Provide a section about type inference
- Create conda package for install with `conda`/`mamba` HOT 3
- Cannot handle special symbols HOT 9
- load_sheet_by_name raises generic CalamineError instead of SheetNotFoundError HOT 1
- Add `typing-extensions` to dependencies HOT 2
- Option to retain special symbols/ errors in excel as a string HOT 1
- usecols - don't require full range
- header_row does not work as expected if there are blank rows HOT 5
- Xlsx Chart Sheet Crash HOT 4
- feat: support callable in use_columns parameter
- Add an option to disable automatic type coercion
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from fastexcel.