Comments (5)
Regarding `drop` and `relocate`, they're both implemented using a somewhat naive approach:
- `drop`: project every column except the input columns
- `relocate`: basically a reprojection that loops through every column
I suspect that `drop` can be turned into a special operation, so that it at least scales with the number of dropped columns instead of the total number of columns. A place to start might be:
- create a special `Drop` relational operation
- implement the column drop in select merging by looping over the drop list and removing each element of the drop list from the `Select` fields. `Select` fields are dicts, so that should get us to `O(droplist)` instead of `O(allcolumns)`.
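To make the complexity argument concrete, here is a minimal sketch in plain Python. It is not Ibis code: the `fields` dict is a hypothetical stand-in for the fields of a `Select` node (column name mapped to an expression, here just strings), and both function names are made up for illustration.

```python
# Hypothetical stand-in for Select fields: column name -> expression.
# Plain strings are used instead of real Ibis operations.


def drop_by_projection(fields: dict, to_drop: list) -> dict:
    """Naive approach: rebuild the mapping by visiting every column."""
    dropped = set(to_drop)
    return {name: expr for name, expr in fields.items() if name not in dropped}


def drop_from_fields(fields: dict, to_drop: list) -> dict:
    """Drop-list approach: one dict deletion per dropped column."""
    for name in to_drop:
        del fields[name]  # raises KeyError for an unknown column
    return fields


fields = {f"a{i}": f"t.a{i}" for i in range(1000)}
naive = drop_by_projection(fields, ["a8"])
# The copy below exists only so both results can be compared in this demo.
fast = drop_from_fields(dict(fields), ["a8"])
```

Both calls produce the same columns in the same order, but the second touches only the dropped names. Whether the real fields can be mutated or shared without an `O(allcolumns)` copy is an open question this sketch does not answer.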
We may be able to take a similar approach for `relocate` by using a data structure more optimized for the operation. I think something that has fast ordered inserts (perhaps `sortedcontainers` has a sorted map container?) might be a good place to start.
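As one possible illustration of "fast ordered inserts", the sketch below keeps the column order in a doubly linked list stored in two dicts, so moving one column after another is constant time. This is a swapped-in technique for illustration, not `sortedcontainers` and not anything Ibis does today; the `ColumnOrder` class and its methods are hypothetical.

```python
class ColumnOrder:
    """Hypothetical column ordering backed by a doubly linked list."""

    def __init__(self, names):
        self._next = {}
        self._prev = {}
        self._head = None
        last = None
        for name in names:
            self._prev[name] = last
            self._next[name] = None
            if last is None:
                self._head = name
            else:
                self._next[last] = name
            last = name

    def _unlink(self, name):
        # Detach `name` by pointing its neighbours at each other.
        prev, nxt = self._prev[name], self._next[name]
        if prev is None:
            self._head = nxt
        else:
            self._next[prev] = nxt
        if nxt is not None:
            self._prev[nxt] = prev

    def relocate_after(self, name, after):
        """Move `name` so that it directly follows `after`, in O(1)."""
        if name == after:
            raise ValueError("cannot relocate a column after itself")
        self._unlink(name)
        nxt = self._next[after]
        self._prev[name] = after
        self._next[name] = nxt
        self._next[after] = name
        if nxt is not None:
            self._prev[nxt] = name

    def __iter__(self):
        cur = self._head
        while cur is not None:
            yield cur
            cur = self._next[cur]


order = ColumnOrder(["id", "valid_to", "x", "valid_from", "y"])
order.relocate_after("valid_to", after="valid_from")
new_order = list(order)
```

Reading the final order back out still visits every column, so a structure like this would only pay off when several relocations are applied before the projection is materialized.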
from ibis.
I've benchmarked Ibis 8 vs. 9 on 55 Ibis expressions which are part of a Data Warehouse built using https://github.com/binste/dbt-ibis. I've measured:
- Execution of Ibis code: The time it takes to execute the Ibis code to get to the final expression
- Convert Ibis expression to SQL: The time it takes to compile the final Ibis expression to SQL
Here is some pseudocode to illustrate:
import ibis
t = ibis.table(...)
# --------------------------
# START: Execution of Ibis code
# --------------------------
t = t.group_by(...).select(...)
# --------------------------
# END: Execution of Ibis code
# --------------------------
# --------------------------
# START: Convert Ibis expression to SQL
# --------------------------
ibis.to_sql(t)
# --------------------------
# END: Convert Ibis expression to SQL
# --------------------------
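The two phases above could be timed separately with a small helper along these lines. This is only a sketch: `time_phases` is a hypothetical name, and the callables passed in the demo are stand-ins rather than real Ibis code.

```python
import time


def time_phases(build, compile_):
    """Time expression construction and SQL compilation separately.

    `build` takes no arguments and returns the final expression;
    `compile_` takes that expression and returns the SQL string.
    """
    t0 = time.perf_counter()
    expr = build()
    t1 = time.perf_counter()
    sql = compile_(expr)
    t2 = time.perf_counter()
    return {
        "build_seconds": t1 - t0,
        "compile_seconds": t2 - t1,
        "sql": sql,
    }


# Stand-in callables so the sketch runs without Ibis installed.
timings = time_phases(
    build=lambda: ("t", "group_by", "select"),
    compile_=lambda expr: " ".join(expr),
)
```

In the real benchmark, `build` would wrap the code that constructs the expression and `compile_` would be `ibis.to_sql`.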
The great news is that compilation to SQL got significantly faster with the move to SQLGlot, which is super nice! :) The execution of Ibis code, on the other hand, got a bit slower overall, with one expression now taking significantly longer at 11 seconds. I've profiled that expression, and most of the time is spent in the following statements:
- Table.relocate ibis/expr/types/relations.py:4279
  - Code here looks like this: .relocate("valid_to", after="valid_from")
- Join.wrapper ibis/expr/types/joins.py:188
- Table.drop ibis/expr/types/relations.py:2329
- Table.rename ibis/expr/types/relations.py:2140
Hope this helps!
from ibis.
@binste Thanks for digging into this. Interesting results!
Can you share the query that's now taking 11 seconds with Ibis 9.0?
from ibis.
Unfortunately, I can't, as I'd need to mask column names and code logic for IP reasons. It involves two input tables with around 50 columns each and one table with ~10 columns, plus various operations on top of them. But I'm happy to test out any PRs if there is a wheel file available!
from ibis.
Naive benchmark here, but for a quick test:
# drop_test.py
import ibis
import time
from contextlib import contextmanager


@contextmanager
def tictoc(num_cols):
    tic = time.time()
    yield
    toc = time.time()
    print(f"| {num_cols} | {toc - tic} seconds")


print(f"{ibis.__version__=}")
for num_cols in [10, 20, 50, 100, 200, 500, 1000]:
    t = ibis.table(name="t", schema=[(f"a{i}", "int") for i in range(num_cols)])
    with tictoc(num_cols):
        t.drop("a8")
🐚 python drop_test.py
ibis.__version__='8.0.0'
| 10 | 0.18016910552978516 seconds
| 20 | 0.0016529560089111328 seconds
| 50 | 0.0037081241607666016 seconds
| 100 | 0.007429361343383789 seconds
| 200 | 0.013902902603149414 seconds
| 500 | 0.03390693664550781 seconds
| 1000 | 0.06650948524475098 seconds
🐚 python drop_test.py
ibis.__version__='9.0.0'
| 10 | 0.002298593521118164 seconds
| 20 | 0.005956888198852539 seconds
| 50 | 0.027918338775634766 seconds
| 100 | 0.09690213203430176 seconds
| 200 | 0.36721324920654297 seconds
| 500 | 2.301510810852051 seconds
| 1000 | 9.317416191101074 seconds
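One way to read these numbers is to estimate how the time grows with the column count. The sketch below takes the timings reported above for 100 columns and up and computes the log-log slope between consecutive measurements; a slope near 1 suggests linear scaling and a slope near 2 suggests quadratic scaling. It is a rough single-run estimate, not a rigorous benchmark, and `loglog_slopes` is a hypothetical helper.

```python
import math

# Timings copied from the two runs above (seconds), for 100+ columns.
num_cols = [100, 200, 500, 1000]
seconds_8 = [
    0.007429361343383789,
    0.013902902603149414,
    0.03390693664550781,
    0.06650948524475098,
]
seconds_9 = [
    0.09690213203430176,
    0.36721324920654297,
    2.301510810852051,
    9.317416191101074,
]


def loglog_slopes(xs, ys):
    """Slope of log(y) against log(x) between consecutive points."""
    points = list(zip(xs, ys))
    return [
        math.log(y1 / y0) / math.log(x1 / x0)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    ]


slopes_8 = loglog_slopes(num_cols, seconds_8)
slopes_9 = loglog_slopes(num_cols, seconds_9)
print("8.0.0 slopes:", [round(s, 2) for s in slopes_8])
print("9.0.0 slopes:", [round(s, 2) for s in slopes_9])
```

On this data the 8.0.0 slopes come out roughly at 1 and the 9.0.0 slopes roughly at 2, which is consistent with `drop` going from about linear to about quadratic in the number of columns.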
from ibis.