Comments (6)
@Dandandan I am working on a fix for this issue. I will create a PR this week for your approval.
Currently, Join Predicates which are part of OR Expressions are not handling properly in planner.rs. I handle this scenario and ensure that Join Predicates are handled properly. With this fix, Q19 would run faster without any change in Query. Thanks
from arrow-datafusion.
@alamb I am wondering if you think it would be the solution to introduce this boolean optimization (Distributivity of and over or):
(p and q1) OR (p and q2) should be replaced by p AND (q1 or q2)
p in this example could be our join predicate.
After this optimization, the physical planner should be able to see the predicate in the filter.
from arrow-datafusion.
@Dandandan -- yes, I think the "classic" thing to do is a "predicate rewrite" pass that rearranges predicates for further downstream operations
The goal is basically to get the predicate into a form of
good_predicate1
AND good_predicate2
AND ...
Where good_predicate
means the predicate has special support in the execution engine.
Since OR is not typically handled specially, rewrites to AND are helpful
Rewrite 1: (p and q1) OR (p and q2) OR (p and ..) ==> p AND (q1 or q2)
Another common rewrite I have seen is
(col1 = A) OR (col1 = B) OR (col1 = C) ==> col1 IN (A, B, C)
Which then the execution engine can treat like a single column predicate (push down to scan) and build a hash table for (A, B, C)
and do fast filtering.
Shall I file an issue to track this kind of rewrtite?
from arrow-datafusion.
Cool, thanks @alamb !
I guess there are also some more opportunities for rewriting in the above example:
l_shipmode in (‘AIR’, ‘AIR REG’)
and and l_shipinstruct = ‘DELIVER IN PERSON
are repeated and could benefit from being simplified and then pushed down.
p_container in ( ‘SM CASE’, ‘SM BOX’, ‘SM PACK’, ‘SM PKG’)
, p_container in (‘MED BAG’, ‘MED BOX’, ‘MED PKG’, ‘MED PACK’)
,p_container in ( ‘LG CASE’, ‘LG BOX’, ‘LG PACK’, ‘LG PKG’)
could be adding a filter with all the items.
p_size between 1 and 15
, p_size between 1 and 10
, p_size between 1 and 5
could generate a filter with the least restrictive filter p_size between 1 and 15
Shall I file an issue to track this kind of rewrtite?
Would be nice to have some tracking issue 👍
from arrow-datafusion.
Filed #217 to track the idea
from arrow-datafusion.
Thanks @DhamoPS -- note that @xudong963 implemented #2858 which may have helped
from arrow-datafusion.
Related Issues (20)
- PhysicalOptimizer rule 'ProjectionPushdown' failed, due to generate a different schema HOT 1
- Attempt to subtract with overflow panic in `max_distinct_count`
- Optimizer schema change assert error has incorrect error
- Extra repartitions in physical plan and strange optimizer behavior overall HOT 5
- Document streaming usecase (like `UNBOUNDED` tables) HOT 13
- support `list_reverse` in array HOT 6
- Add a make_date function HOT 3
- instr function doesn't correctly account for non-ascii strings HOT 3
- `NthValueAgg` returns 1 physical expression, but needs 2 when being constructed with `create_aggregate_expr` HOT 1
- DataFusion weekly project plan (Andrew Lamb) - Jan 29, 2024 HOT 6
- ScalarUDF with zero arguments should be provided with one null array as parameter
- support list slicing with negative index HOT 2
- How to call a asynchronous function in a synchronous function? HOT 6
- strpos doesn't account for unicode grapheme characters HOT 4
- upper and lower functions only work correctly on ascii characters HOT 1
- Implement IGNORE NULLS for window functions HOT 8
- [EPIC] Support Exprs as equality join predicates (constants) rather than just Columns in physical joins HOT 3
- Remove custom doubling allocation strategy in HOT 1
- Internal error when negating strings HOT 5
- Add `FunctionRegistry::register_udaf` and `FunctionRegistry::register_udwf`
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from arrow-datafusion.