Comments (12)
Sounds good to me!
I've seen those others that you haven't done and they are indeed hard, I will probably have some questions when I try them.
from arrow-datafusion.
Hmm yeah, but it does sound better. Guess I'll have to try and test them out, maybe I'll stumble onto something else. Thanks for helping!
from arrow-datafusion.
Alright, I've never done the rebase thing, it does sound nicer than 2 separate branches, I guess I can try it out now. You'll see the PR for cases 1, 2 and 3 tomorrow.
from arrow-datafusion.
Can i help with this one?
from arrow-datafusion.
Sounds good to me -- once #10963 is merged, I'll mark off the ones it updates, then you can take the rest? I took the easy ones on the first go 😅
from arrow-datafusion.
I just saw @tshauck's PR was approved by 2 reviewers, so i think i can assume it will be merged in the future. I was looking at the remaining cases and wanted to brainstorm in case anyone wanted to give feedback on my thoughts before I try to implement them.
Currently we have following cases left:
- For LeftAnti Join, if the right side is empty, the Join result is left side(should exclude null ??).
- For RightAnti Join, if the left side is empty, the Join result is right side(should exclude null ??).
- For Full Join, only both sides are empty, the Join result is empty.
- For LeftOut/Full Join, if the right side is empty, the Join can be eliminated with a Projection with left side
columns + right side columns replaced with null values. - For RightOut/Full Join, if the left side is empty, the Join can be eliminated with a Projection with right side
columns + left side columns replaced with null values.
Case 3
I think this one is trivial. Like tshauck did, just return an empty relation.
Case 1 and 2
Anti Joins result in rows which don't have any matching join column values in the other table. So in the case when there are NULL
's present in the Left (Right) table during a Left (Right) Anti Join, these will be included in the result because any comparison with NULL
always returns False. DataFusion offers a way to say that NULL = NULL
will result in true via the null_equals_null field on the Join
node, so there are cases were these NULL
s are excluded. But when the other table is empty, these NULL
values can't match on anything, regardless of the null_equals_null
field. So I think in this case that NULL
s should not be excluded because of the default behavior.
Case 4 and 5
I understand why this should be done this way and agree with it. But I don't know how to do this in the Logical Optimizer.
For example:
Table A
ID | Name |
---|---|
1 | John |
2 | Lisa |
3 | Bob |
And Table B, which is empty
ID | Age |
---|
When we do A Left Outer Join B Using ID
we expect the result to look like this (of course, in the same way for Right Anti):
ID | Name | Id | Age |
---|---|---|---|
1 | John | NULL | NULL |
2 | Lisa | NULL | NULL |
3 | Bob | NULL | NULL |
How can I implement this nicely?
I thought of using Projection::new_from_schema
with the schema of the original JOIN, and let DataFusion insert those NULL
values when executing that Projection. But this can't be the correct way?
Thanks in advance to anyone who responds!
from arrow-datafusion.
For cases 4 and 5, this is kinda what you're getting at already, but there's also Projection::try_new_with_schema
... maybe you could get the projection from the left side then augment it with something like lit(ScalarValue::Null)
for the righthand columns. Not exactly nice either tho.
from arrow-datafusion.
You'll probably see otherwise, but #10963 merged. Please LMK if I can help on any of these.
from arrow-datafusion.
Yeah, I need some help with Cases 4 and 5. Creating a projection of the left or right child and extending its schema from the other table doesn't work. I first tried my suggestion, but that failed very quickly; then I tried yours, which got me closer.
I'm currently doing this:
// For LeftOut/Full Join, if the right side is empty, the Join can be eliminated with a Projection with left side
// columns + right side columns replaced with null values.
JoinType::Full | JoinType::Left if right_empty => {
dbg!("here");
Ok(Transformed::yes(LogicalPlan::Projection(
Projection::try_new_with_schema(
join.schema
.columns()
.into_iter()
.map(|col| {
// transform columns from other schema into NULLs
if join.right.schema().is_column_from_schema(&col)
{
Expr::Literal(ScalarValue::Null)
} else {
Expr::Column(col)
}
})
.collect(),
join.left.clone(),
join.schema.clone(),
)?,
)))
}
But when I test this out I get a SchemaError because they are different
(I did omit some fields, but they don't matter here)
Original: New:
DFSchema { | DFSchema {
inner: Schema { | inner: Schema {
fields: [ | fields: [
Field { | Field {
name: "a", | name: "a",
data_type: Int64, | data_type: Int64,
nullable: true, | nullable: true,
}, | },
Field { | Field {
name: "b", | name: "b",
data_type: Int64, | data_type: Int64,
nullable: true, | nullable: true,
}, | },
Field { | Field {
name: "c", | name: "c",
data_type: Int64, | data_type: Int64,
nullable: true, | nullable: true,
}, | },
Field { | Field {
name: "b", | name: "right.b",
data_type: Int64, | data_type: Null,
nullable: true, | nullable: true,
}, | },
Field { | Field {
name: "c", | name: "right.c",
data_type: Int64, | data_type: Null,
nullable: true, | nullable: true,
} | }
], | ],
metadata: {} | metadata: {}
}, | },
field_qualifiers: [ | field_qualifiers: [
Some(Bare { table: "left" }), | Some(Bare { table: "left" }),
Some(Bare { table: "left" }), | Some(Bare { table: "left" }),
Some(Bare { table: "left" }), | Some(Bare { table: "left" }),
Some(Bare { table: "right" }), | None,
Some(Bare { table: "right" }) | None
], | ],
} | }
I don't really know how to fix this. I can create a Draft PR if you want, so you can look at my code instead of this comment.
Other ways
I think that we have to ignore these cases, and if they do cause problems on larger tables, then we can optimize the join impl to handle empty join tables. I'm not sure though, I don't have much experience with this so I'm just looking at other possible solutions
Let me know what you think!
from arrow-datafusion.
Interesting... to your later point, you might consider opening a DRAFT PR. And if it's not too much, maybe do an initial one for case 3 then stack the draft of 4 and 5 on it? Case 3 is probably easy to get reviewed and merged, then we can chat about 4 and 5 potentially with additional eyes.
To the actual problem, the difference is obviously in the field qualifiers, so maybe there's a way to keep the right field qualifier from the column and/or update schema after it's done?
from arrow-datafusion.
Sorry, what do you mean by stack the draft for case 4 and 5 on top of the pr for 3?
from arrow-datafusion.
Like make a branch for case 3, open a PR for that, then branch off that one for the changes for cases 4 and 5. You could then rebase if case 3's PR got merged or if it makes more sense, you can merge the 4 and 5 branch into the case 3, then merge that into main... perhaps unnecessarily complicated, vs just two independent branches. Just thinking how to get the easy case merged while working through the more complex case.
from arrow-datafusion.
Related Issues (20)
- Convert `BoolAndOr` to UDAF HOT 1
- Add distinct_on to dataframe api HOT 1
- Implement min/max for interval types
- Pushdown filters that do not reference unested columns HOT 1
- Support named placeholders wherever numeric ones are allowed
- Support for `LargeString` and `LargeBinary` for `StringView` and `BinaryView` HOT 1
- Implement `LIKE` for StringView arrays HOT 1
- Implement `REGEXP_REPLACE` for StringView HOT 2
- Support `String/LargeString` and `Binary/LargeBinary` Parquet Data Page Statistics HOT 5
- Support `Boolean` Parquet Data Page Statistics HOT 8
- Implement SQLancer (a end-to-end SQL fuzz testing library) HOT 6
- Bugs in LCM/GCD scalar functions (found by SQLancer) HOT 1
- Improve `LIKE` performance for Dictionary arrays HOT 1
- to_timestamp functions should preserve timezone from inputs
- Int64 should be coercible to timestamp types
- Potential memory issue when using COPY with PARTITIONED BY HOT 18
- Make modulos with negative float zero compat with other engines HOT 8
- Verify if DISTINCT supports all types, incl binary, complex, etc HOT 4
- Handle overflow in GCD scalar function HOT 1
- Order of Interval Addition Should Affect Final Output HOT 6
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from arrow-datafusion.