Comments (17)
The workarounds aren't that clear IMO. It's not like we'd recognize the pattern and suggest an alternative. However, fusing makes the solution of "spell things out bit by bit" impossible to write.
from ibis.
Thanks for the report @NickCrews -- we'll probably disable merging filters across selects to avoid this happening. In the interim (and we won't release 9.0 until we sort this out), you can probably use try_cast
to handle this without the failure mode.
from ibis.
The produced expression then the sql specific Select
operations looks correct to me. I understand that this data-dependent case works if the filters are not fused, but ibis expressions shouldn't have assumptions about the underlying data.
The minimal problem is the following:
t.filter(_.votes != "*").filter(_.votes.cast("int64") > 0)
gets reduced to
t.filter(_.votes != "*", _.votes.cast("int64") > 0)
and the latter fails due to the fallible cast operation.
By not fusing the operations these two will behave differently depending on the data which might be considered a downside as well. This is not something we can make foolproof since the user may use the compound predicate directly t.filter(_.votes != "*", _.votes.cast("int64") > 0)
.
The predicate which should be really used for this case is:
t.filter((_.votes != "*").ifelse(_.votes.cast("int64") > 0), False)
Or rather a new API allowing to chain ands more conveniently:
t.filter((_.votes != "*").and_then(_.votes.cast("int64") > 0)
from ibis.
I understand that this data-dependent case works if the filters are not fused, but ibis expressions shouldn't have assumptions about the underlying data.
Removing filters for this case removes assumptions, it doesn't add any. Are you saying something else?
from ibis.
By not fusing the operations these two will behave differently depending on the data which might be considered a downside as well.
This can't be addressed statically (which means definitely not in any tool that generates code without knowing what the data are), since it's data dependent. Not fusing doesn't make that particular issue worse or better, but fusing definitely makes it worse.
from ibis.
but fusing definitely makes it worse
I kinda disagree with this, I think fusing makes it just as bad (not better not worse [semantically]).
Additionally there are clear workarounds to either use .ifelse()
or .try_cast()
.
from ibis.
and the latter fails due to the fallible cast operation.
Correct, as it should. I don't think we need a new method here. An ibis .filter
call should semantically appear to the user to run as a subquery with a separate WHERE
clause. The backend is free to fuse steps as needed, but we need to ensure the semantically the results are the same.
So
t.filter(_.x != "*").filter(_.x.cast("int64") > 0)
should compile to something semantically identical to (all sql typos mine):
SELECT * FROM (
SELECT * FROM t WHERE x <> '*'
) WHERE CAST(x AS int64) > 0
Whereas
t.filter(_.x != "*", _.x.cast("int64") > 0)
should be semantically identical to:
SELECT * FROM t WHERE x <> '*" AND CAST(x as int64) > 0
These aren't semantically identical queries, and the latter may not work appropriately in this case since it's gating the CAST
operation on an AND
in the same filter (and backends provide no guarantee on AND
shortcircuiting).
We can fuse if we want to produce prettier sql, with the caveat that we don't want to change the semantics. In the case of an operation like CAST
which can error we can't fuse filters like this. Enumerating the cases where we can fuse properly seems tricky to do and probably not something we should introduce in 9.0 given all the recent bugs. I agree with @cpcloud here - we should just disable fusing to avoid this issue.
from ibis.
I would prefer catching that my query is fallible early.
Imagine that the nested query successfully executes then a new record gets inserted into the data source (not a parquet file, but a live database table) with _.x = "uncastable"
, the ibis query will fail the same way just later.
from ibis.
Just checked and this has been the behaviour for earlier ibis versions as well:
In [16]: t = ibis.read_parquet(
...: "https://github.com/NickCrews/election-data/releases/download/v2024-04-05_0/cleaned.parquet"
...: )
...: t = t.cache()
100% ▕████████████████████████████████████████████████████████████▏
In [17]: t.filter(_.votes != "*").filter(_.votes.cast("int64") > 0)
Out[17]:
r0 := DatabaseTable: ibis_cache_7s6eudf4vbfqzhs7dpm52nqafu
year int16
date date
state_po string
county_name string
county_fips string
jurisdiction_name string
jurisdiction_fips string
district string
office string
magnitude int64
special boolean
stage string
precinct string
writein boolean
candidate string
party_detailed string
mode string
votes string
readme_check boolean
county_fips2 string
jurisdiction_fips2 string
Selection[r0]
predicates:
r0.votes != '*'
Cast(r0.votes, to=int64) > 0
In [19]: print(t.filter(_.votes != "*").filter(_.votes.cast("int64") > 0).compile())
SELECT t0.year, t0.date, t0.state_po, t0.county_name, t0.county_fips, t0.jurisdiction_name, t0.jurisdiction_fips, t0.district, t0.office, t0.magnitude, t0.special, t0.stage, t0.precinct, t0.writein, t0.candidate, t0.party_detailed, t0.mode, t0.votes, t0.readme_check, t0.county_fips2, t0.jurisdiction_fips2
FROM ibis_cache_7s6eudf4vbfqzhs7dpm52nqafu AS t0
WHERE t0.votes != :param_1 AND CAST(t0.votes AS BIGINT) > CAST(:param_2 AS INTEGER)
from ibis.
I don't think that matters too much, given that this is incorrect behavior given what the user has written.
I'm still not understanding what the rationale is for keeping the merging given all the issues we have encountered so far.
from ibis.
- I still think that merging is correct. For example polars does the same thing (and I assume other systems as well):
In [1]: import polars as pl
...:
...: lf = pl.LazyFrame(
...: {
...: "foo": [1, 2, 3],
...: "bar": [6, 7, 8],
...: "ham": ["a", "b", "c"],
...: }
...: )
...: hamint = pl.col("ham").cast(pl.Int64)
...: expr = lf.filter(pl.col("foo") > 1).filter(hamint > 2)
...:
...: print(expr.explain())
DF ["foo", "bar", "ham"]; PROJECT */3 COLUMNS; SELECTION: "[([(col(\"ham\").strict_cast(Int64)) > (2)]) & ([(col(\"foo\")) > (1)])]"
- Even if we consider it as an incorrect functionality, it is not a regression.
- I tried disabling the merge in #9066 and there are new issues arising from it, especially with mssql, so we need to implement workaround rewrites for those cases.
from ibis.
_.x = "uncastable"
I'm not sure I understand, can you give a more concrete example?
from ibis.
Consider
def load():
...
t = t.distinct()
t = t.filter(t.votes != "*")
return t
# many lines
t = load()
t = t.filter(t.votes.cast(int) > 0)
- If you look at each part of this program locally, it appears correct. It only breaks if these two distant parts of the program interact in an unlucky way.
- If you switched the order of the .distinct() to be the last statement in load(), it would succeed, even though semantically that yields the same data out of load(). Correctness hinging upon an implementation detail like this smells bad.
from ibis.
If you switched the order of the .distinct() to be the last statement in load(), it would succeed
Wouldn't it only succeed if the votes column contains valid integer strings or "*"? What I am trying to say that it is entirely data dependent and t.votes.cast(int)
is fallible hence the query may or may not fail.
Do you have an example where the filters are not fallible end the merge alters the behaviour?
from ibis.
Wouldn't it only succeed if the votes column contains valid integer strings or "*"?
Yes, that is the scenario I am considering here, the data only contains valid ints or "*". In that scenario:
t.filter(_.votes != "*").distinct().filter(_.votes.cast(int) > 0)
: This SHOULD never error, and in the current implementation, it never errors.t.distinct().filter(_.votes != "*").filter(_.votes.cast(int) > 0)
: again, this SHOULD never error (per my above argument of "things that look locally correct should be globally correct" argument, and phillip's "spell things out bit by bit" argument), but in the current implementation, it sometimes errors.
from ibis.
things that look locally correct should be globally correct
The second part of your example looks fallible to me, so no surprise that I am getting a cast error since I cannot have any knowledge about what load does but I do know that t.votes
is a string column.
t = load()
t = t.filter(t.votes.cast(int) > 0)
Apparently this only appears to be an issue for duckdb. At least I am unable to reproduce the problem on other backends which suggest that our .cast()
behaviour if fairly inconsistent.
I would still like to have an example where the problem occurs without calling .cast()
.
from ibis.
Reposting my comment from #9065 (comment)
I took the weekend to think about some of the issues here and did a bit of research on the SQL standard specification of predicate short-circuiting.
- According to this SO Q&A the SQL standard does not specify the evaluation order of predicates. That means that an engine is free to reorder (or not) predicates in any way AFAICT, including if it would change the semantics of the query.
For example, an engine can evaluate a fallible expression first, even if a subsequent expression with short circuiting would avoid the fallible expression evaluation failing.
In practice, DuckDB merges a fallible cast with an infallible operation in the same way producing the exact same plan regardless of how the SQL is specified:
D explain select * from (select * from t where cast(x as int) != 0) where x <> '*';
┌─────────────────────────────┐
│┌───────────────────────────┐│
││ Physical Plan ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│ FILTER │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ((x != '*') AND (CAST(x AS│
│ INTEGER) != 0)) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 1 │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ SEQ_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ t │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ x │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 5 │
└───────────────────────────┘
D explain select * from (select * from t where x <> '*') where cast(x as int) != 0;
┌─────────────────────────────┐
│┌───────────────────────────┐│
││ Physical Plan ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│ FILTER │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ((x != '*') AND (CAST(x AS│
│ INTEGER) != 0)) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 1 │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ SEQ_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ t │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ x │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 5 │
└───────────────────────────┘
This means that even if Ibis decided to generate subqueries that map one-to-one to SQL selects it provides absolutely zero guarantee that generating the SQL in that way produces specific behavior.
IMO this means that Ibis should be free to merge predicates, assuming our implementation of that process is correct.
Now, I'm not trying to gloss over the fact that we have had a number of issues with select merging, and so I think we should provide a way to disable it if it can potentially help unblock users.
I think the opt-in behavior would be kind of pointless, since people will never opt-in without lots of documentation and if we're going to go that route we should disable it.
On the other hand, if we think this is long-term a thing we want enabled, then we definitely need feedback on it from users. The only way to get that feedback AFAICT is to enable this feature.
That does mean we have to accept that the current implementation may not be 100% finished and will elicit bug reports.
The opt-out behavior IMO seems like an acceptable balance here: people can disable select merging on an as-needed basis, and we can continuously evaluate whether keeping it around is causing overhead that isn't worth it.
Disabling select merging it turns out is also not without issue: SQLite fails due to a query we generate being too large for the parser stack, and MS SQL loses the ability to specify order by in many cases as well as LIKE
functionality in certain positions.
from ibis.
Related Issues (20)
- bug: april fools day ibish import failing on mac osx m1 HOT 5
- feat(api): Add `order_by` parameter to the `collect()` method HOT 3
- Postgres error: DateDelta HOT 2
- feat(pyspark): support udaf
- bug HOT 1
- bug: Oracle Table alias HOT 8
- bug: Installation issue: mamba on Windows 10 HOT 6
- feat: Table.to_records()
- feat: `backend.db_params`
- `collect` does not respect order HOT 2
- bug Error when connecting to Trino after upgrading to 9.0.0
- bug: table formatting characters don't render monospace with some fonts HOT 5
- bug: `read_parquet` and similar methods silently overwrite tables HOT 2
- bug: names_sort argument in table.pivot_wider has no effect HOT 1
- feat: add a method for table existence check HOT 5
- bug: `create_table(temp=True)` timing out due to slow table existence check HOT 2
- add support for TIMESTAMPTZ HOT 4
- feat: add a table_exists(table_name) api HOT 1
- bug: `to_sql` always shows DuckDB SQL for a memtable even if there's a default backend set HOT 1
- Polars backend can read only 1 csv HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from ibis.