Git Product home page Git Product logo

Comments (17)

cpcloud avatar cpcloud commented on May 25, 2024 1

The workarounds aren't that clear IMO. It's not like we'd recognize the pattern and suggest an alternative. However, fusing makes the solution of "spell things out bit by bit" impossible to write.

from ibis.

gforsyth avatar gforsyth commented on May 25, 2024

Thanks for the report @NickCrews -- we'll probably disable merging filters across selects to avoid this happening. In the interim (and we won't release 9.0 until we sort this out), you can probably use try_cast to handle this without the failure mode.

from ibis.

kszucs avatar kszucs commented on May 25, 2024

The produced expression then the sql specific Select operations looks correct to me. I understand that this data-dependent case works if the filters are not fused, but ibis expressions shouldn't have assumptions about the underlying data.

The minimal problem is the following:

t.filter(_.votes != "*").filter(_.votes.cast("int64") > 0)

gets reduced to

t.filter(_.votes != "*", _.votes.cast("int64") > 0)

and the latter fails due to the fallible cast operation.

By not fusing the operations these two will behave differently depending on the data which might be considered a downside as well. This is not something we can make foolproof since the user may use the compound predicate directly t.filter(_.votes != "*", _.votes.cast("int64") > 0).

The predicate which should be really used for this case is:

t.filter((_.votes != "*").ifelse(_.votes.cast("int64") > 0), False)

Or rather a new API allowing to chain ands more conveniently:

t.filter((_.votes != "*").and_then(_.votes.cast("int64") > 0)

from ibis.

cpcloud avatar cpcloud commented on May 25, 2024

I understand that this data-dependent case works if the filters are not fused, but ibis expressions shouldn't have assumptions about the underlying data.

Removing filters for this case removes assumptions, it doesn't add any. Are you saying something else?

from ibis.

cpcloud avatar cpcloud commented on May 25, 2024

By not fusing the operations these two will behave differently depending on the data which might be considered a downside as well.

This can't be addressed statically (which means definitely not in any tool that generates code without knowing what the data are), since it's data dependent. Not fusing doesn't make that particular issue worse or better, but fusing definitely makes it worse.

from ibis.

kszucs avatar kszucs commented on May 25, 2024

but fusing definitely makes it worse

I kinda disagree with this, I think fusing makes it just as bad (not better not worse [semantically]).
Additionally there are clear workarounds to either use .ifelse() or .try_cast().

from ibis.

jcrist avatar jcrist commented on May 25, 2024

and the latter fails due to the fallible cast operation.

Correct, as it should. I don't think we need a new method here. An ibis .filter call should semantically appear to the user to run as a subquery with a separate WHERE clause. The backend is free to fuse steps as needed, but we need to ensure the semantically the results are the same.

So

t.filter(_.x != "*").filter(_.x.cast("int64") > 0)

should compile to something semantically identical to (all sql typos mine):

SELECT * FROM (
  SELECT * FROM t WHERE x <> '*'
) WHERE CAST(x AS int64) > 0

Whereas

t.filter(_.x != "*", _.x.cast("int64") > 0)

should be semantically identical to:

SELECT * FROM t WHERE x <> '*" AND CAST(x as int64) > 0

These aren't semantically identical queries, and the latter may not work appropriately in this case since it's gating the CAST operation on an AND in the same filter (and backends provide no guarantee on AND shortcircuiting).

We can fuse if we want to produce prettier sql, with the caveat that we don't want to change the semantics. In the case of an operation like CAST which can error we can't fuse filters like this. Enumerating the cases where we can fuse properly seems tricky to do and probably not something we should introduce in 9.0 given all the recent bugs. I agree with @cpcloud here - we should just disable fusing to avoid this issue.

from ibis.

kszucs avatar kszucs commented on May 25, 2024

I would prefer catching that my query is fallible early.

Imagine that the nested query successfully executes then a new record gets inserted into the data source (not a parquet file, but a live database table) with _.x = "uncastable", the ibis query will fail the same way just later.

from ibis.

kszucs avatar kszucs commented on May 25, 2024

Just checked and this has been the behaviour for earlier ibis versions as well:

In [16]: t = ibis.read_parquet(
    ...:     "https://github.com/NickCrews/election-data/releases/download/v2024-04-05_0/cleaned.parquet"
    ...: )
    ...: t = t.cache()
100% ▕████████████████████████████████████████████████████████████▏

In [17]: t.filter(_.votes != "*").filter(_.votes.cast("int64") > 0)
Out[17]:
r0 := DatabaseTable: ibis_cache_7s6eudf4vbfqzhs7dpm52nqafu
  year               int16
  date               date
  state_po           string
  county_name        string
  county_fips        string
  jurisdiction_name  string
  jurisdiction_fips  string
  district           string
  office             string
  magnitude          int64
  special            boolean
  stage              string
  precinct           string
  writein            boolean
  candidate          string
  party_detailed     string
  mode               string
  votes              string
  readme_check       boolean
  county_fips2       string
  jurisdiction_fips2 string

Selection[r0]
  predicates:
    r0.votes != '*'
    Cast(r0.votes, to=int64) > 0

In [19]: print(t.filter(_.votes != "*").filter(_.votes.cast("int64") > 0).compile())
SELECT t0.year, t0.date, t0.state_po, t0.county_name, t0.county_fips, t0.jurisdiction_name, t0.jurisdiction_fips, t0.district, t0.office, t0.magnitude, t0.special, t0.stage, t0.precinct, t0.writein, t0.candidate, t0.party_detailed, t0.mode, t0.votes, t0.readme_check, t0.county_fips2, t0.jurisdiction_fips2
FROM ibis_cache_7s6eudf4vbfqzhs7dpm52nqafu AS t0
WHERE t0.votes != :param_1 AND CAST(t0.votes AS BIGINT) > CAST(:param_2 AS INTEGER)

from ibis.

cpcloud avatar cpcloud commented on May 25, 2024

I don't think that matters too much, given that this is incorrect behavior given what the user has written.

I'm still not understanding what the rationale is for keeping the merging given all the issues we have encountered so far.

from ibis.

kszucs avatar kszucs commented on May 25, 2024
  1. I still think that merging is correct. For example polars does the same thing (and I assume other systems as well):
In [1]: import polars as pl
   ...:
   ...: lf = pl.LazyFrame(
   ...:     {
   ...:         "foo": [1, 2, 3],
   ...:         "bar": [6, 7, 8],
   ...:         "ham": ["a", "b", "c"],
   ...:     }
   ...: )
   ...: hamint = pl.col("ham").cast(pl.Int64)
   ...: expr = lf.filter(pl.col("foo") > 1).filter(hamint > 2)
   ...:
   ...: print(expr.explain())
DF ["foo", "bar", "ham"]; PROJECT */3 COLUMNS; SELECTION: "[([(col(\"ham\").strict_cast(Int64)) > (2)]) & ([(col(\"foo\")) > (1)])]"
  1. Even if we consider it as an incorrect functionality, it is not a regression.
  2. I tried disabling the merge in #9066 and there are new issues arising from it, especially with mssql, so we need to implement workaround rewrites for those cases.

from ibis.

NickCrews avatar NickCrews commented on May 25, 2024

_.x = "uncastable"

I'm not sure I understand, can you give a more concrete example?

from ibis.

NickCrews avatar NickCrews commented on May 25, 2024

Consider

def load():
    ...
    t = t.distinct()
    t = t.filter(t.votes != "*")
    return t

# many lines

t = load()
t = t.filter(t.votes.cast(int) > 0)
  1. If you look at each part of this program locally, it appears correct. It only breaks if these two distant parts of the program interact in an unlucky way.
  2. If you switched the order of the .distinct() to be the last statement in load(), it would succeed, even though semantically that yields the same data out of load(). Correctness hinging upon an implementation detail like this smells bad.

from ibis.

kszucs avatar kszucs commented on May 25, 2024

If you switched the order of the .distinct() to be the last statement in load(), it would succeed

Wouldn't it only succeed if the votes column contains valid integer strings or "*"? What I am trying to say that it is entirely data dependent and t.votes.cast(int) is fallible hence the query may or may not fail.

Do you have an example where the filters are not fallible end the merge alters the behaviour?

from ibis.

NickCrews avatar NickCrews commented on May 25, 2024

Wouldn't it only succeed if the votes column contains valid integer strings or "*"?

Yes, that is the scenario I am considering here, the data only contains valid ints or "*". In that scenario:

  • t.filter(_.votes != "*").distinct().filter(_.votes.cast(int) > 0): This SHOULD never error, and in the current implementation, it never errors.
  • t.distinct().filter(_.votes != "*").filter(_.votes.cast(int) > 0): again, this SHOULD never error (per my above argument of "things that look locally correct should be globally correct" argument, and phillip's "spell things out bit by bit" argument), but in the current implementation, it sometimes errors.

from ibis.

kszucs avatar kszucs commented on May 25, 2024

things that look locally correct should be globally correct

The second part of your example looks fallible to me, so no surprise that I am getting a cast error since I cannot have any knowledge about what load does but I do know that t.votes is a string column.

t = load()
t = t.filter(t.votes.cast(int) > 0)

Apparently this only appears to be an issue for duckdb. At least I am unable to reproduce the problem on other backends which suggest that our .cast() behaviour if fairly inconsistent.

I would still like to have an example where the problem occurs without calling .cast().

from ibis.

cpcloud avatar cpcloud commented on May 25, 2024

Reposting my comment from #9065 (comment)

I took the weekend to think about some of the issues here and did a bit of research on the SQL standard specification of predicate short-circuiting.

  • According to this SO Q&A the SQL standard does not specify the evaluation order of predicates. That means that an engine is free to reorder (or not) predicates in any way AFAICT, including if it would change the semantics of the query.

For example, an engine can evaluate a fallible expression first, even if a subsequent expression with short circuiting would avoid the fallible expression evaluation failing.

In practice, DuckDB merges a fallible cast with an infallible operation in the same way producing the exact same plan regardless of how the SQL is specified:

D explain select * from (select * from t where cast(x as int) != 0) where x <> '*';

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│           FILTER          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│ ((x != '*') AND (CAST(x AS│
│       INTEGER) != 0))     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           EC: 1           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         SEQ_SCAN          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             t             │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             x             │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           EC: 5           │
└───────────────────────────┘
D explain select * from (select * from t where x <> '*') where cast(x as int) != 0;

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│           FILTER          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│ ((x != '*') AND (CAST(x AS│
│       INTEGER) != 0))     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           EC: 1           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         SEQ_SCAN          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             t             │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             x             │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           EC: 5           │
└───────────────────────────┘

This means that even if Ibis decided to generate subqueries that map one-to-one to SQL selects it provides absolutely zero guarantee that generating the SQL in that way produces specific behavior.

IMO this means that Ibis should be free to merge predicates, assuming our implementation of that process is correct.

Now, I'm not trying to gloss over the fact that we have had a number of issues with select merging, and so I think we should provide a way to disable it if it can potentially help unblock users.

I think the opt-in behavior would be kind of pointless, since people will never opt-in without lots of documentation and if we're going to go that route we should disable it.

On the other hand, if we think this is long-term a thing we want enabled, then we definitely need feedback on it from users. The only way to get that feedback AFAICT is to enable this feature.

That does mean we have to accept that the current implementation may not be 100% finished and will elicit bug reports.

The opt-out behavior IMO seems like an acceptable balance here: people can disable select merging on an as-needed basis, and we can continuously evaluate whether keeping it around is causing overhead that isn't worth it.

Disabling select merging it turns out is also not without issue: SQLite fails due to a query we generate being too large for the parser stack, and MS SQL loses the ability to specify order by in many cases as well as LIKE functionality in certain positions.

from ibis.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.