
Comments (12)

Omega359 commented on July 18, 2024

The one issue with moving this functionality into a spark module is that, for that to really be valid, the formats would have to be Spark compatible, which they currently are not. I do not have the spare time in the near future to implement a parser to do that.


alamb commented on July 18, 2024

I think options 1 and 3 would be straightforward.

You could even potentially implement

```rust
pub fn to_timestamp_safe(args: Vec<Expr>) -> Expr {
    ...
}
```

directly in your application (rather than in the core of datafusion).
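A minimal sketch of what that wrapper could look like, assuming "safe" just means "return NULL instead of erroring" and using `try_cast` (which yields NULL for values that fail to cast) in place of the format-aware parsing that `to_timestamp` does:

```rust
use datafusion::arrow::datatypes::{DataType, TimeUnit};
use datafusion::prelude::*;

// Hedged sketch only: this ignores any Chrono format-string arguments that
// to_timestamp accepts, and assumes at least one argument is passed.
pub fn to_timestamp_safe(mut args: Vec<Expr>) -> Expr {
    let value = args.remove(0);
    // try_cast returns NULL instead of an error when the cast fails
    try_cast(value, DataType::Timestamp(TimeUnit::Nanosecond, None))
}
```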

Another crazy thought might be to implement a rewrite pass (e.g. an AnalyzerRule) that rewrites all expressions in the plan when safe mode is needed... I think those rules have access to all the necessary state.
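Roughly, such a rule could look like the following (a sketch only: the exact TreeNode API differs across datafusion versions, and the "safe" replacement UDF is hypothetical):

```rust
use datafusion::common::config::ConfigOptions;
use datafusion::common::tree_node::{Transformed, TreeNode};
use datafusion::common::Result;
use datafusion::logical_expr::LogicalPlan;
use datafusion::optimizer::AnalyzerRule;

#[derive(Debug)]
struct SafeModeRewrite;

impl AnalyzerRule for SafeModeRewrite {
    fn name(&self) -> &str {
        "safe_mode_rewrite"
    }

    fn analyze(&self, plan: LogicalPlan, _config: &ConfigOptions) -> Result<LogicalPlan> {
        // Visit every plan node; a real rule would inspect each node's
        // expressions and swap to_timestamp calls for a registered "safe"
        // variant (hypothetical) before returning Transformed::yes.
        plan.transform(|node| Ok(Transformed::no(node)))
            .map(|t| t.data)
    }
}
```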


alamb commented on July 18, 2024

I think the key thing to figure out is "will safemode to_timestamp be part of the datafusion core"?

Maybe it is time to make a datafusion-functions-scalar-spark or something that has functions with the Spark behaviors 🤔


jayzhan211 commented on July 18, 2024

I think it is possible to extend safe_mode in the datafusion core like you mentioned; it should be similar to the distinct mode for aggregate functions. We can have different behavior based on whether safe_mode is set.

For the expression API, we can:

  1. Introduce to_timestamp_safe()
  2. Extend to_timestamp(args, safe: bool)
  3. Introduce a builder mode to avoid a breaking change: to_timestamp(args).safe().build()

I prefer the third one.

Also, there are many to_timestamp functions with different time units. I'm not sure why they are split into different functions; they could be collapsed into a single to_timestamp function that casts to different time units based on the given argument.

We can have to_timestamp(args).time_unit(Milli).safe().build() if we need the timestamp in milliseconds, and to_timestamp(args).safe().build() for nanoseconds.
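As a purely hypothetical sketch of that builder shape (none of these types exist in datafusion today):

```rust
use datafusion::arrow::datatypes::TimeUnit;
use datafusion::prelude::Expr;

// Hypothetical builder: to_timestamp(args) no longer returns an Expr
// directly; build() does the dispatch.
struct ToTimestampBuilder {
    args: Vec<Expr>,
    time_unit: TimeUnit,
    safe: bool,
}

fn to_timestamp(args: Vec<Expr>) -> ToTimestampBuilder {
    ToTimestampBuilder {
        args,
        time_unit: TimeUnit::Nanosecond, // matches today's default
        safe: false,                     // matches today's fail-fast behavior
    }
}

impl ToTimestampBuilder {
    fn time_unit(mut self, unit: TimeUnit) -> Self {
        self.time_unit = unit;
        self
    }

    fn safe(mut self) -> Self {
        self.safe = true;
        self
    }

    fn build(self) -> Expr {
        // Dispatch on (self.time_unit, self.safe) to the matching UDF,
        // e.g. to_timestamp_millis vs. a safe wrapper (left abstract).
        todo!("dispatch to the matching to_timestamp variant")
    }
}
```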

"will safemode to_timestamp be part of the datafusion core"

It is an interesting question; we can think of implementing functions based on other DBs in the first place.

For example, we usually follow postgres, duckdb, and others.

We can have functions-postgres, functions-duckdb, and functions-spark crates that aim to mirror the behavior of other DBs, and we wouldn't even need the functions crate (we can keep it for backward compatibility). They would be considered extension crates implemented by third parties; datafusion would not need to implement any datafusion-specific functions (and we would prefer not to), we just make sure the datafusion core is compatible with the different functions crates. And we can register the most-used functions with datafusion for the end-to-end SQL workflow!


jayzhan211 commented on July 18, 2024

> We can have functions-postgres, functions-duckdb and functions-spark

Most functions have the same behavior across different DBs, so we could also implement the different variants in a single functions crate but register the expected functions based on the DB we want to mimic. Instead of having a default function set for the datafusion SQL workflow, we can have spark_functions and postgres_functions that register different to_timestamp functions based on the configuration we set.
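A small sketch of that registration idea; the dialect crates are hypothetical, but SessionContext::register_udf is the extension point that exists today:

```rust
use datafusion::execution::context::SessionContext;
use datafusion::logical_expr::ScalarUDF;

// Hypothetical: each dialect crate would export something like
// spark_functions() -> Vec<ScalarUDF> containing its flavor of to_timestamp.
fn register_dialect(ctx: &SessionContext, funcs: Vec<ScalarUDF>) {
    for f in funcs {
        // Registration is by name, so a dialect set replaces any
        // previously registered function with the same name.
        ctx.register_udf(f);
    }
}
```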


Omega359 commented on July 18, 2024

> We can have functions-postgres, functions-duckdb and functions-spark
>
> Most functions have the same behavior across different DBs, so we could also implement the different variants in a single functions crate but register the expected functions based on the DB we want to mimic. Instead of having a default function set for the datafusion SQL workflow, we can have spark_functions and postgres_functions that register different to_timestamp functions based on the configuration we set.

@andygrove are there UDFs already in the comet project that handle Spark-specific behaviour? If so, is that a separate project or embedded in comet currently? (I haven't looked at that codebase myself since the initial drop.)


andygrove commented on July 18, 2024

@Omega359 so far we have been implementing custom PhysicalExpr directly in the datafusion-comet project as needed for Spark-specific behavior, with support for Spark's different evaluation modes (legacy, try, ansi), and we are using fuzz testing to ensure compatibility across multiple Spark versions.
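For context, the three modes behave roughly as follows (illustrative naming, not Comet's exact API):

```rust
// Spark's evaluation modes for expressions like casts and arithmetic:
enum EvalMode {
    Legacy, // pre-ANSI semantics: silently return NULL or truncate
    Ansi,   // ANSI SQL semantics: raise an error on invalid input/overflow
    Try,    // ANSI semantics, but return NULL instead of raising the error
}
```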

I think we need to have the discussion of whether it makes sense to upstream these into the core datafusion project, whether we publish a datafusion-spark-compat crate from Comet, or some other option.


andygrove commented on July 18, 2024

> The one issue with moving this functionality into a spark module is that, for that to really be valid, the formats would have to be Spark compatible, which they currently are not. I do not have the spare time in the near future to implement a parser to do that.

We are porting Spark parsing logic as part of Comet.


Omega359 commented on July 18, 2024

> I think we need to have the discussion of whether it makes sense to upstream these into the core datafusion project, whether we publish a datafusion-spark-compat crate from Comet, or some other option.

Thank you for chiming in. While I wouldn't mind Spark compatibility, it really isn't the focus of this request, as I've already converted all the Spark expressions and function usages to DF-compatible ones. It's the general system behaviour that I would like to address: being able to essentially switch from a db-focused perspective (fail fast) to a processing-engine one (nominally lenient - return null) for some (or all) of the UDFs.

If the general consensus is to separate out this desired behaviour, then I would think a separate crate might be the best approach. However, from searching the issues here, there seems to have been some talk in the past of how to handle mirroring the behaviour of other databases; that also includes SQL syntax, so it's not quite as simple as having a db-specific crate full of UDFs and calling it a day.


andygrove commented on July 18, 2024

I have read the context now and understand that this is about safe mode or what Spark calls ANSI mode.

Isn't this just a case of adding a new flag to the session context that UDFs can choose to use when deciding whether to return null or throw an error?
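Registering such a flag is possible today via datafusion's extension-options mechanism; a sketch, assuming a direct dependency on datafusion_common (the catch, noted in the next comment, is that UDFs currently have no way to read it):

```rust
use datafusion_common::config::ConfigExtension;
use datafusion_common::extensions_options;

extensions_options! {
    /// Hypothetical engine-behavior options attached to the session config
    pub struct EngineModeOptions {
        /// When true, functions return NULL instead of raising errors
        pub safe_mode: bool, default = false
    }
}

impl ConfigExtension for EngineModeOptions {
    const PREFIX: &'static str = "engine";
}
```

A session would then opt in with something like `SessionConfig::default().with_option_extension(EngineModeOptions::default())`.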


Omega359 commented on July 18, 2024

> I have read the context now and understand that this is about safe mode or what Spark calls ANSI mode.
>
> Isn't this just a case of adding a new flag to the session context that UDFs can choose to use when deciding whether to return null or throw an error?

That would be nice ... except UDFs don't have a way to access the session context currently :( Options #2 and #3 provide that via different mechanisms.


alamb commented on July 18, 2024

I wonder if we could take a page from what @jayzhan211 is implementing in #10560 and go with a trait

So we could implement something like

```rust
let expr = to_timestamp(lit("2021-01-01"))
  // set the to_timestamp mode to "safe"
  .safe();
```

I realize that this would require changing the call sites, so maybe it isn't viable.

