Comments (9)
I am pretty sure from our (InfluxData's) perspective, the prepared statement use case is not much of a priority (especially compared to making planning overall faster), so @appletreeisyellow likely can't spend a lot of time on this issue (though maybe she feels differently).
+1. I agree with @alamb's comment and will leave this issue open for anyone else who wants to improve it.
from arrow-datafusion.
Yes!
from arrow-datafusion.
Thank you @simonvandel -- that makes total sense.
FWIW the physical optimizer passes that create an ExecutionPlan still do non-trivial work even after the LogicalPlan is created.
I agree this is a reasonable use case, where running the optimizer takes non-trivial time compared to query execution time. In fact, this was the original use case for parameterized queries in OLTP engines, where the cost of planning dominated the cost of actually running the query, so reusing a prepared statement was an important optimization.
The use case is much less common in classic analytic systems, as query execution time was often so much longer than even tens of milliseconds of planning time.
However, as analytics is pushed everywhere, planning time becomes more important, and I think the fact that the DataFusion optimizer is so slow makes this even more pronounced. Ergo, I think making planning faster via #5637 is very important.
I am pretty sure from our (InfluxData's) perspective, the prepared statement use case is not much of a priority (especially compared to making planning overall faster), so @appletreeisyellow likely can't spend a lot of time on this issue (though maybe she feels differently).
Thus, what I suggest is that we polish up #9073, which improves the error messages, and then leave this particular ticket open for anyone else who might have the use case and wants to improve it.
from arrow-datafusion.
I agree this would be good to fix -- cc @kallisti-dev who I think is working on something related in IOx
from arrow-datafusion.
@appletreeisyellow are you willing to look at this ticket?
from arrow-datafusion.
Note there is a function that infers placeholder types: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Expr.html#method.infer_placeholder_types
Maybe we need to call it prior to the optimizer being run 🤔
from arrow-datafusion.
Maybe this was fixed in #8977 by @kallisti-dev
from arrow-datafusion.
Hi @simonvandel -- I wonder if you can provide some context about your expected behavior
The optimizer does not fail at the placeholder, but optimizes as much as it can with the info it has.
It looks like the change in #9073 from @appletreeisyellow assumes the flow is:
- Create a logical plan
- Provide the parameters
- Run the optimizer
The description on this ticket implies your flow would be
- Create the logical plan
- Run the optimizer
- Provide the parameters
Is this correct? Is there a reason you want to run the optimizer prior to providing parameter values?
The reason I ask is that depending on what you are trying to do the changes/solution may be different.
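To make the two orderings concrete, here is a minimal sketch using a toy expression tree rather than DataFusion's real `LogicalPlan`. The `Expr` enum and the `optimize`, `bind`, and `sample_plan` functions are hypothetical stand-ins invented for illustration, and the "optimizer" only folds constants. It shows the behavior described above: optimizing before parameters are provided does not fail at the placeholder, it just simplifies whatever does not depend on it.

```rust
use std::collections::HashMap;

/// Toy expression tree (hypothetical; not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(i64),
    Placeholder(String),
    Add(Box<Expr>, Box<Expr>),
}

/// Toy "optimizer": folds `Add` of two literals and leaves placeholders alone.
fn optimize(expr: &Expr) -> Expr {
    match expr {
        Expr::Add(l, r) => match (optimize(l), optimize(r)) {
            (Expr::Literal(a), Expr::Literal(b)) => Expr::Literal(a + b),
            (l, r) => Expr::Add(Box::new(l), Box::new(r)),
        },
        other => other.clone(),
    }
}

/// Replaces placeholders with literal values; unknown ids are left untouched.
fn bind(expr: &Expr, params: &HashMap<String, i64>) -> Expr {
    match expr {
        Expr::Placeholder(id) => match params.get(id) {
            Some(v) => Expr::Literal(*v),
            None => expr.clone(),
        },
        Expr::Add(l, r) => Expr::Add(Box::new(bind(l, params)), Box::new(bind(r, params))),
        Expr::Literal(_) => expr.clone(),
    }
}

/// Builds the toy plan `$1 + (2 + 3)`.
fn sample_plan() -> Expr {
    Expr::Add(
        Box::new(Expr::Placeholder("$1".to_string())),
        Box::new(Expr::Add(
            Box::new(Expr::Literal(2)),
            Box::new(Expr::Literal(3)),
        )),
    )
}
```

In this model, `optimize(&bind(&plan, &params))` (parameters first) and `optimize(&bind(&optimize(&plan), &params))` (optimizer first, then parameters, then a second cheap pass) produce the same final expression. Whether the same holds for every real optimizer rule is exactly the open question in this ticket.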
from arrow-datafusion.
Hi @alamb
I sometimes have workloads where the time taken to plan a query dominates the time to actually perform the query. This can happen in the trivial case where the TableProviders return no data.
Let's say that it's the same logical plan being used for every workload, just with different parameters. So then, my idea was to generate a logical plan with parameter placeholders, optimize that, and cache it. Then in each actual request, pull the optimized logical plan, and replace the placeholders with actual values.
The idea was that optimizing the final logical plan, once the placeholders were replaced, would be very fast. This might have been naive.
So, long story short, I wanted to cache optimized logical plans so they can be reused with different parameters.
Note that the optimizer is now quite a bit faster with the latest PRs, so the optimization time might not be a problem anymore. But I can still imagine the use case being relevant.
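The caching idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `PlanCache` and `bind_first_param` are hypothetical names, the "plan" is any cloneable value (a plain `String` works), and the optimizer is passed in as a closure instead of calling DataFusion. The `optimizer_runs` counter is there only to show that the expensive step runs once per distinct query text.

```rust
use std::collections::HashMap;

/// Hypothetical cache of optimized plan templates, keyed by query text.
struct PlanCache<P> {
    plans: HashMap<String, P>,
    /// Number of times the (expensive) planning/optimization closure ran.
    optimizer_runs: usize,
}

impl<P: Clone> PlanCache<P> {
    fn new() -> Self {
        Self {
            plans: HashMap::new(),
            optimizer_runs: 0,
        }
    }

    /// Returns the cached template for `query`, or builds it with
    /// `plan_and_optimize` and stores it on a cache miss.
    fn get_or_optimize(
        &mut self,
        query: &str,
        plan_and_optimize: impl FnOnce(&str) -> P,
    ) -> P {
        if let Some(plan) = self.plans.get(query) {
            return plan.clone();
        }
        let plan = plan_and_optimize(query);
        self.optimizer_runs += 1;
        self.plans.insert(query.to_string(), plan.clone());
        plan
    }
}

/// Stand-in for parameter substitution: replaces `$1` in a textual plan.
fn bind_first_param(template: &str, value: i64) -> String {
    template.replace("$1", &value.to_string())
}
```

Each request would then call `get_or_optimize` with the same query text and bind its own values into the returned template. As noted earlier in the thread, the physical planning step would still run per request, so this only saves the logical optimization cost.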
from arrow-datafusion.