Comments (13)
Will work on this one. I think it might also involve moving nth_value
since they shared some code.
from datafusion.
@jayzhan211 I pushed a work in progress here #11029 it still fails some test cases in sqllogictests
External error: query failed: DataFusion error: This feature is not implemented: Aggregate can not be used as a sliding accumulator because `retract_batch` is not implemented: NTH_VALUE(aggregate_test_100.c4,Int64(3)) ORDER BY [aggregate_test_100.c9 ASC NULLS LAST] ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING
[SQL] SELECT
NTH_VALUE(c4, 3) OVER(ORDER BY c9 ASC ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING) as nth_value1,
NTH_VALUE(c4, 2) OVER(ORDER BY c9 ASC ROWS BETWEEN 1 PRECEDING AND 3 FOLLOWING) as nth_value2
FROM aggregate_test_100
ORDER BY c9
LIMIT 5
at test_files/window.slt:1205
External error: query failed: DataFusion error: External error: Arrow error: Invalid argument error: column types must match schema types, expected List(Field { name: "item", data_type: Struct([Field { name: "sn@1", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) but found List(Field { name: "item", data_type: Struct([Field { name: "sn@0", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) at column index 2
[SQL] SELECT ARRAY_AGG(e.rate ORDER BY e.sn)
FROM sales_global AS s
JOIN exchange_rates AS e
ON s.currency = e.currency_from AND
e.currency_to = 'USD' AND
s.ts >= e.ts
GROUP BY s.sn
ORDER BY s.sn;
at test_files/group_by.slt:3181
The first one seems to be because the new NTH_VALUE
UDAF is picked over the builtin window function with the same name. Is this expected? What is the correct course of action to resolve it?
The second one looks a bit weird to me, not sure if I messed something up or I hitting some other issue.
@jayzhan211 If you have time to provide some pointers that would be highly appreciated :)
from datafusion.
@jayzhan211 I will not be able to work on it for the two weeks due to being on vacation. So, someone else should feel to pick it up/take it over before then.
from datafusion.
I suggest we convert 1. ArrayAgg 2 DistinctArrayAgg, 3. OrderSensitiveArrayAgg and 3. NthValue separately.
OrderSensitiveArrayAgg
and NthValue
are quite complex.
from datafusion.
ArrayAgg and Nth expect to have parameter nullable
, we need to add it to AggregateFunctionExpr
so we can add it to StateFieldsArgs
for state_fields
and change the nullable
for field
.
We can get nullable
with
from datafusion.
I suggest we convert ArrayAgg, DistinctArrayAgg, OrderSensitiveArrayAgg and NthValue separately.
At least leave
OrderSensitiveArrayAgg
andNthValue
in another PR, since they have ordering and window function so it is quite complex
Sounds good. If only converting ArrayAgg
how do one handle that there are multiple expressions using the same name? Should it not be registered? Or should I just leave the existing ArrayAgg code as is? The accumulator is private so it can not easily be reused, or is it acceptable to make it public in this intermediate state?
from datafusion.
ArrayAgg and Nth expect to have parameter
nullable
, we need to add it toAggregateFunctionExpr
so we can add it toStateFieldsArgs
forstate_fields
and change thenullable
forfield
.
Make sense. How come we only provide a single value for input_type
(https://github.com/apache/datafusion/blob/39.0.0/datafusion/physical-expr-common/src/aggregate/mod.rs#L245) can aggregates not have multiple inputs? Should the nullable field be just input_nullable: bool
or should it be inputs_nullable: Vec<bool>
?
from datafusion.
ArrayAgg and Nth expect to have parameter
nullable
, we need to add it toAggregateFunctionExpr
so we can add it toStateFieldsArgs
forstate_fields
and change thenullable
forfield
.Make sense. How come we only provide a single value for
input_type
(https://github.com/apache/datafusion/blob/39.0.0/datafusion/physical-expr-common/src/aggregate/mod.rs#L245) can aggregates not have multiple inputs? Should the nullable field be justinput_nullable: bool
or should it beinputs_nullable: Vec<bool>
?
We have single input because we have not meet any function that need multiple input yet. If there is any function that expect multiple input, we can extend it to Vec
from datafusion.
I suggest we convert ArrayAgg, DistinctArrayAgg, OrderSensitiveArrayAgg and NthValue separately.
At least leaveOrderSensitiveArrayAgg
andNthValue
in another PR, since they have ordering and window function so it is quite complexSounds good. If only converting
ArrayAgg
how do one handle that there are multiple expressions using the same name? Should it not be registered? Or should I just leave the existing ArrayAgg code as is? The accumulator is private so it can not easily be reused, or is it acceptable to make it public in this intermediate state?
We could check the distinct
and ordering
to know whether we should use builtin or UDAF here
datafusion/datafusion/core/src/physical_planner.rs
Lines 1825 to 1909 in 18042fd
The accumulator is private so it can not easily be reused, or is it acceptable to make it public in this intermediate state?
We can move tophysical_expr_common
first, after all the related function is done, then move it back.
from datafusion.
We have single input because we have not meet any function that need multiple input yet. If there is any function that expect multiple input, we can extend it to Vec
What about covariance: https://github.com/apache/datafusion/blob/main/datafusion/functions-aggregate/src/covariance.rs#L43 that takes 2 arguments.
from datafusion.
@jayzhan211 Created a PR for only doning ArrayAgg here #11045 will look into adding nullable
next.
from datafusion.
array_agg is known to produce non-null result.
if we make it an UDAF, to avoid regression (if we choose so), we need a way for an UDAF to mark its output as non-null: #11274
from datafusion.
@eejbyfeldt Dp you plan to work on array_agg
?
from datafusion.
Related Issues (20)
- Reduce repetition in `try_process_group_by_unnest` and `try_process_unnest`
- Investigate memory use in debug builds for deeply nested array constants
- [EPIC] Extract remaining physical optimizer out of core
- Leverage dictionary-encode when turning a scalar columnar value into an array HOT 2
- [Proposal] Decouple logical from physical types HOT 5
- Chore: fix typos in the code
- Explore Updating VariadicAny Signature to take 0 Args HOT 2
- Resources exhuasted errors are confusing return the biggest memory consumers. HOT 3
- `SHOW NONSENSE` does not error
- chore: Add `SessionState` to `MockContextProvider` just like `SessionContextProvider` HOT 1
- The HashJoin and NestedLoopJoin gives different results for filtered joins fuzz tests. HOT 1
- Add fuzz tests for SortMergeJoin spilling
- ExprBuilder for Physical Aggregate Expr
- Use SimpleExtensions for Substrait type variations
- Provide valid extensionUris and extensionUriReferences in Substrait
- Easier Dataframe API for `map` HOT 1
- Crash bug when `log()` is used in `order by` clause (SQLancer)
- Parsing SQL strings to Exprs wtih the qualified schema
- Query with `order by acos(sin(v1))` panic (SQLancer) HOT 1
- Add reservoir sampling
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from datafusion.