Comments (10)
Will work on this one. I think it might also involve moving nth_value
since they shared some code.
from arrow-datafusion.
@jayzhan211 I pushed a work in progress here #11029 it still fails some test cases in sqllogictests
External error: query failed: DataFusion error: This feature is not implemented: Aggregate can not be used as a sliding accumulator because `retract_batch` is not implemented: NTH_VALUE(aggregate_test_100.c4,Int64(3)) ORDER BY [aggregate_test_100.c9 ASC NULLS LAST] ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING
[SQL] SELECT
NTH_VALUE(c4, 3) OVER(ORDER BY c9 ASC ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING) as nth_value1,
NTH_VALUE(c4, 2) OVER(ORDER BY c9 ASC ROWS BETWEEN 1 PRECEDING AND 3 FOLLOWING) as nth_value2
FROM aggregate_test_100
ORDER BY c9
LIMIT 5
at test_files/window.slt:1205
External error: query failed: DataFusion error: External error: Arrow error: Invalid argument error: column types must match schema types, expected List(Field { name: "item", data_type: Struct([Field { name: "sn@1", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) but found List(Field { name: "item", data_type: Struct([Field { name: "sn@0", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) at column index 2
[SQL] SELECT ARRAY_AGG(e.rate ORDER BY e.sn)
FROM sales_global AS s
JOIN exchange_rates AS e
ON s.currency = e.currency_from AND
e.currency_to = 'USD' AND
s.ts >= e.ts
GROUP BY s.sn
ORDER BY s.sn;
at test_files/group_by.slt:3181
The first one seems to be because the new NTH_VALUE
UDAF is picked over the builtin window function with the same name. Is this expected? What is the correct course of action to resolve it?
The second one looks a bit weird to me, not sure if I messed something up or I hitting some other issue.
@jayzhan211 If you have time to provide some pointers that would be highly appreciated :)
from arrow-datafusion.
I suggest we convert 1. ArrayAgg 2 DistinctArrayAgg, 3. OrderSensitiveArrayAgg and 3. NthValue separately.
OrderSensitiveArrayAgg
and NthValue
are quite complex.
from arrow-datafusion.
ArrayAgg and Nth expect to have parameter nullable
, we need to add it to AggregateFunctionExpr
so we can add it to StateFieldsArgs
for state_fields
and change the nullable
for field
.
We can get nullable
with
from arrow-datafusion.
I suggest we convert ArrayAgg, DistinctArrayAgg, OrderSensitiveArrayAgg and NthValue separately.
At least leave
OrderSensitiveArrayAgg
andNthValue
in another PR, since they have ordering and window function so it is quite complex
Sounds good. If only converting ArrayAgg
how do one handle that there are multiple expressions using the same name? Should it not be registered? Or should I just leave the existing ArrayAgg code as is? The accumulator is private so it can not easily be reused, or is it acceptable to make it public in this intermediate state?
from arrow-datafusion.
ArrayAgg and Nth expect to have parameter
nullable
, we need to add it toAggregateFunctionExpr
so we can add it toStateFieldsArgs
forstate_fields
and change thenullable
forfield
.
Make sense. How come we only provide a single value for input_type
(https://github.com/apache/datafusion/blob/39.0.0/datafusion/physical-expr-common/src/aggregate/mod.rs#L245) can aggregates not have multiple inputs? Should the nullable field be just input_nullable: bool
or should it be inputs_nullable: Vec<bool>
?
from arrow-datafusion.
ArrayAgg and Nth expect to have parameter
nullable
, we need to add it toAggregateFunctionExpr
so we can add it toStateFieldsArgs
forstate_fields
and change thenullable
forfield
.Make sense. How come we only provide a single value for
input_type
(https://github.com/apache/datafusion/blob/39.0.0/datafusion/physical-expr-common/src/aggregate/mod.rs#L245) can aggregates not have multiple inputs? Should the nullable field be justinput_nullable: bool
or should it beinputs_nullable: Vec<bool>
?
We have single input because we have not meet any function that need multiple input yet. If there is any function that expect multiple input, we can extend it to Vec
from arrow-datafusion.
I suggest we convert ArrayAgg, DistinctArrayAgg, OrderSensitiveArrayAgg and NthValue separately.
At least leaveOrderSensitiveArrayAgg
andNthValue
in another PR, since they have ordering and window function so it is quite complexSounds good. If only converting
ArrayAgg
how do one handle that there are multiple expressions using the same name? Should it not be registered? Or should I just leave the existing ArrayAgg code as is? The accumulator is private so it can not easily be reused, or is it acceptable to make it public in this intermediate state?
We could check the distinct
and ordering
to know whether we should use builtin or UDAF here
datafusion/datafusion/core/src/physical_planner.rs
Lines 1825 to 1909 in 18042fd
The accumulator is private so it can not easily be reused, or is it acceptable to make it public in this intermediate state?
We can move tophysical_expr_common
first, after all the related function is done, then move it back.
from arrow-datafusion.
We have single input because we have not meet any function that need multiple input yet. If there is any function that expect multiple input, we can extend it to Vec
What about covariance: https://github.com/apache/datafusion/blob/main/datafusion/functions-aggregate/src/covariance.rs#L43 that takes 2 arguments.
from arrow-datafusion.
@jayzhan211 Created a PR for only doning ArrayAgg here #11045 will look into adding nullable
next.
from arrow-datafusion.
Related Issues (20)
- `LCM` panicked due to overflow
- Running tests uses 50.1GB on Ubuntu HOT 5
- DataFusion weekly project plan (Andrew Lamb) - June 24, 2024 HOT 1
- Make Eq with negative float zero compat with other engines HOT 1
- Support `Decimal` and `Decimal256` Parquet Data Page Statistics HOT 1
- Support `Timestamp` Parquet Data Page Statistics HOT 4
- Support `Date` Parquet Data Page Statistics HOT 1
- Support `Time` Parquet Data Page Statistics HOT 3
- regression: query error with case when as the group field HOT 6
- ScalarValue::new_ten error cites one not ten
- Support Map as a ScalarValue HOT 4
- Consider using `Arc::clone` to clone Arcs to make it clear they aren't deep copies HOT 1
- Support unparsing the Value Plan of Array (List) to SQL String HOT 2
- Support `Dictionary` in Parquet Metadata Statistics HOT 4
- Is pre-compile pattern string in regexp_match operation HOT 3
- Implement physical plan serialization for COPY plans `CsvLogicalExtensionCodec` HOT 1
- Remove special casting of `Min` / `Max` built in `AggregateFunctions` HOT 1
- Remove `Min`/`Max` references from `AggregateExec::get_minmax_descr` HOT 1
- Remove `Min`/`Max` references from `AggregateStatistics` HOT 1
- Union columns can never be `NULL` (I think?)
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from arrow-datafusion.