Comments (7)
> Have custom logic for unions that looks up the child array to determine if the value is null
+1 for second option. I think we should check the children's nullability.
from arrow-datafusion.
I've proposed a fix in #11321.
@alamb any idea on where I would start looking to try and fix this?
I would suggest writing a standalone test case / reproducer as the first step.
Then I suspect we can either help you find the code that needs to be fixed, or maybe someone would even be interested in fixing it themselves.
See #11314 as a demonstration of the problem for both dense and sparse unions.
After a bit of investigation, the issue lies, in the first instance, with
datafusion/datafusion/physical-expr/src/expressions/is_null.rs
Lines 74 to 84 in 08c5345
Then with this code in arrow-rs:
/// Returns a non-null [BooleanArray] with whether each value of the array is null.
/// # Error
/// This function never errors.
/// # Example
/// ...
pub fn is_null(input: &dyn Array) -> Result<BooleanArray, ArrowError> {
    let values = match input.logical_nulls() {
        None => BooleanBuffer::new_unset(input.len()),
        Some(nulls) => !nulls.inner(),
    };
    Ok(BooleanArray::new(values, None))
}
And then with this code
/// Union types always return non null as there is no validity buffer.
/// To check validity correctly you must check the underlying vector.
fn is_null(&self, _index: usize) -> bool {
    false
}
Ultimately with the spec
Unlike other data types, unions do not have their own validity bitmap. Instead,
the nullness of each slot is determined exclusively by the child arrays which
are composed to create the union.
Basically arrow is saying "we're not going to tell you if a union is null, you need to look in the child arrays", but datafusion isn't listening and is just asking the union if it's null in the naive way.
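To make the mismatch concrete, here is a minimal sketch using only the standard library. `Child`, `SparseUnion`, `is_null_naive`, `is_null_via_child` and `example_union` are hypothetical names for illustration, not arrow-rs types or methods, and type ids are simplified to plain child indices. It models a sparse union, where every child has the same length as the union, and contrasts the "always false" answer with the one the spec describes.

```rust
/// Hypothetical stand-in for an Arrow child array; only validity is modelled.
/// `true` means the slot is valid, `false` means it is null.
struct Child {
    validity: Vec<bool>,
}

/// Hypothetical model of a sparse union: one type id per slot, and each
/// child is as long as the union itself. (Assumption: type ids are used
/// directly as indices into `children`.)
struct SparseUnion {
    type_ids: Vec<usize>,
    children: Vec<Child>,
}

impl SparseUnion {
    /// Mirrors the naive behaviour quoted above: the union has no validity
    /// bitmap of its own, so it always reports "not null".
    fn is_null_naive(&self, _index: usize) -> bool {
        false
    }

    /// Follows the spec text: the slot is null exactly when the slot of the
    /// child selected by the type id is null.
    fn is_null_via_child(&self, index: usize) -> bool {
        let child = &self.children[self.type_ids[index]];
        !child.validity[index]
    }
}

/// Three slots: slot 0 picks child 0 (valid), slot 1 picks child 1 (null),
/// slot 2 picks child 0 (null).
fn example_union() -> SparseUnion {
    SparseUnion {
        type_ids: vec![0, 1, 0],
        children: vec![
            Child { validity: vec![true, true, false] },
            Child { validity: vec![true, false, true] },
        ],
    }
}
```

In this model the naive check reports every slot as non-null, while the child lookup reports slots 1 and 2 as null, which is the discrepancy described above.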
Two options to move forward as far as I can tell:
- Decide unions in DF can never be null; I'll need to abandon unions in datafusion-functions-json and just return strings everywhere
- Have custom logic for unions that looks up the child array to determine if the value is null
If (as I hope) we go for the second option, there's also the issue (as demonstrated by #11314) that the representation of "null" union items doesn't match other types: it shows {A=} instead of an empty string.
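As a rough illustration of what the second option could look like for a dense union, the sketch below builds an IS NULL mask by following each slot's type id and offset into the matching child. It uses only the standard library; `DenseUnion` and `dense_union_is_null` are hypothetical names, not DataFusion or arrow-rs APIs, and type ids are again assumed to be plain child indices.

```rust
/// Hypothetical model of a dense union. Each slot stores a type id selecting
/// a child and an offset into that child. Children are represented only by
/// their validity (`true` = valid, `false` = null).
struct DenseUnion {
    type_ids: Vec<usize>,
    offsets: Vec<usize>,
    child_validity: Vec<Vec<bool>>,
}

/// Returns one bool per union slot: `true` when the referenced child slot
/// is null. This is the "look in the child array" logic, computed for the
/// whole array at once.
fn dense_union_is_null(u: &DenseUnion) -> Vec<bool> {
    u.type_ids
        .iter()
        .zip(u.offsets.iter())
        .map(|(&type_id, &offset)| !u.child_validity[type_id][offset])
        .collect()
}
```

A real implementation would also have to map Arrow's signed type ids to child fields and handle sliced arrays; those details are left out here.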
I suppose there's a third option of updating arrow-rs to correctly calculate if a UnionArray is null, but I presume that would take much longer.
> I suppose there's a third option of updating arrow-rs to correctly calculate if a UnionArray is null, but I presume that would take much longer
It would likely take longer
Note there is a method that takes into account child nullability that perhaps we could use instead of is_null:
https://docs.rs/arrow/latest/arrow/array/trait.Array.html#method.logical_nulls
Update: it turns out logical_nulls is incorrect for UnionArray: apache/arrow-rs#6017