
Comments (7)

jayzhan211 commented on August 15, 2024

Have custom logic for unions that looks up the child array to determine if the value is null

+1 for second option. I think we should check the children's nullability.

from arrow-datafusion.

samuelcolvin commented on August 15, 2024

I've proposed a fix in #11321.

samuelcolvin commented on August 15, 2024

@alamb any idea on where I would start looking to try and fix this?

alamb commented on August 15, 2024

I would suggest writing a standalone test case / reproducer as the first step.

Then I suspect we can either help you find the code needed to be fixed (or maybe even someone would be interested in fixing it themselves)

samuelcolvin commented on August 15, 2024

See #11314 as a demonstration of the problem for both dense and sparse unions.

After a bit of investigation, the issue lies, in the first instance, with:

fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
    let arg = self.arg.evaluate(batch)?;
    match arg {
        ColumnarValue::Array(array) => Ok(ColumnarValue::Array(Arc::new(
            compute::is_null(array.as_ref())?,
        ))),
        ColumnarValue::Scalar(scalar) => Ok(ColumnarValue::Scalar(
            ScalarValue::Boolean(Some(scalar.is_null())),
        )),
    }
}

Then with this code in arrow-rs:

/// Returns a non-null [BooleanArray] with whether each value of the array is null.
/// # Error
/// This function never errors.
/// # Example
/// ...
pub fn is_null(input: &dyn Array) -> Result<BooleanArray, ArrowError> {
    let values = match input.logical_nulls() {
        None => BooleanBuffer::new_unset(input.len()),
        Some(nulls) => !nulls.inner(),
    };

    Ok(BooleanArray::new(values, None))
}

And then with this code:

    /// Union types always return non null as there is no validity buffer.
    /// To check validity correctly you must check the underlying vector.
    fn is_null(&self, _index: usize) -> bool {
        false
    }

Ultimately with the spec:

Unlike other data types, unions do not have their own validity bitmap. Instead,
the nullness of each slot is determined exclusively by the child arrays which
are composed to create the union.


Basically arrow is saying "we're not going to tell you if a union is null, you need to look in the child arrays", but datafusion isn't listening and is just asking the union if it's null in the naive way.
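To make the mismatch concrete, here is a minimal sketch in plain Rust. It does not use the arrow-rs API: `ToyUnion`, `Child`, `naive_is_null` and `child_is_null` are hypothetical names invented for illustration, and the child arrays are modelled as `Vec<Option<i64>>`. It only assumes the layout described in the spec excerpt above: a type id per slot, plus an offset per slot for dense unions.

```rust
/// Toy stand-in for a child array: each slot is either a value or null.
/// (Hypothetical; real Arrow children can be any array type.)
type Child = Vec<Option<i64>>;

/// Hypothetical, simplified model of a union array. Not the arrow-rs type.
struct ToyUnion {
    /// For each slot, the index of the child that holds the value.
    type_ids: Vec<usize>,
    /// `Some` for a dense union: the position inside the selected child.
    /// `None` for a sparse union: the child is indexed by the slot itself.
    offsets: Option<Vec<usize>>,
    children: Vec<Child>,
}

impl ToyUnion {
    /// Mirrors the behaviour quoted above: with no validity bitmap of its
    /// own, the union reports every slot as non-null.
    fn naive_is_null(&self, _index: usize) -> bool {
        false
    }

    /// Child-aware check: follow the type id (and the offset, if dense)
    /// into the selected child and ask the child instead.
    fn child_is_null(&self, index: usize) -> bool {
        let child = &self.children[self.type_ids[index]];
        let child_index = match &self.offsets {
            Some(offsets) => offsets[index],
            None => index,
        };
        child[child_index].is_none()
    }
}
```

With a sparse union whose third slot points at a null in the first child, `naive_is_null(2)` still answers `false` while `child_is_null(2)` answers `true`, which is the discrepancy shown in #11314.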

Two options to move forward as far as I can tell:

  1. Decide unions in DF can never be null, in which case I'll need to abandon unions in datafusion-functions-json and just return strings everywhere
  2. Have custom logic for unions that looks up the child array to determine if the value is null
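As a rough sketch of what the second option could look like, the null mask can be computed by dispatching on the array kind and, for unions, reading each slot's answer from the child it selects. This is plain Rust with hypothetical names (`ToyArray`, `is_null_mask`), not DataFusion or arrow-rs code, and it only covers a sparse union for brevity.

```rust
/// Hypothetical, simplified array model. Not the arrow-rs types.
enum ToyArray {
    /// An ordinary array whose nulls are known per slot.
    Primitive(Vec<Option<i64>>),
    /// A sparse union: every child has one entry per slot, and
    /// `type_ids[slot]` says which child is active for that slot.
    SparseUnion {
        type_ids: Vec<usize>,
        children: Vec<ToyArray>,
    },
}

/// Returns one bool per slot, `true` where the slot is null.
fn is_null_mask(array: &ToyArray) -> Vec<bool> {
    match array {
        // Ordinary arrays: the array itself knows which slots are null.
        ToyArray::Primitive(values) => values.iter().map(Option::is_none).collect(),
        // Unions: compute each child's mask, then pick the entry of the
        // child selected by the type id of each slot.
        ToyArray::SparseUnion { type_ids, children } => {
            let child_masks: Vec<Vec<bool>> = children.iter().map(is_null_mask).collect();
            type_ids
                .iter()
                .enumerate()
                .map(|(slot, &type_id)| child_masks[type_id][slot])
                .collect()
        }
    }
}
```

A dense union would need the extra offset lookup into the selected child, and a real implementation would build a boolean buffer rather than a `Vec<bool>`.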

If (as I hope) we go for the second option, there's also the issue (as demonstrated by #11314) that the representation of "null" union items doesn't match other types, it shows {A=} instead of an empty string.

samuelcolvin commented on August 15, 2024

I suppose there's a third option of updating arrow-rs to correctly calculate if a UnionArray is null, but I presume that would take much longer.

alamb commented on August 15, 2024

I suppose there's a third option of updating arrow-rs to correctly calculate if a UnionArray is null, but I presume that would take much longer.

It would likely take longer

Note there is a method that takes into account child nullability that perhaps we could use instead of is_null:
https://docs.rs/arrow/latest/arrow/array/trait.Array.html#method.logical_nulls

Update: it turns out logical_nulls is incorrect for UnionArray: apache/arrow-rs#6017
