Is your feature request related to a problem or challenge? Part of

I'll take this one, can you assign me <a class="user-mention notranslate"

I'll take this one, can you assign me <a class="user-mention notranslate" data-hoverca

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

<a class="user-mention notranslate" data-hovercard-type="user" data-hover

use StringViewArray when reading String columns from Parquet about arrow-datafusion HOT 11 OPEN

alamb commented on August 26, 2024

use StringViewArray when reading String columns from Parquet

from arrow-datafusion.

Comments (11)

XiangpengHao commented on August 26, 2024 2

I have manually checked that there isn't any unexpected conversion.

The logical plan is

Projection: t.c1, t.c2 [c1:Utf8View;N, c2:Utf8;N]
  Filter: t.c2 != Utf8("small") [c1:Utf8View;N, c2:Utf8;N]
    TableScan: t [c1:Utf8View;N, c2:Utf8;N]

The physical plan is

CoalesceBatchesExec: target_batch_size=8192
  FilterExec: c2@1 != small
    ParquetExec: file_groups={1 group: [[tmp/.tmp4ADQEP/parquet_test/part-0.parquet]]}, projection=[c1, c2], predicate=c2@1 != small, pruning_predicate=CASE WHEN c2_null_count@2 = c2_row_count@3 THEN false ELSE c2_min@0 != small OR small != c2_max@1 END, required_guarantees=[c2 not in (small)]

I looked at the filter exec, it will eventually call into: https://github.com/XiangpengHao/datafusion/blob/string-view/datafusion/physical-expr/src/expressions/binary.rs#L260

Which calls compare_op from arrow.

I'll add more tests to demonstrate that the filter works out of the box.

from arrow-datafusion.

alamb commented on August 26, 2024 1

I'll take this one, can you assign me @alamb ?

BTW you can assign yourself (single word comment take): https://datafusion.apache.org/contributor-guide/index.html#finding-and-creating-issues-to-work-on

from arrow-datafusion.

XiangpengHao commented on August 26, 2024 1

I like the idea of using an optimizer rule to optimistically/optionally use StringView!

from arrow-datafusion.

alamb commented on August 26, 2024 1

BTW the more I think about this the more I think it should probably be a PhysicalOptimizer rule (not a logical optimizer rule) as the change is related to the particular properties of the ExecutionPlans (rather than the logical type)

https://docs.rs/datafusion/latest/datafusion/physical_optimizer/optimizer/trait.PhysicalOptimizerRule.html

from arrow-datafusion.

jayzhan211 commented on August 26, 2024 1

Without changing schema, we need to convert StringArray to StringViewArray -> compute -> convert back to StringArray. It would be nice if the conversion is negligible

In my case #9403 , we need to

Convert to StringViewArray and introduce CastExpr to keep the schema unchanged in physical optimziation
Run GroupStream with StringViewArray
Apply CastExpr to get the StringArray back.

And, the cost of these 3 should always beat with processing with StringArray directly regardless of what kind of second computation is.

from arrow-datafusion.

XiangpengHao commented on August 26, 2024

I'll take this one, can you assign me @alamb ?

from arrow-datafusion.

XiangpengHao commented on August 26, 2024

It seems that the current filter and parquet exec play nicely, and the following code will directly filter on the string view array (instead of converting to StringArrary), which is quite nice.

(the test case requires the latest arrow-rs to run)

#[tokio::test]
async fn parquet_read_filter_string_view() {
    let tmp_dir = TempDir::new().unwrap();

    let values = vec![Some("small"), None, Some("Larger than 12 bytes array")];
    let c1: ArrayRef = Arc::new(StringViewArray::from_iter(values.iter()));
    let c2: ArrayRef = Arc::new(StringArray::from_iter(values.iter()));

    let batch =
        RecordBatch::try_from_iter(vec![("c1", c1.clone()), ("c2", c2.clone())]).unwrap();

    let file_name = {
        let table_dir = tmp_dir.path().join("parquet_test");
        std::fs::create_dir(&table_dir).unwrap();
        let file_name = table_dir.join("part-0.parquet");
        let mut writer = ArrowWriter::try_new(
            fs::File::create(&file_name).unwrap(),
            batch.schema(),
            None,
        )
        .unwrap();
        writer.write(&batch).unwrap();
        writer.close().unwrap();
        file_name
    };

    let ctx = SessionContext::new();
    ctx.register_parquet("t", file_name.to_str().unwrap(), Default::default())
        .await
        .unwrap();

    async fn display_result(sql: &str, ctx: &SessionContext) {
        let result = ctx.sql(sql).await.unwrap().collect().await.unwrap();

        arrow::util::pretty::print_batches(&result).unwrap();

        for b in result {
            println!("schema: {:?}", b.schema());
        }
    }

    display_result("SELECT * from t", &ctx).await;
    display_result("SELECT * from t where c1 <> 'small'", &ctx).await;
    display_result("SELECT * from t where c2 <> 'small'", &ctx).await;
}

I'll take a closer look at the generated logical/physical plans to verify that the string view array is never being converted to string array. And if that is the case, the remaining work of this issue is probably (1) add an option to force reading StringArray as StringView array, and (2) add more tests and potentially test the generated plan uses StringViewArray consistently.

from arrow-datafusion.

alamb commented on August 26, 2024

@jayzhan211 and I are talking about something similar in #9403 (comment)

What do you think about potentially adding a new OptimizerRule, something like

struct UseStringView {}

/// Optimizer rule that rewrites portions of the Plan to use `StringViewArray` instead of 
/// `StringArray` where the operators support the new type
///
/// (some background on StringArray and why it is better for some operators)
///
/// This rule currently supports:
/// 1. Reading strings from ParquetExec (which can save a copy of the string)
/// 2. GroupBy operation
/// ...
impl OptimzierRule for UseViews { 
...
}

from arrow-datafusion.

jayzhan211 commented on August 26, 2024

@jayzhan211 and I are talking about something similar in #9403 (comment)

What do you think about potentially adding a new OptimizerRule, something like

struct UseStringView {}

/// Optimizer rule that rewrites portions of the Plan to use `StringViewArray` instead of 
/// `StringArray` where the operators support the new type
///
/// (some background on StringArray and why it is better for some operators)
///
/// This rule currently supports:
/// 1. Reading strings from ParquetExec (which can save a copy of the string)
/// 2. GroupBy operation
/// ...
impl OptimzierRule for UseViews { 
...
}

We usually ensure the schema remain unchanged after applying the optimize rule, should this be the special case?

datafusion/datafusion/optimizer/src/optimizer.rs

Lines 389 to 393 in 4109f58

 // verify the rule didn't change the schema 

 .and_then(|tnr| { 

 assert_schema_is_the_same(rule.name(), &starting_schema, &tnr.data)?; 

 Ok(tnr) 

 });

from arrow-datafusion.

alamb commented on August 26, 2024

We usually ensure the schema remain unchanged after applying the optimize rule, should this be the special case?

I think we need to keep the schema the same (that is a pretty far reaching invariant)

But maybe we could do something like add a projection to convert it bac

So like if the input plan was

FilterExec [Utf8]
  ParquetExec[Utf8]

We could implement an optimizer rule that made a plan like

ProjectionExec(cast(col) as Utf8) [utf8]
  Filter[Utf8View]
    ParquetExec[Utf8View]

🤔

from arrow-datafusion.

XiangpengHao commented on August 26, 2024

BTW the more I think about this the more I think it should probably be a PhysicalOptimizer rule (not a logical optimizer rule) as the change is related to the particular properties of the ExecutionPlans (rather than the logical type)

https://docs.rs/datafusion/latest/datafusion/physical_optimizer/optimizer/trait.PhysicalOptimizerRule.html

I'll try to prototype a physical optimizer

from arrow-datafusion.

use StringViewArray when reading String columns from Parquet about arrow-datafusion HOT 11 OPEN

Comments (11)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

	// verify the rule didn't change the schema
	.and_then(\|tnr\| {
	assert_schema_is_the_same(rule.name(), &starting_schema, &tnr.data)?;
	Ok(tnr)
	});