Comments (11)
I have manually checked that there isn't any unexpected conversion.
The logical plan is:

```
Projection: t.c1, t.c2 [c1:Utf8View;N, c2:Utf8;N]
  Filter: t.c2 != Utf8("small") [c1:Utf8View;N, c2:Utf8;N]
    TableScan: t [c1:Utf8View;N, c2:Utf8;N]
```
The physical plan is:

```
CoalesceBatchesExec: target_batch_size=8192
  FilterExec: c2@1 != small
    ParquetExec: file_groups={1 group: [[tmp/.tmp4ADQEP/parquet_test/part-0.parquet]]}, projection=[c1, c2], predicate=c2@1 != small, pruning_predicate=CASE WHEN c2_null_count@2 = c2_row_count@3 THEN false ELSE c2_min@0 != small OR small != c2_max@1 END, required_guarantees=[c2 not in (small)]
```
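As an aside, the `pruning_predicate` above reads as ordinary statistics logic. Here is a self-contained sketch (illustrative names, not DataFusion code) of how that CASE expression decides whether a row group may contain rows matching `c2 != 'small'`:

```rust
/// Mirrors `CASE WHEN c2_null_count = c2_row_count THEN false
///          ELSE c2_min != lit OR lit != c2_max END`:
/// a row group may contain matches for `c2 != lit` unless every
/// value is null, or every non-null value equals `lit`.
fn may_contain_matches(null_count: u64, row_count: u64, min: &str, max: &str, lit: &str) -> bool {
    if null_count == row_count {
        // All values are null; `c2 != lit` can never evaluate to true.
        false
    } else {
        // If min == lit == max, every non-null value equals `lit`,
        // so the row group can be pruned.
        min != lit || lit != max
    }
}

fn main() {
    // Row group where every value is "small": pruned.
    assert!(!may_contain_matches(0, 3, "small", "small", "small"));
    // Row group that also spans other values: must be kept.
    assert!(may_contain_matches(1, 3, "Larger than 12 bytes array", "small", "small"));
    println!("pruning logic behaves as expected");
}
```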
I looked at the FilterExec; it eventually calls into https://github.com/XiangpengHao/datafusion/blob/string-view/datafusion/physical-expr/src/expressions/binary.rs#L260, which calls `compare_op` from arrow.
I'll add more tests to demonstrate that the filter works out of the box.
from arrow-datafusion.
I'll take this one, can you assign me @alamb ?
BTW you can assign yourself (single word comment `take`): https://datafusion.apache.org/contributor-guide/index.html#finding-and-creating-issues-to-work-on
I like the idea of using an optimizer rule to optimistically/optionally use StringView!
BTW the more I think about this the more I think it should probably be a PhysicalOptimizer rule (not a logical optimizer rule) as the change is related to the particular properties of the ExecutionPlans (rather than the logical type)
Without changing the schema, we need to convert StringArray to StringViewArray -> compute -> convert back to StringArray. It would be nice if the conversion cost were negligible.
In my case (#9403), we need to:
- Convert to StringViewArray and introduce a CastExpr to keep the schema unchanged in physical optimization
- Run GroupStream with StringViewArray
- Apply CastExpr to get the StringArray back.
And the combined cost of these three steps should always beat processing with StringArray directly, regardless of what the second computation is.
It seems that the current filter and parquet exec play nicely, and the following code will directly filter on the string view array (instead of converting to StringArray), which is quite nice.
(the test case requires the latest arrow-rs to run)
```rust
#[tokio::test]
async fn parquet_read_filter_string_view() {
    let tmp_dir = TempDir::new().unwrap();
    let values = vec![Some("small"), None, Some("Larger than 12 bytes array")];
    let c1: ArrayRef = Arc::new(StringViewArray::from_iter(values.iter()));
    let c2: ArrayRef = Arc::new(StringArray::from_iter(values.iter()));
    let batch =
        RecordBatch::try_from_iter(vec![("c1", c1.clone()), ("c2", c2.clone())]).unwrap();

    let file_name = {
        let table_dir = tmp_dir.path().join("parquet_test");
        std::fs::create_dir(&table_dir).unwrap();
        let file_name = table_dir.join("part-0.parquet");

        let mut writer = ArrowWriter::try_new(
            fs::File::create(&file_name).unwrap(),
            batch.schema(),
            None,
        )
        .unwrap();
        writer.write(&batch).unwrap();
        writer.close().unwrap();
        file_name
    };

    let ctx = SessionContext::new();
    ctx.register_parquet("t", file_name.to_str().unwrap(), Default::default())
        .await
        .unwrap();

    async fn display_result(sql: &str, ctx: &SessionContext) {
        let result = ctx.sql(sql).await.unwrap().collect().await.unwrap();
        arrow::util::pretty::print_batches(&result).unwrap();
        for b in result {
            println!("schema: {:?}", b.schema());
        }
    }

    display_result("SELECT * from t", &ctx).await;
    display_result("SELECT * from t where c1 <> 'small'", &ctx).await;
    display_result("SELECT * from t where c2 <> 'small'", &ctx).await;
}
```
I'll take a closer look at the generated logical/physical plans to verify that the string view array is never converted to a string array. If that is the case, the remaining work for this issue is probably to (1) add an option to force reading StringArray as StringViewArray, and (2) add more tests, potentially verifying that the generated plan uses StringViewArray consistently.
@jayzhan211 and I are talking about something similar in #9403 (comment)
What do you think about potentially adding a new OptimizerRule, something like
```rust
struct UseStringView {}

/// Optimizer rule that rewrites portions of the plan to use `StringViewArray`
/// instead of `StringArray` where the operators support the new type
///
/// (some background on StringView and why it is better for some operators)
///
/// This rule currently supports:
/// 1. Reading strings from ParquetExec (which can save a copy of the string)
/// 2. GroupBy operation
/// ...
impl OptimizerRule for UseStringView {
    ...
}
```
> @jayzhan211 and I are talking about something similar in #9403 (comment)
>
> What do you think about potentially adding a new OptimizerRule, something like `struct UseStringView {} ... impl OptimizerRule for UseStringView { ... }`
We usually ensure the schema remains unchanged after applying an optimizer rule; should this be a special case? (see datafusion/optimizer/src/optimizer.rs, lines 389 to 393 at 4109f58)
> We usually ensure the schema remains unchanged after applying an optimizer rule; should this be a special case?
I think we need to keep the schema the same (that is a pretty far-reaching invariant). But maybe we could do something like add a projection to convert it back.
So if the input plan was

```
FilterExec [Utf8]
  ParquetExec [Utf8]
```

we could implement an optimizer rule that made a plan like

```
ProjectionExec(cast(col) as Utf8) [Utf8]
  Filter [Utf8View]
    ParquetExec [Utf8View]
```
🤔
> BTW the more I think about this the more I think it should probably be a PhysicalOptimizer rule (not a logical optimizer rule) as the change is related to the particular properties of the ExecutionPlans (rather than the logical type)
I'll try to prototype a physical optimizer rule.