Comments (3)
Here is some additional information. When I run TPC-H query 5 in the benchmarks, against DataFusion, I see that the physical plan used partitioned joins.
For example, I see that both inputs to the join are partitioned on the join keys, and the join mode is Partitioned
.
HashJoinExec: mode=Partitioned, join_type=Inner, on=[("c_custkey", "o_custkey")]
RepartitionExec: partitioning=Hash([Column { name: "c_custkey" }], 24)
ParquetExec: batch_size=8192, limit=None, partitions=[...]
RepartitionExec: partitioning=Hash([Column { name: "o_custkey" }], 24)
FilterExec: o_orderdate >= CAST(1994-01-01 AS Date32) AND o_orderdate < CAST(1995-01-01 AS Date32)
ParquetExec: batch_size=8192, limit=None, partitions=[...]
This means that the join can run in parallel because the inputs are partitioned. So partition 1 of the join reads partition 1 of the left and right inputs, and so on.
When I run the same query against Ballista, I see.
HashJoinExec: mode=CollectLeft, join_type=Inner, on=[("c_custkey", "o_custkey")]
ParquetExec: batch_size=8192, limit=None, partitions=[...]
FilterExec: o_orderdate >= CAST(1994-01-01 AS Date32) AND o_orderdate < CAST(1995-01-01 AS Date32)
ParquetExec: batch_size=8192, limit=None, partitions=[
Here, we see join mode CollectLeft
, which means that each partition being executed will go and fetch the entire left-side of the join into memory. This is very inefficient both in terms of memory and compute and potentially gets exponentially slower the more partitions we have.
What we need to do is apply the same "partitioned hash join" pattern to Ballista.
from datafusion.
I created a Google doc to discuss the design, and planned work, in more detail.
https://docs.google.com/document/d/1yUnGWsHKYOAxWijDJisEFYU4dIym_GSRSMpwfWjVZq8/edit?usp=sharing
from datafusion.
I'd love to work on this if someone can provide further reading material and/or the area in the code
from datafusion.
Related Issues (20)
- DataFusion ignores "column order" parquet statistics specification
- DataFusion reads Date32 and Date64 parquet statistics in as Int32Array HOT 2
- Pass per-field BigQuery `OPTIONS` values to the LogicalPlan's Arrow Schema
- Expand Test Coverage for ScalarUDF's
- Make the configuration for `StreamTable` more generic to support more stream sources
- Support `date_bin` on timestamps with timezone, properly accounting for Daylight Savings Time HOT 12
- Incorrect statistics read for unsigned integer columns in parquet HOT 1
- Incorrect statistics read for binary columns in parquet
- Implement a benchmark for extracting arrow statistics from parquet HOT 1
- Incorrect statistics read for struct array in parquet HOT 1
- PlaceholderRowExec shown when select from union results. HOT 2
- Make it easier to register object stores HOT 2
- MySQL doesn't support the `NULLS FIRST/LAST` clause in `ORDER BY` statements
- Improve performance of extracting statistics from parquet files HOT 1
- Examples of using `TreeNode` APIs to walk and manipulate LogicalPlans HOT 2
- The `limit` info lost in the AggregateExec when ser/deser the physical plan HOT 6
- Make TaskContext wrap SessionState
- Make SQL strings generated from Exprs even "prettier"
- Consolidate tests for unparser / plan to sql to make them easier to find
- Implement protobuf serialization for LogicalPlan::Unnest
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from datafusion.