Comments (3)
Oh, now I see how the hash conflict is resolved. Thanks for your kind reply!
from arrow-datafusion.
Hash collision tests are in place to guard against these types of issues, which is why I don't foresee any problem in this area.
To explain further, the hash table algorithm stores the raw hash values. As you've pointed out, hash collisions can occur for different keys. However, after retrieving the indices, we perform a vectorized equality check. This step allows us to resolve any hash collisions that may arise.
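To make the two-step idea concrete, here is a minimal sketch in plain Rust. It is not the DataFusion implementation: `weak_hash`, `build`, `candidates`, and `resolve` are hypothetical names, the hash is deliberately weak so that different keys collide, and the equality check is a simple scalar loop rather than a vectorized kernel.

```rust
use std::collections::HashMap;

// Hypothetical, deliberately weak hash so that distinct keys collide.
fn weak_hash(key: i64) -> u64 {
    key.rem_euclid(4) as u64
}

// Build side: map each raw hash value to the row indices that produced it.
// Only the hash is stored as the map key, not the join key itself.
fn build(keys: &[i64]) -> HashMap<u64, Vec<usize>> {
    let mut table: HashMap<u64, Vec<usize>> = HashMap::new();
    for (row, key) in keys.iter().enumerate() {
        table.entry(weak_hash(*key)).or_default().push(row);
    }
    table
}

// Probe side: collect candidate (build_row, probe_row) pairs by hash alone.
// Colliding keys show up here as false positives.
fn candidates(table: &HashMap<u64, Vec<usize>>, probe: &[i64]) -> Vec<(usize, usize)> {
    let mut pairs = Vec::new();
    for (probe_row, key) in probe.iter().enumerate() {
        if let Some(rows) = table.get(&weak_hash(*key)) {
            for &build_row in rows {
                pairs.push((build_row, probe_row));
            }
        }
    }
    pairs
}

// Equality check: keep only the pairs whose actual keys match,
// which discards the false positives caused by hash collisions.
fn resolve(
    pairs: &[(usize, usize)],
    build_keys: &[i64],
    probe_keys: &[i64],
) -> Vec<(usize, usize)> {
    pairs
        .iter()
        .copied()
        .filter(|&(b, p)| build_keys[b] == probe_keys[p])
        .collect()
}
```

With build keys `[1, 5, 2]` and probe keys `[5, 3]`, keys 1 and 5 share a hash, so probing with 5 returns both build rows 0 and 1 as candidates. After `resolve`, only the pair for build row 1 remains.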
You may check this method and its usage:
pub fn equal_rows_arr(
    indices_left: &UInt64Array,
    indices_right: &UInt32Array,
    left_arrays: &[ArrayRef],
    right_arrays: &[ArrayRef],
    null_equals_null: bool,
) -> Result<(UInt64Array, UInt32Array)> {
    let mut iter = left_arrays.iter().zip(right_arrays.iter());

    let (first_left, first_right) = iter.next().ok_or_else(|| {
        DataFusionError::Internal(
            "At least one array should be provided for both left and right".to_string(),
        )
    })?;

    let arr_left = take(first_left.as_ref(), indices_left, None)?;
    let arr_right = take(first_right.as_ref(), indices_right, None)?;

    let mut equal: BooleanArray = eq_dyn_null(&arr_left, &arr_right, null_equals_null)?;

    // Use map and try_fold to iterate over the remaining pairs of arrays.
    // In each iteration, take is used on the pair of arrays and their equality is determined.
    // The results are then folded (combined) using the and function to get a final equality result.
    equal = iter
        .map(|(left, right)| {
            let arr_left = take(left.as_ref(), indices_left, None)?;
            let arr_right = take(right.as_ref(), indices_right, None)?;
            eq_dyn_null(arr_left.as_ref(), arr_right.as_ref(), null_equals_null)
        })
        .try_fold(equal, |acc, equal2| and(&acc, &equal2?))?;

    let filter_builder = FilterBuilder::new(&equal).optimize().build();

    let left_filtered = filter_builder.filter(indices_left)?;
    let right_filtered = filter_builder.filter(indices_right)?;

    Ok((
        downcast_array(left_filtered.as_ref()),
        downcast_array(right_filtered.as_ref()),
    ))
}
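For readers without the Arrow kernels at hand, the same take / compare / and / filter pipeline can be sketched with the standard library only. This is an illustrative analogue, not the code above: `Column`, `eq_null`, and `equal_rows` are hypothetical names, columns are assumed to hold nullable `i64` values, and the two index slices are assumed to have the same length.

```rust
// `Column` is a hypothetical stand-in for an Arrow array of nullable i64.
type Column = Vec<Option<i64>>;

// Compare two values; `null_equals_null` decides whether None matches None.
fn eq_null(a: Option<i64>, b: Option<i64>, null_equals_null: bool) -> bool {
    match (a, b) {
        (Some(x), Some(y)) => x == y,
        (None, None) => null_equals_null,
        _ => false,
    }
}

fn equal_rows(
    indices_left: &[usize],
    indices_right: &[usize],
    left_columns: &[Column],
    right_columns: &[Column],
    null_equals_null: bool,
) -> Result<(Vec<usize>, Vec<usize>), String> {
    if left_columns.is_empty() || right_columns.is_empty() {
        return Err(
            "At least one column should be provided for both left and right".to_string(),
        );
    }

    // Start from an all-true mask and AND in each column pair's comparison,
    // looking values up through the candidate indices (the "take" step).
    let mask = left_columns.iter().zip(right_columns.iter()).fold(
        vec![true; indices_left.len()],
        |acc, (left, right)| {
            acc.iter()
                .zip(indices_left.iter().zip(indices_right.iter()))
                .map(|(&keep, (&l, &r))| keep && eq_null(left[l], right[r], null_equals_null))
                .collect()
        },
    );

    // Keep only the index pairs whose mask entry is true (the "filter" step).
    let kept: Vec<(usize, usize)> = indices_left
        .iter()
        .copied()
        .zip(indices_right.iter().copied())
        .zip(mask)
        .filter(|(_, keep)| *keep)
        .map(|(pair, _)| pair)
        .collect();

    Ok(kept.into_iter().unzip())
}
```

A candidate pair survives only if every key column agrees, so a pair that matches on the first column but differs on the second is dropped, just as a hash-collision false positive would be.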