This test shouldn't be passing: <div class="snippet-clipboard-content notranslate

It's possible this PR is what introduce the bug: <a class="issue-link js-issue-link" d

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

The it_does_not_throw_with_different_schema test exposes a bug about chispa HOT 17 CLOSED

mrpowers commented on May 31, 2024

The it_does_not_throw_with_different_schema test exposes a bug

from chispa.

Comments (17)

robertkossendey commented on May 31, 2024 1

I can work on it! :)

from chispa.

MrPowers commented on May 31, 2024

It's possible this PR is what introduce the bug: #34. Or perhaps this bug has always existed.

from chispa.

MrPowers commented on May 31, 2024

I added a test to illustrate assert_df_equality is working as intended with ignore_schema=True, but assert_approx_df_equality is not working properly with ignore_schema=True, see this commit: 1e64612

FYI @orcascope

from chispa.

orcascope commented on May 31, 2024

I feel like this bug had always been there. What is the expected functionality to compare two data frames, when the two data frames got different field names ?
The schema_equality functions checks for name, datatype and optionally nullability match between the two schemas.
The are_rows_approx_equal function is iterating on the field names. How does it work when we ignore_schema ?
@MrPowers @robertkossendey

from chispa.

MrPowers commented on May 31, 2024

@robertkossendey - assigned it to you ;)

@orcascope - the expectation when ignore_schema=True is to strictly compare DataFrame contents. So column names can be different. But entirely different rows shouldn't be considered equal. Per the tests I added, it seems like assert_df_equality is working as expected and assert_approx_df_equality isn't working as expected. Not sure what the underlying issue is here, but we can just fix it and make the code better 🚀 Will definitely appreciate your code review on this one!

from chispa.

robertkossendey commented on May 31, 2024

@MrPowers I am not sure if ignore_schema as of today works, since if you add the same test for assert_df_equality you end up with the same problem.

from chispa.

MrPowers commented on May 31, 2024

@robertkossendey - I might be missing something. I thought I did add the exact same test for assert_df_equality in this commit...

In any case, let's fix it in both places if it's broken!

from chispa.

robertkossendey commented on May 31, 2024

Sorry, it seems like only posted half of my message:

I fixed the bug, but I discovered that another test now fails. This got me thinking: how should ignore_schema behave? And can you pass ignore_schema in combination with ignore_column_order?

Or does ignore_schema only mean: "I don't care how my columns are called". Since for comparison we need to rely on the order of the columns if they are named differently.

Maybe I am missing something, because what we do in assert_df_equality is rows1 != rows2: which requires columns to be in the same order.

from chispa.

MrPowers commented on May 31, 2024

@robertkossendey - can you push up your PR, so I can take a look?

Looks like the feature was originally added here: #31

It just looks like the current ignore_schema logic is completely broken.

from chispa.

robertkossendey commented on May 31, 2024

@MrPowers done!

from chispa.

MrPowers commented on May 31, 2024

@robertkossendey - looked at your PR and looks like it's going down the right path. We currently have the row_comparer, schema_comparer, and dataframe_comparer abstractions. I was thinking about this some more and think the missing abstraction is rows_comparer.

I don't think assert_df_equality(df1, df2, ignore_schema=True) is particularly elegant. It seems like the method we should be exposing is assert_rows_equality(df1.collect(), df2.collect()).

The current implementation allows for all sorts of weird argument mismatches that make no sense. Thoughts?

from chispa.

robertkossendey commented on May 31, 2024

@MrPowers yes, I think we should re-think the current API, since how would you compare with ignore_schema, ignore_column_order & ignore_row_order set to true

from chispa.

MrPowers commented on May 31, 2024

@robertkossendey @orcascope - can you please review this PR? #56

from chispa.

orcascope commented on May 31, 2024

Let me just summarize the changes I see.

The option to ignore_schema will be removed from the dataframe compare functions. This means that both dataframes should have the column names and datatypes matching to pass the test. There is no change to the option to ignore_nullable in the Structfield.

This change basically removes the root cause of the unexpected functionality happening in this loop.
for key in d1.keys() & d2.keys():

This looks good.

Why do you chose to remove the collect() happening inside the row_compare functions and instead doing collect while calling the function ? @MrPowers

from chispa.

MrPowers commented on May 31, 2024

@orcascope - think that's a good summary of the changes, thanks for reviewing.

We now have a few different comparers:

dataframe_comparer.py: assert_df_equality(df1, df2) - takes DataFrame arguments
rows_comparer.py: assert_basic_rows_equality(rows1, rows2) - takes rows arguments
row_comparer.py: are_rows_equal_enhanced(r1: Row, r2: Row) - takes row arguments

I chose to remove the collect() when calling the function because think it makes more sense for assert_basic_rows_equality to take rows arguments.

The other observation I will add is that there is a desire from some users to be able to compare DataFrame contents without considering the schema at all. I think the assert_basic_rows_equality / assert_generic_rows_equality can give them that feature. That's why I don't think we need assert_df_equality(df1, df2, ignore_schema=True) anymore.

Feel free to ping with more thoughts / questions. Really appreciate it.

from chispa.

orcascope commented on May 31, 2024

@MrPowers Sounds good.

from chispa.

MrPowers commented on May 31, 2024

ignore_schema option was removed to fix this bug: c12a501

from chispa.

The it_does_not_throw_with_different_schema test exposes a bug about chispa HOT 17 CLOSED

Comments (17)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent