Git Product home page Git Product logo

Comments (17)

robertkossendey avatar robertkossendey commented on May 31, 2024 1

I can work on it! :)

from chispa.

MrPowers avatar MrPowers commented on May 31, 2024

It's possible this PR is what introduce the bug: #34. Or perhaps this bug has always existed.

from chispa.

MrPowers avatar MrPowers commented on May 31, 2024

I added a test to illustrate assert_df_equality is working as intended with ignore_schema=True, but assert_approx_df_equality is not working properly with ignore_schema=True, see this commit: 1e64612

FYI @orcascope

from chispa.

orcascope avatar orcascope commented on May 31, 2024

I feel like this bug had always been there. What is the expected functionality to compare two data frames, when the two data frames got different field names ?
The schema_equality functions checks for name, datatype and optionally nullability match between the two schemas.
The are_rows_approx_equal function is iterating on the field names. How does it work when we ignore_schema ?
@MrPowers @robertkossendey

from chispa.

MrPowers avatar MrPowers commented on May 31, 2024

@robertkossendey - assigned it to you ;)

@orcascope - the expectation when ignore_schema=True is to strictly compare DataFrame contents. So column names can be different. But entirely different rows shouldn't be considered equal. Per the tests I added, it seems like assert_df_equality is working as expected and assert_approx_df_equality isn't working as expected. Not sure what the underlying issue is here, but we can just fix it and make the code better 🚀 Will definitely appreciate your code review on this one!

from chispa.

robertkossendey avatar robertkossendey commented on May 31, 2024

@MrPowers I am not sure if ignore_schema as of today works, since if you add the same test for assert_df_equality you end up with the same problem.

from chispa.

MrPowers avatar MrPowers commented on May 31, 2024

@robertkossendey - I might be missing something. I thought I did add the exact same test for assert_df_equality in this commit...

In any case, let's fix it in both places if it's broken!

from chispa.

robertkossendey avatar robertkossendey commented on May 31, 2024

Sorry, it seems like only posted half of my message:

I fixed the bug, but I discovered that another test now fails. This got me thinking: how should ignore_schema behave? And can you pass ignore_schema in combination with ignore_column_order?

Or does ignore_schema only mean: "I don't care how my columns are called". Since for comparison we need to rely on the order of the columns if they are named differently.

Maybe I am missing something, because what we do in assert_df_equality is rows1 != rows2: which requires columns to be in the same order.

from chispa.

MrPowers avatar MrPowers commented on May 31, 2024

@robertkossendey - can you push up your PR, so I can take a look?

Looks like the feature was originally added here: #31

It just looks like the current ignore_schema logic is completely broken.

from chispa.

robertkossendey avatar robertkossendey commented on May 31, 2024

@MrPowers done!

from chispa.

MrPowers avatar MrPowers commented on May 31, 2024

@robertkossendey - looked at your PR and looks like it's going down the right path. We currently have the row_comparer, schema_comparer, and dataframe_comparer abstractions. I was thinking about this some more and think the missing abstraction is rows_comparer.

I don't think assert_df_equality(df1, df2, ignore_schema=True) is particularly elegant. It seems like the method we should be exposing is assert_rows_equality(df1.collect(), df2.collect()).

The current implementation allows for all sorts of weird argument mismatches that make no sense. Thoughts?

from chispa.

robertkossendey avatar robertkossendey commented on May 31, 2024

@MrPowers yes, I think we should re-think the current API, since how would you compare with ignore_schema, ignore_column_order & ignore_row_order set to true

from chispa.

MrPowers avatar MrPowers commented on May 31, 2024

@robertkossendey @orcascope - can you please review this PR? #56

from chispa.

orcascope avatar orcascope commented on May 31, 2024

Let me just summarize the changes I see.

The option to ignore_schema will be removed from the dataframe compare functions. This means that both dataframes should have the column names and datatypes matching to pass the test. There is no change to the option to ignore_nullable in the Structfield.

This change basically removes the root cause of the unexpected functionality happening in this loop.
for key in d1.keys() & d2.keys():

This looks good.

Why do you chose to remove the collect() happening inside the row_compare functions and instead doing collect while calling the function ? @MrPowers

from chispa.

MrPowers avatar MrPowers commented on May 31, 2024

@orcascope - think that's a good summary of the changes, thanks for reviewing.

We now have a few different comparers:

  • dataframe_comparer.py: assert_df_equality(df1, df2) - takes DataFrame arguments
  • rows_comparer.py: assert_basic_rows_equality(rows1, rows2) - takes rows arguments
  • row_comparer.py: are_rows_equal_enhanced(r1: Row, r2: Row) - takes row arguments

I chose to remove the collect() when calling the function because think it makes more sense for assert_basic_rows_equality to take rows arguments.

The other observation I will add is that there is a desire from some users to be able to compare DataFrame contents without considering the schema at all. I think the assert_basic_rows_equality / assert_generic_rows_equality can give them that feature. That's why I don't think we need assert_df_equality(df1, df2, ignore_schema=True) anymore.

Feel free to ping with more thoughts / questions. Really appreciate it.

from chispa.

orcascope avatar orcascope commented on May 31, 2024

@MrPowers Sounds good.

from chispa.

MrPowers avatar MrPowers commented on May 31, 2024

ignore_schema option was removed to fix this bug: c12a501

from chispa.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.