Comments (17)
I can work on it! :)
from chispa.
It's possible this PR is what introduce the bug: #34. Or perhaps this bug has always existed.
from chispa.
I added a test to illustrate assert_df_equality
is working as intended with ignore_schema=True
, but assert_approx_df_equality
is not working properly with ignore_schema=True
, see this commit: 1e64612
FYI @orcascope
from chispa.
I feel like this bug had always been there. What is the expected functionality to compare two data frames, when the two data frames got different field names ?
The schema_equality functions checks for name, datatype and optionally nullability match between the two schemas.
The are_rows_approx_equal
function is iterating on the field names. How does it work when we ignore_schema ?
@MrPowers @robertkossendey
from chispa.
@robertkossendey - assigned it to you ;)
@orcascope - the expectation when ignore_schema=True
is to strictly compare DataFrame contents. So column names can be different. But entirely different rows shouldn't be considered equal. Per the tests I added, it seems like assert_df_equality
is working as expected and assert_approx_df_equality
isn't working as expected. Not sure what the underlying issue is here, but we can just fix it and make the code better 🚀 Will definitely appreciate your code review on this one!
from chispa.
@MrPowers I am not sure if ignore_schema as of today works, since if you add the same test for assert_df_equality you end up with the same problem.
from chispa.
@robertkossendey - I might be missing something. I thought I did add the exact same test for assert_df_equality in this commit...
In any case, let's fix it in both places if it's broken!
from chispa.
Sorry, it seems like only posted half of my message:
I fixed the bug, but I discovered that another test now fails. This got me thinking: how should ignore_schema behave? And can you pass ignore_schema in combination with ignore_column_order?
Or does ignore_schema only mean: "I don't care how my columns are called". Since for comparison we need to rely on the order of the columns if they are named differently.
Maybe I am missing something, because what we do in assert_df_equality is rows1 != rows2:
which requires columns to be in the same order.
from chispa.
@robertkossendey - can you push up your PR, so I can take a look?
Looks like the feature was originally added here: #31
It just looks like the current ignore_schema
logic is completely broken.
from chispa.
@MrPowers done!
from chispa.
@robertkossendey - looked at your PR and looks like it's going down the right path. We currently have the row_comparer
, schema_comparer
, and dataframe_comparer
abstractions. I was thinking about this some more and think the missing abstraction is rows_comparer
.
I don't think assert_df_equality(df1, df2, ignore_schema=True)
is particularly elegant. It seems like the method we should be exposing is assert_rows_equality(df1.collect(), df2.collect())
.
The current implementation allows for all sorts of weird argument mismatches that make no sense. Thoughts?
from chispa.
@MrPowers yes, I think we should re-think the current API, since how would you compare with ignore_schema
, ignore_column_order
& ignore_row_order
set to true
from chispa.
@robertkossendey @orcascope - can you please review this PR? #56
from chispa.
Let me just summarize the changes I see.
The option to ignore_schema
will be removed from the dataframe compare functions. This means that both dataframes should have the column names and datatypes matching to pass the test. There is no change to the option to ignore_nullable in the Structfield.
This change basically removes the root cause of the unexpected functionality happening in this loop.
for key in d1.keys() & d2.keys():
This looks good.
Why do you chose to remove the collect() happening inside the row_compare functions and instead doing collect while calling the function ? @MrPowers
from chispa.
@orcascope - think that's a good summary of the changes, thanks for reviewing.
We now have a few different comparers:
dataframe_comparer.py
:assert_df_equality(df1, df2)
- takes DataFrame argumentsrows_comparer.py
:assert_basic_rows_equality(rows1, rows2)
- takes rows argumentsrow_comparer.py
:are_rows_equal_enhanced(r1: Row, r2: Row)
- takes row arguments
I chose to remove the collect()
when calling the function because think it makes more sense for assert_basic_rows_equality
to take rows arguments.
The other observation I will add is that there is a desire from some users to be able to compare DataFrame contents without considering the schema at all. I think the assert_basic_rows_equality
/ assert_generic_rows_equality
can give them that feature. That's why I don't think we need assert_df_equality(df1, df2, ignore_schema=True)
anymore.
Feel free to ping with more thoughts / questions. Really appreciate it.
from chispa.
@MrPowers Sounds good.
from chispa.
ignore_schema
option was removed to fix this bug: c12a501
from chispa.
Related Issues (20)
- pkg_resources is deprecated, prettytable produces warning HOT 1
- Reconfigure CI to run tests on pull requests
- Run tests for multiple PySpark versions in CI HOT 5
- Refactor code to conform to PEP8 HOT 5
- Font colors in error messages are bad in some terminals HOT 1
- ignore_schema in assert_df_equality removed in 9.3? HOT 6
- assert_df_equality throws SchemasNotEqualError when the dataframes are identical (except for the metadata) HOT 3
- Replace poetry dev-dependencies with dependency groups HOT 2
- Add unit tests to highlight limitations of this library HOT 1
- Make proper StructField comparer and DataType comparer abstractions
- Give user control to customize output formatting HOT 2
- SchemaNotEqualError not showing the difference in metadata
- Issues on Python 3.10 due to six 1.15.0 HOT 4
- Unit tests are only run against a single version of Python on the `main` branch.
- underline_cells failing if dataframes are different lengths HOT 1
- assert None while ignore_metadata=True
- Unit testing the code with Spark Connect
- SchemaNotEqual error is unreadable for wide schemas HOT 1
- chispa 1.0 release
- Investigate "SPARK_TESTING" environment variable
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from chispa.