
Comments (8)

TinoSM commented on June 24, 2024

Thanks! Sorry for the late answer, I went on vacation.

                        "sql_condition" : "nested_column_l['first_name'] = nested_column_r['first_name']"

does indeed work in PySpark as well.


        # Imports below assume the Splink v3 module layout
        import splink.spark.comparison_level_library as cll
        from splink.comparison import Comparison
        from splink.comparison_level import ComparisonLevel

        Comparison(
            comparison_dict={
                "comparison_description": "test",
                "comparison_levels": [
                    cll.null_level(col_name="first_name"),
                    ComparisonLevel(
                        {
                            "label_for_charts": "test",
                            "sql_condition": "nested_column_l['first_name'] = nested_column_r['first_name']",
                        },
                        sql_dialect="spark",
                    ),
                    cll.else_level(),
                ],
            }
        )

from splink.

TinoSM commented on June 24, 2024

The workaround provided is good enough for me, so I'm going to close the ticket now, but if you want to keep track of it (to support dot notation, or maybe to add documentation on the issue), feel free to reopen.

Thanks @RobinL!

PS: The other option is to pre-process your data before you bring it into Splink to un-nest your column (i.e. put nested_column.first_name into a column called first_name). This can be hard with arrays, though, which we wanted to avoid :)


RobinL commented on June 24, 2024

Thanks for the report - I've had a look myself and I can confirm this is broken.

I think there is a workaround that works if you use the alternative syntax:

  • FAILS: nested_column.first_name
  • WORKS: nested_column['first_name']
Working version (only tested in DuckDB)
from splink.datasets import splink_datasets
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.fake_1000
df["nested_column"] = df.apply(
    lambda row: {"first_name": row["first_name"], "surname": row["surname"]}, axis=1
)


settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
         "l.nested_column['first_name'] = r.nested_column['first_name']",
    ],
    "comparisons": [
        {
            "output_column_name": "nested_column",
            "comparison_levels": [
                {
                    "sql_condition": "nested_column_l['surname'] IS NULL and nested_column_r['surname'] IS NULL",
                    "label_for_charts": "Null",
                    "is_null_level": True,
                },
                {
                    "sql_condition": "nested_column_l['surname']  = nested_column_r['surname']",
                    "label_for_charts": "Exact match",
                },
                {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
            ],
            "comparison_description": "Exact match vs. anything else",
        }
    ],
    "retain_intermediate_calculation_columns": True,
}


linker = DuckDBLinker(df, settings)

df_predict = linker.predict()

The other option is to pre-process your data before you bring it into Splink to un-nest your column (i.e. put nested_column.first_name into a column called first_name).
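A minimal sketch of that pre-processing option, using pandas (the column names here are hypothetical, mirroring the nested_column examples in this thread):

```python
import pandas as pd

# Hypothetical frame with a nested (struct-like) column.
df = pd.DataFrame(
    {"nested_column": [{"first_name": "ann", "surname": "smith"}]}
)

# Un-nest each field into its own top-level column before handing the
# data to Splink, so comparisons can refer to plain column names.
df["first_name"] = df["nested_column"].apply(lambda d: d["first_name"])
df["surname"] = df["nested_column"].apply(lambda d: d["surname"])
```

With the fields flattened, comparisons and blocking rules need no struct access at all, which sidesteps the parsing issue entirely (though, as noted above, this is harder for array-valued columns).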

Here's a duckdb-based example that I'm putting here to quickly replicate the problem when we have time to look at fixing this:

Reprex
from splink.datasets import splink_datasets
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.fake_1000
df["nested_column"] = df.apply(
    lambda row: {"first_name": row["first_name"], "surname": row["surname"]}, axis=1
)


settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.nested_column.first_name = r.nested_column.first_name",
    ],
    "comparisons": [
        {
            "output_column_name": "nested_column",
            "comparison_levels": [
                {
                    "sql_condition": "nested_column_l.surname IS NULL and nested_column_r.surname IS NULL",
                    "label_for_charts": "Null",
                    "is_null_level": True,
                },
                {
                    "sql_condition": "nested_column_l.surname  = nested_column_r.surname ",
                    "label_for_charts": "Exact match",
                },
                {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
            ],
            "comparison_description": "Exact match vs. anything else",
        }
    ],
    "retain_intermediate_calculation_columns": True,
}


linker = DuckDBLinker(df, settings)

df_predict = linker.predict()


RobinL commented on June 24, 2024

Fundamentally the bug comes from the ambiguity in:

[Image: excerpt from the DuckDB docs on dot-notation order of operations]

https://duckdb.org/docs/sql/data_types/struct.html#dot-notation-order-of-operations


RobinL commented on June 24, 2024

Related:
#880


TinoSM commented on June 24, 2024

Sorry, opening it again.

It does work when used on a single column (because sqlglot detects that column as "independent"). However, it breaks further down the line when the same sqlglot-parsed column is used in multiple comparisons, because of name clashes, i.e.:


        Comparison(
            comparison_dict={
                "comparison_description": "test",
                "comparison_levels": [
                    ComparisonLevel(
                        {
                            "label_for_charts": "test",
                            "sql_condition": "nested_column_l['first_name'] = nested_column_r['first_name']",
                        },
                        sql_dialect="spark",
                    ),
                    cll.else_level(),
                ],
            }
        ),
        Comparison(
            comparison_dict={
                "comparison_description": "test2",
                "comparison_levels": [
                    ComparisonLevel(
                        {
                            "label_for_charts": "test",
                            "sql_condition": "nested_column_l['surname'] = nested_column_r['surname']",
                        },
                        sql_dialect="spark",
                    ),
                    cll.else_level(),
                ],
            }
        )

Error was: [AMBIGUOUS_REFERENCE] Reference `gamma_nested_column` is ambiguous, could be: [`__splink__df_comparison_vectors`.`gamma_nested_column`, `__splink__df_comparison_vectors`.`gamma_nested_column`].

Log of the breaking SQL:

 __splink__df_comparison_vectors AS (
  SELECT
    "unique_id_l",
    "unique_id_r",
    "nested_column_l",
    "nested_column_r",
    CASE
      WHEN nested_column_l['first_name'] = nested_column_r['first_name']
      THEN 1
      ELSE 0
    END AS gamma_nested_column,
    CASE WHEN nested_column_l['surname'] = nested_column_r['surname'] THEN 1 ELSE 0 END AS gamma_nested_column,
    "surname_l",
    "surname_r"
  FROM __splink__df_blocked
)

I've been searching a little in Splink's source code and found get_columns_used_from_sql in parse_sql.py; the issue seems to be there. Brackets are more or less supported, but this use case doesn't seem to be detected.

Also, those column names get spread almost everywhere in the code, apparently under the assumption that they are valid column names for both input and output :/.

I'm going to try to flatten all those nested structs (i.e. nested_column__surname) in my real model, as this looks like it reaches too deep into the core (in my unfamiliar, non-expert opinion) to be easily fixed.


TinoSM commented on June 24, 2024

Updating again: I managed to fix it by setting an explicit "output_column_name" ("comparison1", "comparison2") on each comparison:
"comparisons": [
        #ctl.name_comparison("first_name")#works fine
        #ctl.name_comparison("nested_column.first_name"), #broken
        #ctl.name_comparison("element_at(array_column,1)") #also broken
        Comparison(
                comparison_dict={
                "output_column_name" : "comparison2",
                "comparison_description": "test",
                "comparison_levels": [
                    ComparisonLevel(
                        {"label_for_charts": "test", 
                            "sql_condition" : "nested_column_l['first_name'] = nested_column_r['first_name']"
                        }, sql_dialect="spark"
                    ),
                    cll.else_level()
                ]
                }
        ),
        Comparison(
                comparison_dict={
                "output_column_name" : "comparison1",
                "comparison_description": "test2",
                "comparison_levels": [
                    ComparisonLevel(
                        {"label_for_charts": "test", 
                            "sql_condition" : "nested_column_l['surname'] = nested_column_r['surname']"
                        }, sql_dialect="spark"
                    ),
                    cll.else_level()
                ]
                }
        )
    ],

The issue is in the parse_sql.py method get_columns_used_from_sql, which is used to infer the default name for comparison.py's _output_column_name (based on the columns the user's SQL reads). When a column access contains brackets, like nested_column_r['first_name'], the column is returned as just nested_column, so the bracketed key is never detected.

Not sure whether the ticket should remain open or not; the user experience is not good because of this, although workarounds are available. Up to your judgement.


TinoSM commented on June 24, 2024

Adding a proposed solution: change the default to use the SQL conditions rather than the sqlglot column names. Column names become quite long with this, though (but they are only used for debugging?), and I'm not sure whether a very long column name could cause issues with other, non-Spark backends.

    @property
    def _output_column_name(self):
        if "output_column_name" in self._comparison_dict:
            return self._comparison_dict["output_column_name"]
        else:
            sql_conditions = []
            for cl in self.comparison_levels:
                # maybe filter with `if cl.is_null_level:`?
                sql_conditions.append(
                    "".join(ch for ch in cl.sql_condition if ch.isalnum() or ch == "_")
                )

            if len(sql_conditions) == 1:
                return sql_conditions[0]
            else:
                return f"custom_{'_'.join(sql_conditions)}"
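Running that sanitisation on the bracketed condition from this issue shows what the generated default name would look like (long, but unambiguous between struct keys):

```python
# Stand-alone version of the sanitisation in the proposal above: keep only
# alphanumerics and underscores from a comparison level's SQL condition.
def sanitise(sql_condition: str) -> str:
    return "".join(ch for ch in sql_condition if ch.isalnum() or ch == "_")

name = sanitise("nested_column_l['first_name'] = nested_column_r['first_name']")
print(name)  # nested_column_lfirst_namenested_column_rfirst_name
```

Unlike the bare sqlglot column name, this keeps the 'first_name' key in the output, so a second comparison on 'surname' would produce a distinct gamma column.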

We could also go for using the description as the default, i.e.

_output_column_name = self._comparison_description (which already falls back to _output_column_name if available) OR "take all alphanumerics from the SQL conditions"

I'm writing a wrapper around Splink, so this is how the fix looks in my user-level API (i.e. fixing it outside of Splink):

                comparison_dict={
                    "output_column_name": "".join(
                        ch
                        for ch in cond.sql_condition.replace(" ", "__")
                        if ch.isalnum() or ch == "_"
                    ),
                    "comparison_description": cond.description,

