
Comments (8)

TinoSM commented on June 24, 2024

Thanks! Sorry for the late answer, I went on vacation.

                        "sql_condition" : "nested_column_l['first_name'] = nested_column_r['first_name']"

does indeed work in PySpark as well.


        # Imports below assume the Splink v3 module layout
        import splink.spark.comparison_level_library as cll
        from splink.comparison import Comparison
        from splink.comparison_level import ComparisonLevel

        Comparison(
            comparison_dict={
                "comparison_description": "test",
                "comparison_levels": [
                    cll.null_level(col_name="first_name"),
                    ComparisonLevel(
                        {
                            "label_for_charts": "test",
                            "sql_condition": "nested_column_l['first_name'] = nested_column_r['first_name']",
                        },
                        sql_dialect="spark",
                    ),
                    cll.else_level(),
                ],
            }
        )

from splink.

TinoSM commented on June 24, 2024

The workaround provided is good enough for me, so I'm going to close the ticket now, but if you want to keep track of it (to support dot notation, or maybe to add documentation on the issue), feel free to reopen.

Thanks @RobinL!

PS: The other option is to pre-process your data before you bring it into Splink to un-nest your column (i.e. put nested_column.first_name into a column called first_name). This can be hard with arrays, though, which we wanted to avoid :)


RobinL commented on June 24, 2024

Thanks for the report - I've had a look myself and I can confirm this is broken.

I think there is a workaround that works if you use the alternative syntax:

  • FAILS: nested_column.first_name
  • WORKS: nested_column['first_name']
Working version (only tested in DuckDB)
from splink.datasets import splink_datasets
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.fake_1000
df["nested_column"] = df.apply(
    lambda row: {"first_name": row["first_name"], "surname": row["surname"]}, axis=1
)


settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
         "l.nested_column['first_name'] = r.nested_column['first_name']",
    ],
    "comparisons": [
        {
            "output_column_name": "nested_column",
            "comparison_levels": [
                {
                    "sql_condition": "nested_column_l['surname'] IS NULL and nested_column_r['surname'] IS NULL",
                    "label_for_charts": "Null",
                    "is_null_level": True,
                },
                {
                    "sql_condition": "nested_column_l['surname']  = nested_column_r['surname']",
                    "label_for_charts": "Exact match",
                },
                {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
            ],
            "comparison_description": "Exact match vs. anything else",
        }
    ],
    "retain_intermediate_calculation_columns": True,
}


linker = DuckDBLinker(df, settings)

df_predict = linker.predict()

The other option is to pre-process your data before you bring it into Splink to un-nest your column (i.e. put nested_column.first_name into a column called first_name).
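A minimal sketch of that pre-processing option, using pandas (the column names here are hypothetical, mirroring the nested_column examples in this thread):

```python
import pandas as pd

# Hypothetical frame with a nested (struct-like) column.
df = pd.DataFrame(
    {"nested_column": [{"first_name": "ann", "surname": "smith"}]}
)

# Un-nest each field into its own top-level column before handing the
# data to Splink, so comparisons can refer to plain column names.
df["first_name"] = df["nested_column"].apply(lambda d: d["first_name"])
df["surname"] = df["nested_column"].apply(lambda d: d["surname"])
```

With the fields flattened, comparisons and blocking rules need no struct access at all, which sidesteps the parsing issue entirely (though, as noted above, this is harder for array-valued columns).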

Here's a duckdb-based example that I'm putting here to quickly replicate the problem when we have time to look at fixing this:

Reprex
from splink.datasets import splink_datasets
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.fake_1000
df["nested_column"] = df.apply(
    lambda row: {"first_name": row["first_name"], "surname": row["surname"]}, axis=1
)


settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.nested_column.first_name = r.nested_column.first_name",
    ],
    "comparisons": [
        {
            "output_column_name": "nested_column",
            "comparison_levels": [
                {
                    "sql_condition": "nested_column_l.surname IS NULL and nested_column_r.surname IS NULL",
                    "label_for_charts": "Null",
                    "is_null_level": True,
                },
                {
                    "sql_condition": "nested_column_l.surname  = nested_column_r.surname ",
                    "label_for_charts": "Exact match",
                },
                {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
            ],
            "comparison_description": "Exact match vs. anything else",
        }
    ],
    "retain_intermediate_calculation_columns": True,
}


linker = DuckDBLinker(df, settings)

df_predict = linker.predict()


RobinL commented on June 24, 2024

Fundamentally the bug comes from the ambiguity in:

[Image: excerpt from the DuckDB docs on dot-notation order of operations]

https://duckdb.org/docs/sql/data_types/struct.html#dot-notation-order-of-operations


RobinL commented on June 24, 2024

Related:
#880


TinoSM commented on June 24, 2024

Sorry, opening it again.

It does work when used on a single column (because sqlglot detects that column as "independent"). However, it breaks further down the line when the same sqlglot-parsed column is used in multiple comparisons, because of name clashes, i.e.:


        Comparison(
            comparison_dict={
                "comparison_description": "test",
                "comparison_levels": [
                    ComparisonLevel(
                        {
                            "label_for_charts": "test",
                            "sql_condition": "nested_column_l['first_name'] = nested_column_r['first_name']",
                        },
                        sql_dialect="spark",
                    ),
                    cll.else_level(),
                ],
            }
        ),
        Comparison(
            comparison_dict={
                "comparison_description": "test2",
                "comparison_levels": [
                    ComparisonLevel(
                        {
                            "label_for_charts": "test",
                            "sql_condition": "nested_column_l['surname'] = nested_column_r['surname']",
                        },
                        sql_dialect="spark",
                    ),
                    cll.else_level(),
                ],
            }
        )

Error was: [AMBIGUOUS_REFERENCE] Reference `gamma_nested_column` is ambiguous, could be: [`__splink__df_comparison_vectors`.`gamma_nested_column`, `__splink__df_comparison_vectors`.`gamma_nested_column`].

Log of the breaking SQL:

 __splink__df_comparison_vectors AS (
  SELECT
    "unique_id_l",
    "unique_id_r",
    "nested_column_l",
    "nested_column_r",
    CASE
      WHEN nested_column_l['first_name'] = nested_column_r['first_name']
      THEN 1
      ELSE 0
    END AS gamma_nested_column,
    CASE WHEN nested_column_l['surname'] = nested_column_r['surname'] THEN 1 ELSE 0 END AS gamma_nested_column,
    "surname_l",
    "surname_r"
  FROM __splink__df_blocked
)

I've been searching a little in Splink's source code and found get_columns_used_from_sql in parse_sql.py; the issue seems to be there. Brackets are more or less supported, but this use case doesn't seem to be detected.

Also, those column names get spread almost everywhere in the code, apparently under the assumption that they are valid column names for both input and output :/.

I'm going to try to flatten all those nested structs (i.e. nested_column__surname) in my real model, as this looks like it reaches too deep into the core (in my unfamiliar, non-expert opinion) to be easily fixed.


TinoSM commented on June 24, 2024

Updating again: I managed to fix it by setting an explicit "output_column_name" ("comparison1", "comparison2") on each comparison:
"comparisons": [
        #ctl.name_comparison("first_name")#works fine
        #ctl.name_comparison("nested_column.first_name"), #broken
        #ctl.name_comparison("element_at(array_column,1)") #also broken
        Comparison(
                comparison_dict={
                "output_column_name" : "comparison2",
                "comparison_description": "test",
                "comparison_levels": [
                    ComparisonLevel(
                        {"label_for_charts": "test", 
                            "sql_condition" : "nested_column_l['first_name'] = nested_column_r['first_name']"
                        }, sql_dialect="spark"
                    ),
                    cll.else_level()
                ]
                }
        ),
        Comparison(
                comparison_dict={
                "output_column_name" : "comparison1",
                "comparison_description": "test2",
                "comparison_levels": [
                    ComparisonLevel(
                        {"label_for_charts": "test", 
                            "sql_condition" : "nested_column_l['surname'] = nested_column_r['surname']"
                        }, sql_dialect="spark"
                    ),
                    cll.else_level()
                ]
                }
        )
    ],

The issue is in the parse_sql.py method get_columns_used_from_sql, which is used to infer the default name for comparison.py's _output_column_name (based on the columns the user's SQL reads). When a column access contains brackets, like nested_column_r['first_name'], the column is returned as just nested_column, so the bracketed key is never detected.

Not sure whether the ticket should remain open or not; the user experience is not good because of this, although workarounds are available. Up to your judgement.


TinoSM commented on June 24, 2024

Adding a proposed solution: change the default to use the SQL conditions rather than the sqlglot column names. Column names become quite long with this, though (but they are only used for debugging?), and I'm not sure whether a very long column name could cause issues with other, non-Spark backends.

    @property
    def _output_column_name(self):
        if "output_column_name" in self._comparison_dict:
            return self._comparison_dict["output_column_name"]
        else:
            sql_conditions = []
            for cl in self.comparison_levels:
                # maybe filter with `if cl.is_null_level:`?
                sql_conditions.append(
                    "".join(ch for ch in cl.sql_condition if ch.isalnum() or ch == "_")
                )

            if len(sql_conditions) == 1:
                return sql_conditions[0]
            else:
                return f"custom_{'_'.join(sql_conditions)}"
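Running that sanitisation on the bracketed condition from this issue shows what the generated default name would look like (long, but unambiguous between struct keys):

```python
# Stand-alone version of the sanitisation in the proposal above: keep only
# alphanumerics and underscores from a comparison level's SQL condition.
def sanitise(sql_condition: str) -> str:
    return "".join(ch for ch in sql_condition if ch.isalnum() or ch == "_")

name = sanitise("nested_column_l['first_name'] = nested_column_r['first_name']")
print(name)  # nested_column_lfirst_namenested_column_rfirst_name
```

Unlike the bare sqlglot column name, this keeps the 'first_name' key in the output, so a second comparison on 'surname' would produce a distinct gamma column.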

We could also go for using the description as the default, i.e.

_output_column_name = self._comparison_description (which already falls back to _output_column_name if available) OR "take all alphanumerics from the SQL conditions"

I'm writing a wrapper around Splink, so this is how the fix looks in my user-level API (i.e. fixing it outside of Splink):

                comparison_dict={
                    "output_column_name": "".join(
                        ch
                        for ch in cond.sql_condition.replace(" ", "__")
                        if ch.isalnum() or ch == "_"
                    ),
                    "comparison_description": cond.description,

