Comments (8)
Thanks! Sorry for the late answer, I went on vacation.
"sql_condition": "nested_column_l['first_name'] = nested_column_r['first_name']"
does indeed work in PySpark as well.
import splink.spark.comparison_level_library as cll  # imports assumed (Splink 3 API)
from splink.comparison import Comparison
from splink.comparison_level import ComparisonLevel

Comparison(
    comparison_dict={
        "comparison_description": "test",
        "comparison_levels": [
            cll.null_level(col_name="first_name"),
            ComparisonLevel(
                {
                    "label_for_charts": "test",
                    "sql_condition": "nested_column_l['first_name'] = nested_column_r['first_name']",
                },
                sql_dialect="spark",
            ),
            cll.else_level(),
        ],
    }
)
The workaround provided is good enough for me; I'm going to close the ticket now, but if you want to keep track of it (to support dot notation, or maybe add documentation on the issue), feel free to reopen.
Thanks @RobinL!
PS: The other option is pre-processing your data before you bring it into Splink to un-nest your column (i.e. put nested_column.first_name into a column called first_name). <-- This can be hard with arrays; we wanted to avoid this :)
Thanks for the report - I've had a look myself and I can confirm this is broken.
I think there is a workaround that works if you use the alternative syntax:
- FAILS: nested_column.first_name
- WORKS: nested_column['first_name']
Working version (only tested in DuckDB):
from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on, exact_match_rule
from splink.duckdb.comparison_library import (
    exact_match,
)
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.fake_1000
df["nested_column"] = df.apply(
    lambda row: {"first_name": row["first_name"], "surname": row["surname"]}, axis=1
)

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.nested_column['first_name'] = r.nested_column['first_name']",
    ],
    "comparisons": [
        {
            "output_column_name": "nested_column",
            "comparison_levels": [
                {
                    "sql_condition": "nested_column_l['surname'] IS NULL and nested_column_r['surname'] IS NULL",
                    "label_for_charts": "Null",
                    "is_null_level": True,
                },
                {
                    "sql_condition": "nested_column_l['surname'] = nested_column_r['surname']",
                    "label_for_charts": "Exact match",
                },
                {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
            ],
            "comparison_description": "Exact match vs. anything else",
        }
    ],
    "retain_intermediate_calculation_columns": True,
}

linker = DuckDBLinker(df, settings)
df_predict = linker.predict()
The other option is pre-processing your data before you bring it into Splink to un-nest your column (i.e. put nested_column.first_name into a column called first_name).
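For instance, a minimal sketch of that un-nesting step on the pandas reprex data below (the flat column names here are illustrative, not from the original thread):

# Pull each struct field out into its own flat top-level column,
# then drop the nested column before handing the DataFrame to Splink.
df["nested_first_name"] = df["nested_column"].apply(lambda d: d["first_name"])
df["nested_surname"] = df["nested_column"].apply(lambda d: d["surname"])
df = df.drop(columns=["nested_column"])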
Here's a DuckDB-based example that I'm putting here to quickly replicate the problem when we have time to look at fixing this:
Reprex
from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on, exact_match_rule
from splink.duckdb.comparison_library import (
    exact_match,
)
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.fake_1000
df["nested_column"] = df.apply(
    lambda row: {"first_name": row["first_name"], "surname": row["surname"]}, axis=1
)

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.nested_column.first_name = r.nested_column.first_name",
    ],
    "comparisons": [
        {
            "output_column_name": "nested_column",
            "comparison_levels": [
                {
                    "sql_condition": "nested_column_l.surname IS NULL and nested_column_r.surname IS NULL",
                    "label_for_charts": "Null",
                    "is_null_level": True,
                },
                {
                    "sql_condition": "nested_column_l.surname = nested_column_r.surname",
                    "label_for_charts": "Exact match",
                },
                {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
            ],
            "comparison_description": "Exact match vs. anything else",
        }
    ],
    "retain_intermediate_calculation_columns": True,
}

linker = DuckDBLinker(df, settings)
df_predict = linker.predict()
Fundamentally, the bug comes from the ambiguity described in:
https://duckdb.org/docs/sql/data_types/struct.html#dot-notation-order-of-operations
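A quick illustration of that ambiguity, as a sketch against plain DuckDB (table and data are made up):

import duckdb

con = duckdb.connect()
con.sql("CREATE TABLE t AS SELECT {'first_name': 'ann'} AS nested_column")

# Bracket syntax is always a struct field access, so it is unambiguous:
print(con.sql("SELECT nested_column['first_name'] FROM t").fetchall())

# With dot notation, DuckDB first has to decide whether `nested_column`
# is a table alias or a struct column, per the order-of-operations rules
# linked above; once a query is rewritten with table aliases (as Splink
# does internally), the resolution can change.
print(con.sql("SELECT nested_column.first_name FROM t").fetchall())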
Related:
#880
Sorry, reopening it.
It does work when using a single column (because sqlglot detects that column as independent). However, this breaks down the line when the same sqlglot-derived column name is used in multiple comparisons, because of name clashes, i.e.:
Comparison(
    comparison_dict={
        "comparison_description": "test",
        "comparison_levels": [
            ComparisonLevel(
                {
                    "label_for_charts": "test",
                    "sql_condition": "nested_column_l['first_name'] = nested_column_r['first_name']",
                },
                sql_dialect="spark",
            ),
            cll.else_level(),
        ],
    }
),
Comparison(
    comparison_dict={
        "comparison_description": "test2",
        "comparison_levels": [
            ComparisonLevel(
                {
                    "label_for_charts": "test",
                    "sql_condition": "nested_column_l['surname'] = nested_column_r['surname']",
                },
                sql_dialect="spark",
            ),
            cll.else_level(),
        ],
    }
)
Error was: [AMBIGUOUS_REFERENCE] Reference `gamma_nested_column` is ambiguous, could be: [`__splink__df_comparison_vectors`.`gamma_nested_column`, `__splink__df_comparison_vectors`.`gamma_nested_column`].
Log of the breaking SQL:
__splink__df_comparison_vectors AS (
    SELECT
        "unique_id_l",
        "unique_id_r",
        "nested_column_l",
        "nested_column_r",
        CASE
            WHEN nested_column_l['first_name'] = nested_column_r['first_name']
            THEN 1
            ELSE 0
        END AS gamma_nested_column,
        CASE WHEN nested_column_l['surname'] = nested_column_r['surname'] THEN 1 ELSE 0 END AS gamma_nested_column,
        "surname_l",
        "surname_r"
    FROM __splink__df_blocked
)
I've been searching a little in Splink's source code and found get_columns_used_from_sql in parse_sql.py; the issue seems to be there. Brackets are more or less supported, but this use case doesn't seem to be detected.
Also, those column names get propagated almost everywhere in the code under the assumption that they are valid column names for both input and output, it looks like :/
I'm going to try to flatten all those nested structs (i.e. nested_column__surname) in my real model, as this looks too deeply wired into the core (from an unfamiliar, not-really-qualified perspective) to be easily fixed. See the sketch below.
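A sketch of that flattening step in PySpark (illustrative data; the double-underscore naming follows the convention mentioned above):

from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [Row(nested_column=Row(first_name="ann", surname="smith"))]
)

# Promote each struct field to a plain top-level column, then drop the struct.
df_flat = (
    df.withColumn("nested_column__first_name", F.col("nested_column.first_name"))
    .withColumn("nested_column__surname", F.col("nested_column.surname"))
    .drop("nested_column")
)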
Updating again: I managed to get it fixed by setting an explicit "output_column_name" ("comparison1" / "comparison2") on each comparison:
"comparisons": [
    # ctl.name_comparison("first_name")  # works fine
    # ctl.name_comparison("nested_column.first_name"),  # broken
    # ctl.name_comparison("element_at(array_column, 1)")  # also broken
    Comparison(
        comparison_dict={
            "output_column_name": "comparison2",
            "comparison_description": "test",
            "comparison_levels": [
                ComparisonLevel(
                    {
                        "label_for_charts": "test",
                        "sql_condition": "nested_column_l['first_name'] = nested_column_r['first_name']",
                    },
                    sql_dialect="spark",
                ),
                cll.else_level(),
            ],
        }
    ),
    Comparison(
        comparison_dict={
            "output_column_name": "comparison1",
            "comparison_description": "test2",
            "comparison_levels": [
                ComparisonLevel(
                    {
                        "label_for_charts": "test",
                        "sql_condition": "nested_column_l['surname'] = nested_column_r['surname']",
                    },
                    sql_dialect="spark",
                ),
                cll.else_level(),
            ],
        }
    ),
],
The issue is in the parse_sql.py get_columns_used_from_sql function, which is used to infer the "default" name for comparison.py's _output_column_name (based on the columns the user reads). When a column expression contains brackets, like nested_column_r['first_name'], the column is returned as just nested_column, so the bracketed key isn't detected.
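A standalone sqlglot sketch (not Splink's actual code) showing why the bracketed key disappears when collecting column nodes:

import sqlglot
import sqlglot.expressions as exp

sql = "nested_column_l['first_name'] = nested_column_r['first_name']"
tree = sqlglot.parse_one(sql)

# The bracket access parses as a Bracket node wrapping a bare Column node,
# so walking the tree for Column nodes yields only the struct names:
print([c.sql() for c in tree.find_all(exp.Column)])
# ['nested_column_l', 'nested_column_r']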
I'm not sure whether the ticket should remain open or not; the user experience is not good because of this, although workarounds are available. Up to your judgement.
Adding a proposed solution that changes the default to use the SQL conditions rather than the sqlglot column names. Column names become quite long with it, though (but they are only used in debugging?); I'm not sure whether this could cause issues with other non-Spark backends due to column names being too long.
@property
def _output_column_name(self):
    if "output_column_name" in self._comparison_dict:
        return self._comparison_dict["output_column_name"]
    else:
        sql_conditions = []
        for cl in self.comparison_levels:
            # maybe filter by `if cl.is_null_level:`?
            sql_conditions.append(
                "".join(ch for ch in cl.sql_condition if ch.isalnum() or ch == "_")
            )
        if len(sql_conditions) == 1:
            return sql_conditions[0]
        else:
            return f"custom_{'_'.join(sql_conditions)}"
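To show how long these names get, a quick simulation of the proposed default on the two levels used earlier in this thread (plain strings, no Splink objects):

levels = [
    "nested_column_l['first_name'] = nested_column_r['first_name']",
    "ELSE",
]
sql_conditions = [
    "".join(ch for ch in cond if ch.isalnum() or ch == "_") for cond in levels
]
print(f"custom_{'_'.join(sql_conditions)}")
# custom_nested_column_lfirst_namenested_column_rfirst_name_ELSE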
We could also use the description as the default, i.e.
_output_column_name = self._comparison_description (which already falls back to _output_column_name if available), OR "take all alphanumerics from the SQL conditions".
I'm writing a wrapper around Splink, so this is how the fix looks at my user-level API (i.e. fixing it outside of Splink):
comparison_dict = {
    "output_column_name": "".join(
        ch
        for ch in cond.sql_condition.replace(" ", "__")
        if ch.isalnum() or ch == "_"
    ),
    "comparison_description": cond.description,