Comments (3)
@RobinL, I've quickly drafted up a potential solution to address this issue, available for review here.
The proposed changes include:
- Removal of Templated Names from the Cache: For example, the name `__splink__df_concat_with_tf` will no longer appear in the cache.
- Addition of a Method for Templated Name Checks: This method sequentially checks a list of provided templated names. For example, if you search for both `tf_first_name` and `concat_with_tf`, `__splink__df_tf_first_name` is returned immediately if found; otherwise we search for `__splink__df_concat_with_tf` and return it if found. This functionality is only intended for use with term frequency tables, to determine whether they have already been generated without requiring any SQL hashing.
- Term Frequency Table Registration Changes: Given the changes above, table registration will now overwrite the originally registered tf table, i.e. if `__splink__df_concat_with_tf` already exists (with any given hashed name), we will now overwrite this table with the table being registered.
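As a rough illustration of the sequential templated-name check described in the list above, here is a minimal sketch. It is not Splink's actual implementation: `CachedTable`, `TableCache` and `first_with_templated_name` are hypothetical names, and the hashed suffixes are made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass
class CachedTable:
    """Hypothetical cache entry: a physical (hashed) name plus its templated name."""

    physical_name: str
    templated_name: str


class TableCache:
    """Hypothetical cache keyed by physical name only (no templated-name keys)."""

    def __init__(self) -> None:
        self._tables: Dict[str, CachedTable] = {}

    def add(self, table: CachedTable) -> None:
        self._tables[table.physical_name] = table

    def first_with_templated_name(
        self, templated_names: Sequence[str]
    ) -> Optional[CachedTable]:
        # Check each templated name in the order given and return the
        # first cached table that matches, without hashing any SQL.
        for templated_name in templated_names:
            for table in self._tables.values():
                if table.templated_name == templated_name:
                    return table
        return None


lookup_order = ["__splink__df_tf_first_name", "__splink__df_concat_with_tf"]

cache = TableCache()
cache.add(CachedTable("__splink__df_concat_with_tf_abc123", "__splink__df_concat_with_tf"))

# Only the concat table exists, so the lookup falls through to it.
found_before = cache.first_with_templated_name(lookup_order)

# Once the column-level tf table exists, it takes priority.
cache.add(CachedTable("__splink__df_tf_first_name_def456", "__splink__df_tf_first_name"))
found_after = cache.first_with_templated_name(lookup_order)
```

The order of `lookup_order` encodes the priority: the column-level table is preferred, and the concat table is only a fallback.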
It's important to note that this update is based on the assumption (which needs verification and may be incorrect) that in any single Splink session, only one term frequency table should be assigned per column or input nodes table.
Consequently, instead of generating two tables when running `compute_tf_table` or `initialise_df_concat_with_tf`, a single table will now be generated and its corresponding templated name will be used for cache references.
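The "one table per templated name" behaviour could look something like the sketch below. It relies on the same unverified assumption stated above (a single term frequency table per column or input nodes table per session); `register_tf_table` and the hashed names are hypothetical and only model the cache bookkeeping, not the underlying database tables.

```python
from typing import Dict


def register_tf_table(
    cache: Dict[str, str], templated_name: str, physical_name: str
) -> None:
    """Register a table, replacing any entry that shares its templated name.

    `cache` maps physical (hashed) table names to templated names.
    """
    # Find entries already registered under this templated name,
    # whatever their hashed name happens to be.
    stale = [name for name, templated in cache.items() if templated == templated_name]
    for name in stale:
        del cache[name]
    cache[physical_name] = templated_name


cache: Dict[str, str] = {}
register_tf_table(cache, "__splink__df_concat_with_tf", "__splink__df_concat_with_tf_111")
register_tf_table(cache, "__splink__df_tf_first_name", "__splink__df_tf_first_name_aaa")

# Registering again under the same templated name overwrites the first entry.
register_tf_table(cache, "__splink__df_concat_with_tf", "__splink__df_concat_with_tf_222")
```

After the third call the cache holds exactly one entry per templated name, which is what lets a templated name act as an unambiguous cache reference.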
Term Frequency Caching Demo
Replace `compute_tf_table(...)` with `_initialise_df_concat_with_tf` to check the input nodes version.
from splink.duckdb.duckdb_linker import DuckDBLinker
import pandas as pd
from tests.basic_settings import get_settings_dict

df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")
linker = DuckDBLinker(
    df,
    get_settings_dict(),
)

# Generates our primary tf table for 'first_name'
tf_table = linker.compute_tf_table("first_name").as_pandas_dataframe()
linker.register_term_frequency_lookup(tf_table, "first_name")  # errors
linker.register_term_frequency_lookup(tf_table, "first_name", overwrite=True)  # registers
Link to our discussion about this so it doesn't get lost:
https://asdslack.slack.com/archives/D02TEUPB6H3/p1700482818591709
Closing as this is stale and we are content with the current caching methodology.