Comments (3)
Thanks - lots of interesting ideas here.
I'll comment here on the simpler part, because it should be relatively straightforward to implement: allowing the user to fix m or u probabilities on any ComparisonLevel, so that they do not vary during EM training. This would allow some control over 'guiding' EM training.
It also feels like something that should be allowed - we've just never got around to implementing it.
In Splink 4, we have a new and more general syntax for configuring each ComparisonLevel, so it'd look something like this:
cll.ExactMatchLevel("hello").configure(
m_probability=0.9, fix_m_during_training=True
)
{'sql_condition': '"hello_l" = "hello_r"',
'label_for_charts': 'Exact match on hello',
'm_probability': 0.9,
'fix_m_during_training': True}
In terms of where to look in the codebase, you might have worked this out already, but the relevant parts may be:
m_probability
_populate_m_u_from_trained_values
I agree the 'sum to 1' constraint is a potentially fiddly aspect to this!
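To make the fiddly bit concrete, here's a minimal sketch of one way the constraint could be handled: after each M-step, renormalise only the free (trainable) levels so the comparison's m-probabilities sum to 1, while user-fixed levels keep their values. The list-of-dicts representation and the function name `renormalise_m` are illustrative, not Splink internals.

```python
# Hypothetical sketch: keep user-fixed m-probabilities untouched and
# rescale the remaining (trainable) levels to absorb the leftover mass.
def renormalise_m(levels):
    # levels: list of dicts with 'm_probability' and 'fixed' keys (illustrative schema)
    fixed_mass = sum(l["m_probability"] for l in levels if l["fixed"])
    free = [l for l in levels if not l["fixed"]]
    free_mass = sum(l["m_probability"] for l in free)
    remaining = 1.0 - fixed_mass
    for l in free:
        # scale each free level proportionally into the remaining mass
        l["m_probability"] = l["m_probability"] * remaining / free_mass
    return levels

levels = [
    {"m_probability": 0.9, "fixed": True},   # user-fixed level
    {"m_probability": 0.3, "fixed": False},  # raw EM estimates...
    {"m_probability": 0.2, "fixed": False},  # ...to be rescaled
]
renormalise_m(levels)
# free levels become 0.06 and 0.04, so the total is exactly 1.0
```

This keeps the fixed value honest while still letting EM move the other levels; the alternative of trusting the user to supply values that already sum to 1 avoids even this step.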
Thanks - that looks a lot quicker than the old way and makes it easy to extend with new options. It's probably OK to set m-probabilities individually and trust the user to make them sum to 1 - I'm not sure whether violating that would have any effect on EM.
How about a similar ability to set generic properties for the Comparison configurator? For example, a Dirichlet prior could be set this way... then all it would take for the Bayesian approach would be a few changes to compute_new_parameters_sql
to implement the conjugate posterior updating rule (just add some extra "virtual observation" counts - i.e. the corresponding prior parameter - to each comparison level with a CASE statement).
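To illustrate the arithmetic the updating rule would do, here's a hedged sketch of the conjugate (Dirichlet-multinomial) update using the posterior-mean form: add each level's prior parameter as "virtual observations" to the expected counts from the E-step, then normalise. The function `map_update_m` and its signature are hypothetical - this is not Splink's actual compute_new_parameters_sql, just the computation a modified version could perform.

```python
# Hypothetical sketch of the Dirichlet "virtual observation" update:
# pseudo-counts are the E-step expected counts plus the prior parameters,
# and the new m-probabilities are the normalised pseudo-counts.
def map_update_m(expected_counts, prior_alphas):
    # expected_counts[k]: expected number of matched pairs in level k (E-step)
    # prior_alphas[k]: Dirichlet prior parameter ("virtual observations") for level k
    pseudo = [c + a for c, a in zip(expected_counts, prior_alphas)]
    total = sum(pseudo)
    return [p / total for p in pseudo]

# e.g. a mild prior pulling the exact-match level towards a higher m-probability
new_m = map_update_m([80.0, 15.0, 5.0], [5.0, 1.0, 1.0])
```

In SQL this would indeed amount to a CASE statement (or a join against a small prior table) adding the per-level alpha before the existing normalisation.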
I came across an alternative that may also be useful and should be relatively easy to implement. The approach is in an old paper by Winkler, Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage (1993).
The basic idea is to allow the user to specify a set of convex constraints on the parameters to be estimated by EM. While the idea was driven by experiments with non-independent (e.g. log-linear) models, the constraints could also be applied to simpler independent models. An example of a convex constraint that could be easily specified, and which would be useful, is a set of linear inequalities on the m-probabilities; for example, requiring the "all others" probabilities for some comparisons to be < 20%.
The constraint is enforced during each round of EM by checking that the new estimate for each parameter lies in the allowed region. If not, we choose a point on the line segment connecting it to the previous estimate; for the likelihoods under consideration, a theorem guarantees that the likelihood at any point on the line segment will be greater than that of the previous estimate. For simplicity, we can choose the point where the line segment meets the boundary of the allowed region. Since the likelihood still increases, EM should still converge under this process.
The constraint checking and line search for the boundary of the region could be done pretty easily in Python in the EM loop.
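As a rough sketch of that step (function names and the constraint callback are illustrative, not Splink APIs): if the raw EM update leaves the allowed region, bisect along the segment back towards the previous (feasible) estimate until reaching the boundary. Any point on this segment still increases the likelihood per Winkler's result.

```python
# Hypothetical sketch of the constraint-projection step inside the EM loop.
# `in_region` encodes the user's convex constraints as a feasibility check;
# `prev` must be a feasible previous estimate.
def project_to_region(prev, new, in_region, tol=1e-9):
    if in_region(new):
        return new  # raw EM update already satisfies the constraints
    lo, hi = 0.0, 1.0  # fraction of the step from prev towards new
    while hi - lo > tol:
        mid = (lo + hi) / 2
        point = [p + mid * (n - p) for p, n in zip(prev, new)]
        if in_region(point):
            lo = mid  # feasible: move further towards the raw update
        else:
            hi = mid  # infeasible: pull back towards prev
    return [p + lo * (n - p) for p, n in zip(prev, new)]

# Example constraint: the "all others" level's m-probability must stay below 0.2
constraint = lambda m: m[-1] < 0.2
prev = [0.7, 0.2, 0.1]   # feasible previous estimate
new = [0.5, 0.2, 0.3]    # infeasible raw EM update
clipped = project_to_region(prev, new, constraint)
# clipped[-1] lands just inside the boundary, i.e. just below 0.2
```

Since the constraints are convex and `prev` is feasible, the feasible portion of the segment is contiguous, so bisection finds the boundary point reliably.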