Comments (3)
Thanks - lots of interesting ideas here.
I'll comment here on the simpler part, because it should be relatively straightforward to implement: allowing the user to fix m or u probabilities on any ComparisonLevel, so that they do not vary during EM training. This would allow some control over 'guiding' EM training.
It also feels like something that should be allowed - we've just never got around to implementing it.
In Splink 4, we have a new and more general syntax for configuring each ComparisonLevel, so it'd look something like this:
cll.ExactMatchLevel("hello").configure(
m_probability=0.9, fix_m_during_training=True
)
{'sql_condition': '"hello_l" = "hello_r"',
'label_for_charts': 'Exact match on hello',
'm_probability': 0.9,
'fix_m_during_training': True}
In terms of where to look in the codebase, you might have worked this out already, but the relevant parts may be:
m_probability
_populate_m_u_from_trained_values
I agree the 'sum to 1' constraint is a potentially fiddly aspect to this!
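To make the fiddly bit concrete, here's a minimal sketch of one way the constraint could be handled: after each M-step, renormalise only the free (trainable) levels so the comparison's m-probabilities sum to 1, while user-fixed levels keep their values. The list-of-dicts representation and the function name `renormalise_m` are illustrative, not Splink internals.

```python
# Hypothetical sketch: keep user-fixed m-probabilities untouched and
# rescale the remaining (trainable) levels to absorb the leftover mass.
def renormalise_m(levels):
    # levels: list of dicts with 'm_probability' and 'fixed' keys (illustrative schema)
    fixed_mass = sum(l["m_probability"] for l in levels if l["fixed"])
    free = [l for l in levels if not l["fixed"]]
    free_mass = sum(l["m_probability"] for l in free)
    remaining = 1.0 - fixed_mass
    for l in free:
        # scale each free level proportionally into the remaining mass
        l["m_probability"] = l["m_probability"] * remaining / free_mass
    return levels

levels = [
    {"m_probability": 0.9, "fixed": True},   # user-fixed level
    {"m_probability": 0.3, "fixed": False},  # raw EM estimates...
    {"m_probability": 0.2, "fixed": False},  # ...to be rescaled
]
renormalise_m(levels)
# free levels become 0.06 and 0.04, so the total is exactly 1.0
```

This keeps the fixed value honest while still letting EM move the other levels; the alternative of trusting the user to supply values that already sum to 1 avoids even this step.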
Thanks - that looks a lot quicker than the old way and makes it easy to extend with new options. It's probably OK to set m-probabilities individually and trust the user to make them sum to 1 - I'm not sure whether violating that would have any effect on EM.
How about a similar ability to set generic properties for the Comparison configurator? For example, a Dirichlet prior could be set this way... then all it would take for the Bayesian approach would be a few changes to compute_new_parameters_sql
to implement the conjugate posterior updating rule (just add some extra "virtual observation" counts - i.e. the corresponding prior parameter - to each comparison level with a CASE statement).
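To illustrate the arithmetic the updating rule would do, here's a hedged sketch of the conjugate (Dirichlet-multinomial) update using the posterior-mean form: add each level's prior parameter as "virtual observations" to the expected counts from the E-step, then normalise. The function `map_update_m` and its signature are hypothetical - this is not Splink's actual compute_new_parameters_sql, just the computation a modified version could perform.

```python
# Hypothetical sketch of the Dirichlet "virtual observation" update:
# pseudo-counts are the E-step expected counts plus the prior parameters,
# and the new m-probabilities are the normalised pseudo-counts.
def map_update_m(expected_counts, prior_alphas):
    # expected_counts[k]: expected number of matched pairs in level k (E-step)
    # prior_alphas[k]: Dirichlet prior parameter ("virtual observations") for level k
    pseudo = [c + a for c, a in zip(expected_counts, prior_alphas)]
    total = sum(pseudo)
    return [p / total for p in pseudo]

# e.g. a mild prior pulling the exact-match level towards a higher m-probability
new_m = map_update_m([80.0, 15.0, 5.0], [5.0, 1.0, 1.0])
```

In SQL this would indeed amount to a CASE statement (or a join against a small prior table) adding the per-level alpha before the existing normalisation.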
I came across an alternative that may also be useful and should be relatively easy to implement. The approach is in an old paper by Winkler, Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage (1993).
The basic idea is to allow the user to specify a set of convex constraints on the parameters to be estimated by EM. While the idea was driven by experiments with non-independent (e.g. log-linear) models, the constraints could also be applied to simpler independent models. An example of a convex constraint that could be easily specified, and which would be useful, is a set of linear inequalities on the m-probabilities; for example, requiring the "all others" probabilities for some comparisons to be < 20%.
The constraint is enforced during each round of EM by checking that the new estimate for each parameter lies in the allowed region. If not, we choose a point on the line segment connecting it to the previous estimate; for the likelihoods under consideration, a theorem guarantees that the likelihood at any point on the line segment will be greater than that of the previous estimate. For simplicity, we can choose the point where the line segment meets the boundary of the allowed region. Since the likelihood still increases, EM should still converge under this process.
The constraint checking and line search for the boundary of the region could be done pretty easily in Python in the EM loop.
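As a rough sketch of that step (function names and the constraint callback are illustrative, not Splink APIs): if the raw EM update leaves the allowed region, bisect along the segment back towards the previous (feasible) estimate until reaching the boundary. Any point on this segment still increases the likelihood per Winkler's result.

```python
# Hypothetical sketch of the constraint-projection step inside the EM loop.
# `in_region` encodes the user's convex constraints as a feasibility check;
# `prev` must be a feasible previous estimate.
def project_to_region(prev, new, in_region, tol=1e-9):
    if in_region(new):
        return new  # raw EM update already satisfies the constraints
    lo, hi = 0.0, 1.0  # fraction of the step from prev towards new
    while hi - lo > tol:
        mid = (lo + hi) / 2
        point = [p + mid * (n - p) for p, n in zip(prev, new)]
        if in_region(point):
            lo = mid  # feasible: move further towards the raw update
        else:
            hi = mid  # infeasible: pull back towards prev
    return [p + lo * (n - p) for p, n in zip(prev, new)]

# Example constraint: the "all others" level's m-probability must stay below 0.2
constraint = lambda m: m[-1] < 0.2
prev = [0.7, 0.2, 0.1]   # feasible previous estimate
new = [0.5, 0.2, 0.3]    # infeasible raw EM update
clipped = project_to_region(prev, new, constraint)
# clipped[-1] lands just inside the boundary, i.e. just below 0.2
```

Since the constraints are convex and `prev` is feasible, the feasible portion of the segment is contiguous, so bisection finds the boundary point reliably.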