
Comments (3)

RobinL commented on June 20, 2024

Thanks - lots of interesting ideas here.

I'll comment here on the simpler part, because it should be relatively straightforward to implement: allowing the user to fix m or u probabilities on any ComparisonLevel so that they do not vary during EM training. That would give the user some control over 'guiding' EM training.

It also feels like something that should be allowed - we've just never got around to implementing it.

In Splink 4, we have a new and more general syntax for configuring each ComparisonLevel, so it'd look something like this:

cll.ExactMatchLevel("hello").configure(
    m_probability=0.9, fix_m_during_training=True
)
{'sql_condition': '"hello_l" = "hello_r"',
 'label_for_charts': 'Exact match on hello',
 'm_probability': 0.9,
 'fix_m_during_training': True}

In terms of where to look in the codebase, you may have worked this out already, but relevant parts may be:
m_probability
_populate_m_u_from_trained_values

I agree the 'sum to 1' constraint is a potentially fiddly aspect to this!
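To make the 'sum to 1' issue concrete, here is a minimal sketch of one possible way to handle it: keep the fixed m-probabilities untouched and rescale the remaining levels so the total is still 1. The function name `renormalise_with_fixed` and the dict-based layout are assumptions for illustration only; this is not how Splink currently behaves.

```python
def renormalise_with_fixed(estimates, fixed):
    """Combine fresh M-step estimates with user-fixed m-probabilities.

    estimates: dict mapping level name -> newly estimated m-probability
    fixed:     dict mapping level name -> value the user wants held constant
    Returns a dict over all levels whose values sum to 1.
    (Hypothetical helper, not part of Splink.)
    """
    fixed_mass = sum(fixed.values())
    if not 0.0 <= fixed_mass <= 1.0:
        raise ValueError("fixed m-probabilities must sum to at most 1")

    # Levels that EM is still allowed to update
    free = {k: v for k, v in estimates.items() if k not in fixed}
    free_total = sum(free.values())
    remaining = 1.0 - fixed_mass

    result = dict(fixed)
    for level, value in free.items():
        if free_total > 0:
            # Share the remaining mass in proportion to the new estimates
            result[level] = remaining * value / free_total
        else:
            # Degenerate case: spread the remaining mass evenly
            result[level] = remaining / len(free)
    return result


new_m = renormalise_with_fixed(
    {"exact": 0.7, "fuzzy": 0.2, "else": 0.1},
    fixed={"exact": 0.9},
)
```

Here the fixed level stays at 0.9 and the other two levels share the remaining 0.1 in the same 2:1 ratio as their unconstrained estimates. Whether rescaling like this is preferable to simply trusting the user is an open design choice.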


samkodes commented on June 20, 2024

Thanks - looks a lot quicker than the old way and makes it easy to extend with new options. It's probably OK to set m-probabilities individually and trust the user to make them sum to 1 - I'm not sure whether violating that would have any effect on EM.

How about a similar ability to set generic properties for the Comparison configurator? For example, a Dirichlet prior could be set this way... then all it would take for the Bayesian approach would be a few changes to compute_new_parameters_sql to implement the conjugate posterior updating rule (just add some extra "virtual observation" counts - i.e. the corresponding prior parameter - to each comparison level with a CASE statement).
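As a rough illustration of the "virtual observation" idea, the sketch below adds the Dirichlet prior parameters as pseudo-counts to the expected counts for each comparison level before normalising. It is plain Python rather than the SQL in compute_new_parameters_sql, and the function name and inputs are assumptions. Note that a strict MAP estimate would add alpha - 1 rather than alpha; this version follows the "add the prior parameter" description above.

```python
def m_step_with_prior(expected_counts, prior_counts):
    """Update m-probabilities for one comparison using pseudo-counts.

    expected_counts: dict level -> expected number of matches in that level
                     (the sum of match probabilities from the E-step)
    prior_counts:    dict level -> Dirichlet prior parameter for that level,
                     treated here as extra "virtual observations"
    (Hypothetical helper, not part of Splink.)
    """
    smoothed = {
        level: count + prior_counts.get(level, 0.0)
        for level, count in expected_counts.items()
    }
    total = sum(smoothed.values())
    if total <= 0:
        raise ValueError("need a positive total count to normalise")
    # Normalising the smoothed counts guarantees the result sums to 1
    return {level: value / total for level, value in smoothed.items()}


posterior_m = m_step_with_prior(
    expected_counts={"exact": 80.0, "else": 20.0},
    prior_counts={"exact": 10.0, "else": 10.0},
)
```

With 100 expected matches and 20 virtual observations split evenly, the estimate for "exact" moves from 0.8 to 90/120 = 0.75. Larger prior counts would pull the estimate more strongly toward the prior.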


samkodes commented on June 20, 2024

I came across an alternative that may also be useful and should be relatively easy to implement. The approach is in an old paper by Winkler, Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage (1993).

The basic idea is to allow the user to specify a set of convex constraints on the parameters to be estimated by EM. While the idea was driven by experiments with non-independent (e.g. loglinear) models, the constraints could also be applied to simpler independent models. An example of a convex constraint that could be easily specified and which would be useful is a set of linear inequalities on the m-probabilities; for example, requiring the "all others" probabilities for some comparisons to be < 20%.

The constraint is enforced during each round of EM by checking that the new estimate for each parameter is in the allowed region. If not, we choose a point on the line segment connecting it and the previous estimate; for the likelihoods under consideration, a theorem guarantees that the likelihood at any point on the line segment will be greater than at the previous estimate. For simplicity, we can choose the point where the line segment meets the boundary of the allowed region. Since the likelihood has increased, the EM should still converge with this process.

The constraint checking and line search for the boundary of the region could be done pretty easily in Python in the EM loop.
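A minimal sketch of that check-and-pull-back step, assuming the previous estimate is already feasible and the allowed region is convex, so a simple bisection along the segment finds the boundary. The names `pull_back_to_feasible` and `is_feasible` are made up for illustration, and the example uses <= 0.2 rather than a strict inequality so that the boundary point itself is allowed.

```python
def pull_back_to_feasible(prev, new, is_feasible, tol=1e-9):
    """Return `new` if it is feasible, else the furthest feasible point
    (up to `tol`) on the segment from `prev` towards `new`.

    prev, new:   sequences of parameter values (e.g. m-probabilities)
    is_feasible: callable taking a list of values and returning True/False;
                 the feasible region is assumed to be convex
    (Hypothetical helper, not part of Splink.)
    """
    if is_feasible(list(new)):
        return list(new)
    if not is_feasible(list(prev)):
        raise ValueError("previous estimate must satisfy the constraints")

    def point_at(t):
        # Point a fraction t of the way from prev to new
        return [p + t * (n - p) for p, n in zip(prev, new)]

    lo, hi = 0.0, 1.0  # lo is always feasible, hi is always infeasible
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if is_feasible(point_at(mid)):
            lo = mid
        else:
            hi = mid
    return point_at(lo)


# Levels: [exact, fuzzy, all others]; require "all others" <= 0.2
constrained = pull_back_to_feasible(
    prev=[0.8, 0.1, 0.1],
    new=[0.5, 0.2, 0.3],
    is_feasible=lambda m: m[2] <= 0.2,
)
```

Because both endpoints sum to 1, every point on the segment does too, so the pulled-back estimate remains a valid set of m-probabilities. For purely linear inequalities the boundary point could be computed in closed form instead of by bisection.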
