matchit's Introduction

matchit

Stata ADO that matches two columns or two datasets based on similar text patterns

Syntax

Data in two columns in the same dataset

 matchit varname1 varname2 [, options]

Data in two different datasets (with indexation)

 matchit idmaster txtmaster using filename.dta , idusing(varname) txtusing(varname) [options]

Options

similmethod(simfcn)

String matching method. Default is bigram. Other typical built-in simfcn are: ngram, ngram_circ, token, soundex and token_soundex.

score(scrfcn)

Specifies similarity score. Default is jaccard. Other built-in options are simple and minsimple.

weights(wgtfcn)

Weighting transformation. Default is noweights. Other built-in options are simple, log and root.

generate(varname)

Specifies the name for the similarity score variable. Default is similscore.

Required options

Two datasets setup:

idmaster

Numeric varname from the current file (masterfile).
It need not uniquely identify observations in the masterfile, although this is recommended.

txtmaster

String varname from the current file (masterfile), which will be matched to txtusing.

using filename

Name (and path) of the Stata file to be matched (usingfile).

idusing(varname)

Numeric varname from the usingfile.
It need not uniquely identify observations in the usingfile, although this is recommended.

txtusing(varname)

String varname from the usingfile, which will be matched to txtmaster.

Advanced options

wgtfile(filename)

Loads weights from a Stata file instead of computing them from the current dataset (and the using dataset, in the two-dataset setup). By default, no weights are loaded.

time

Outputs time stamps during execution. Intended for benchmarking purposes.

flag(step)

Controls how often matchit reports back to the output screen.
Only really useful for optimizing indexation by trying different simfcn.
Default is step = 20 (percent).

Advanced options only for two datasets syntax:

threshold(num)

Lowest similarity score to be kept in the final results. Default is num = .5.

override

Ignores unsaved data warning.

diagnose

Reports a preliminary analysis about indexation.
To be used for optimizing indexation by cleaning original data and trying different simfcn.

stopwordsauto

Generates a list of stopwords automatically.
This improves indexation speed but may cause some potential matches to be ignored.

swthreshold(grams-per-observation)

Only valid with stopwordsauto.
It sets the threshold of grams per observation to be included in the stopwords list.
Default is grams-per-observation = .2.
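The option text above does not spell out the rule, so the following Python sketch shows only one possible reading of it: a gram is treated as a stopword when its number of occurrences divided by the number of observations exceeds the threshold. The function name auto_stopwords, the use of bigrams, and the rule itself are assumptions made for illustration, not matchit's actual implementation.

```python
from collections import Counter

def auto_stopwords(texts, swthreshold=0.2):
    """Return grams whose occurrences per observation exceed swthreshold.

    ASSUMPTION: this rule is a guess based on the option description only.
    """
    counts = Counter()
    for text in texts:
        # Count every moving two-character gram in each observation.
        counts.update(text[i:i + 2] for i in range(len(text) - 1))
    n_obs = len(texts)
    return {gram for gram, c in counts.items() if c / n_obs > swthreshold}

names = ["alpha llc", "beta llc", "gamma llc", "delta corp", "omega inc"]
stop = auto_stopwords(names, swthreshold=0.5)
print(sorted(stop))  # grams that come from "llc", plus the frequent "a "
```

Under this reading, the grams of a frequent suffix such as "llc" are dropped from the index, but so is any other frequent gram (here "a "), which is consistent with the warning that potential matches may be ignored.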

Description

matchit provides a similarity score between two text strings using one of several string-based matching techniques. It returns a new numeric variable (similscore) containing the similarity score, which ranges from 0 to 1. A similscore of 1 implies perfect similarity according to the string matching technique chosen, and the score decreases as the two strings become less similar. similscore is a relative measure which can (and often does) change depending on the technique chosen. For more information on these techniques, refer to Raffo & Lhuillery (2009).

The two string variables can be from the same dataset or from two different ones. The latter option makes matchit a convenient tool for joining observations when the string variables are not exactly the same. In other words, it allows the dataset currently in memory (called the master dataset) to be matched with filename.dta (called the using dataset) by means of a fuzzy similarity between string variables of each dataset. In this case, matchit returns a new dataset containing five variables: two from the master dataset (idmaster and txtmaster), two from the using dataset (idusing and txtusing) and the already mentioned similarity score (similscore).
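To make the shape of the two-dataset result concrete, here is a minimal Python sketch of a fuzzy join that returns the same five columns. It is not matchit's implementation: it compares every pair by brute force instead of using an index, and it uses the textbook Jaccard index on bigram sets as a stand-in score. The function names and the (id, text) tuple layout are hypothetical.

```python
def bigrams(text):
    """Set of moving two-character grams in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

def jaccard(a, b):
    """Textbook Jaccard index on bigram sets: |A & B| / |A | B|."""
    ga, gb = bigrams(a), bigrams(b)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0

def fuzzy_join(master, using, threshold=0.5):
    """Pair every master row with every using row; keep scores >= threshold.

    master and using are lists of (id, text) tuples (hypothetical layout).
    Returns rows of (idmaster, txtmaster, idusing, txtusing, similscore).
    """
    rows = []
    for idm, txtm in master:
        for idu, txtu in using:
            score = jaccard(txtm, txtu)
            if score >= threshold:
                rows.append((idm, txtm, idu, txtu, round(score, 4)))
    return rows

master = [(1, "john smith"), (2, "acme corp")]
using = [(10, "smith john"), (20, "acme corporation"), (30, "zebra ltd")]
result = fuzzy_join(master, using)
for row in result:
    print(row)
```

With these toy inputs, only the pairs (1, 10) and (2, 20) score at or above the default threshold of .5, so "zebra ltd" does not appear in the output.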

matchit is particularly useful in two cases: (1) when the two columns/datasets have different patterns for the same string data (e.g. individual or firm names, addresses, etc.); and (2) when one of the datasets is considerably large and was fed by different sources, so that it is not uniformly formatted (e.g. names or addresses in different orders). Joining data in cases like these may lead to several false negatives when using merge or similar commands.

matchit is intended to overcome these kinds of problems without engaging in extensive data cleaning or correction efforts. Take, for instance, a case like (1) where one dataset contains first and last names in separate fields, while the other one has just a fullname field. matchit lets you join the two datasets by simply combining the two fields of the first dataset, without caring about the order of first and last names or about missing middle names. Similarly, a typical example of (2) is a large dataset containing addresses entered as free text by different people. Using matchit, you can join them with a more standardized source without caring whether the zip or state codes were added systematically or not.
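The order-insensitivity described for case (1) can be illustrated with a short Python sketch that compares names as sets of whitespace-separated tokens. This is the textbook Jaccard index applied to tokens, shown under the assumption that it behaves similarly in spirit to a token-based simfcn. It is not matchit's code, and the function name token_jaccard is hypothetical.

```python
def token_jaccard(a, b):
    """Jaccard index on whitespace-separated tokens (order is ignored)."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

# Combine separate first and last name fields into one string.
first, last = "John", "Smith"
combined = f"{first} {last}"

same_tokens = token_jaccard(combined, "Smith John")
extra_middle = token_jaccard(combined, "Smith John Andrew")
print(same_tokens, extra_middle)
```

Swapping the name order leaves the score at 1, while the extra middle name lowers it to 2/3 without removing the match altogether.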

Please note that matchit is case-sensitive. It also takes into account all other symbols (as far as Stata does). While data cleaning is not needed to use matchit, it often improves the similarity scores and, in consequence, the overall quality of the matching exercise. However, too much data cleaning might remove relevant information, reducing quality by introducing false positives.
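As an illustration of light cleaning before matching, the following Python sketch lowercases text, removes punctuation and collapses whitespace. The clean function is hypothetical and only shows the idea. In practice the same steps would be done on the string variables in Stata before calling matchit.

```python
import re

def clean(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)       # remove punctuation symbols
    return re.sub(r"\s+", " ", text).strip()  # normalize spacing

raw_a, raw_b = "ACME, Inc.", "  acme   inc "
print(raw_a == raw_b)                # the raw strings differ
print(clean(raw_a) == clean(raw_b))  # the cleaned strings coincide
```

Going further, for example by deleting every legal-form word, would make distinct firms look identical, which is the false-positive risk mentioned above.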

matchit requires freqindex to be installed when computing weights.

matchit's People

Contributors

julioraffo

matchit's Issues

[QUESTION] bigram matching on strings of mixed length

Hi Julio, hope all is well! Your program has been immensely helpful in our ongoing study. I've been thinking of alternative implementations, particularly partial matching, and so I've begun a loose replication of your efforts in Python. When recreating your base case of bigram matching using Jaccard similarity, I noticed my scores can often vary slightly when it comes to comparing strings of mixed lengths.

My question is, does the order of the bigrams matter, or is it only important that they are properly matched (e.g. "jo" in "john smith" vs "smith john")? If the latter, how is a case like "jojohn smith" vs "smith john" treated, where there are two of the same bigrams in one string of different length? Can duplicate grams be matched to the same gram of another string? I would assume no, but controlling for this seems to give me differing results.

Using my earlier example, you have the following sets of moving bigrams:

jo oj jo oh hn n_ _s sm mi it th
sm mi it th h_ _j jo oh hn

If we leave the duplicate "jo" but assume it is unmatched, then we match a total of 7 grams, over sets of length 9 and 11 grams. That should give us 7/SQRT(9*11) or 70.35%.

If we remove the duplicate "jo" completely, then we get 7/SQRT(9*10) or 73.79%. I see your stock algo returns a score of 73.96%, which seems like a small difference in the latter case, but I've observed a few situations where the difference is more than a few p.p.

Is there something here that I'm missing?
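The two calculations in the question can be reproduced with a short Python sketch. It only mirrors the poster's arithmetic (matched grams divided by the square root of the product of the gram counts) and makes no claim about how matchit itself computes the score. The function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def bigrams(text):
    """List of moving two-character grams, duplicates included."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def score_keep_duplicates(a, b):
    """Duplicates stay in the counts, but each gram matches at most once."""
    ga, gb = Counter(bigrams(a)), Counter(bigrams(b))
    matched = sum((ga & gb).values())  # multiset intersection
    return matched / sqrt(sum(ga.values()) * sum(gb.values()))

def score_drop_duplicates(a, b):
    """Duplicates are removed before counting and matching."""
    sa, sb = set(bigrams(a)), set(bigrams(b))
    return len(sa & sb) / sqrt(len(sa) * len(sb))

a, b = "jojohn smith", "smith john"
print(round(score_keep_duplicates(a, b), 4))  # 7 / sqrt(9 * 11)
print(round(score_drop_duplicates(a, b), 4))  # 7 / sqrt(9 * 10)
```

Both variants match 7 grams and differ only in whether the repeated "jo" counts toward the length of the longer string.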

Question: simple example of stopwordsauto

Howdy,

Thank you for the great package. I think the help files are very clear with one exception. I do not really understand what stopwordsauto is doing. I also do not understand what swthreshold does either.

Question: could you give a practical example?

What I think / want it to be doing: I have two columns of company names I want to match (simple bilateral matching with one dataset). Many companies have "LLC" or "CORP" or "CORPORATION", and these parts of the name string are essentially useless to me (they do not actually help identify matches). I believe that stopwordsauto is seeing that there are so many of these strings -- e.g., "LLC" and "CORP" -- and ignoring these strings. More specifically, it is as if matchit is deleting "LLC" and "CORP" from every string in which they are present and then doing a standard matchit as if I did not specify stopwordsauto. The caveat is that it might also be grabbing information that is useful; e.g., suppose there are many companies with USA in the title and USA is being removed, or even just common letter groups such as "ING" or "TION".

I do not even have a guess what swthreshold is doing... Quote: It sets the threshold of grams per observation to be included in the stopwords list. Default is grams-per-observation = .2.

Thank you for your time,
Best,
Luke
