
fastlink's People

Contributors

bfifield, kosukeimai, tedenamorado, tpaskhalis, weekend-warrior


fastlink's Issues

Custom/adjusted string comparison functions

Greatly appreciate the work on this package. Our data deals with a wide diversity of names (Hispanic, Asian, etc.), and we've found that the string distance methods included with fastLink have occasional issues:

  • Masculine and feminine versions of names being matched, like "Francisco" and "Francisca"
  • Generation pairs being matched, like "Willie Jr" and "Willie" (although this may potentially be resolved by proper data cleaning)
  • Transposed name orders being difficult to match, which is not uncommon with Asian names where the family name comes first

The first case is especially troublesome, since Jaro-Winkler tends to emphasize the initial characters and the edit distance is very small. Is it possible to implement custom string comparison functions, or to adjust the current default options to account for these cases? It would also help to have a name reweighting option for last names, since we could downweight the posterior matches of very common last names and reduce false positives. Thank you!
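To make the first point concrete, here is a base-R sketch (using adist(), not fastLink's internals) showing that a masculine/feminine pair sits at the same edit distance as a genuine one-character typo:

```r
## Base R illustration: "francisca" is edit distance 1 from "francisco",
## exactly like a true misspelling, so generic string distances cannot
## separate gender variants from typos.
d_gender <- adist("francisco", "francisca")[1, 1]
d_typo   <- adist("francisco", "fransisco")[1, 1]
d_gender  # 1
d_typo    # 1
```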

NA values create error in getMatches()

I got an error when running getMatches(). It was caused by this line of code in getMatches():

dfA$dedupe.ids <- out_df$id_2

The above line can't run when the number of rows in dfA is different from the number of rows in out_df.

I traced back the cause of the error and found that there were NAs in fl.out$posterior. Digging further, I found an NA value in one of the columns of dfA.

So, I speculate the NA in the input data resulted in NAs in fl.out$posterior, and then they failed getMatches().

I suggest the authors consider adding a warning message to fastLink(), or replacing NA values in the input data with non-NA values such as empty strings.
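A minimal sketch of the second workaround, assuming the matching columns are character vectors (the data and column names here are made up):

```r
## Replace NA values with empty strings before calling fastLink()
## (illustrative data; real inputs would be the user's dfA/dfB).
dfA <- data.frame(first = c("anna", NA), last = c("lee", "kim"),
                  stringsAsFactors = FALSE)
dfA[is.na(dfA)] <- ""   # logical-matrix indexing hits every cell
anyNA(dfA)  # FALSE
```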

question / documentation

Hi, this package seems great and much needed.

I had two questions whose answers could strengthen the documentation / articles / vignettes.

  1. It seems like the workflow is designed to separate the 'blocking' step from the later 'estimation' step.
    Is this correct?

This would be a point to highlight as a strength.

One pain point in using the other package RecordLinkage
(https://cran.r-project.org/web/packages/RecordLinkage/index.html)
is that the blocking step was tied to the estimation step in a single overarching function.
When I was using that package, I wished it were broken out into two functions:

form_blocks()
estimate_params()

  2. Are the performance metrics output by the summary() method 'conditional' or 'unconditional'?

That is, do the metrics apply to cases after blocking, or are they unconditional on the blocking?

"Sensitivity (%)",
 "Specificity (%)",
 "Positive Predicted Value (%)",
 "False Positive Rate (%)",
 "False Negative Rate (%)",
 "Correctly Clasified 

This was another pain point of RecordLinkage: the performance metrics only applied to the conditional cases after blocking.
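The distinction can be made concrete with toy numbers (all made up): suppose 100 true matches exist, blocking retains 90 of them, and the matcher recovers 81 of those.

```r
true_matches     <- 100  # true matched pairs in the full data
kept_by_blocking <- 90   # true matches surviving the blocking step
found            <- 81   # matches the estimator actually recovers

sens_conditional   <- found / kept_by_blocking  # conditional on blocking
sens_unconditional <- found / true_matches      # against all true matches
sens_conditional    # 0.9
sens_unconditional  # 0.81
```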

Can't run gammaCKpar on a simple comparison instance

Hi,

Please check the code below, which returned:

Error in { : task 1 failed - "subscript out of bounds"

dfA <- list(name = c('A2 Infant', 'Express Logistics'))
dfB <- list(name = c('A2 Infant', 'Cargo Solutions'))
g_firstname <- gammaCKpar(dfA$name, dfB$name)

But when shifting to the function gammaCK2par, it works.

Could you please advise how to fix the code so that gammaCKpar runs on the instance above?

Cheers

Weighting the matching fields

Hi Guys,

Nice work with the package.

Currently using the package to link addresses in our company database to the Government National Address File (GNAF) (https://data.gov.au/data/dataset/19432f89-dc3a-4ef3-b943-5326ef1dbecc), which contains >14 million addresses in Australia.

Aside from the size of the GNAF (which requires clustering to manage), the challenge has been weighting the fields in relative importance. Given this is all spatial, matching Postcode > Suburb > Street name > house number should carry decreasing value in determining matches; however, the weighting seems to be even. Is there any way of influencing the weighting placed on the various fields?

For example, the address I am matching is 8 Unique Road, Town, 7000, South Australia (row 1 in the following table).

Num  Scenario        StreetNumber  StreetName  StreetType  Suburb    Postcode  State
1    Database        8             unique      street      orange    7000      south australia
2    GNAF            9             unique      street      apples    7000      south australia
3    Fastlink match  8             other       street      pears     7000      south australia
     Match status    match         no match    match       no match  match     match

I can see a few addresses in the GNAF on Unique street (e.g. row 2); however, the Town/Suburb and StreetNumber are incorrect in my database, going by the authority that is the GNAF. Fastlink is picking up as a match an address that has the same number but a different street in a whole other suburb (row 3). By my count, row 3 has 4 matches and 2 non-matches (to row 1), equal to the result as if it had picked row 2 to match row 1.

Is there any way of prioritising the StreetName, or any other field, over the StreetNumber?

The subset of GNAF in South Australia has 440k rows, while the South Australian subset of my database is 164 rows. The size of the 164 is explained by the fact that 90% of my database can be matched identically; the 164 are the problematic ones with varying data quality issues.
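fastLink does not appear to expose per-field weights directly, but the effect the poster is after can be illustrated with a hand-rolled weighted agreement score over the table above (the weights are purely illustrative):

```r
row1 <- c("8", "unique", "street", "orange", "7000", "south australia")  # database
row2 <- c("9", "unique", "street", "apples", "7000", "south australia")  # GNAF
row3 <- c("8", "other",  "street", "pears",  "7000", "south australia")  # fastLink pick

## Equal weights give both candidates 4 agreements; tripling StreetName's
## weight breaks the tie in favour of the GNAF row.
w <- c(StreetNumber = 1, StreetName = 3, StreetType = 1,
       Suburb = 1, Postcode = 1, State = 1)
score <- function(a, b) sum(w[a == b])
score(row1, row2)  # 6
score(row1, row3)  # 4
```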

Run times for field comparison variables

Hi! I've been using fastLink to do some matching projects for work.
I've noticed that some field comparisons take far, far longer than others (maybe not even finishing?).

For example, when matching a dataset of ~300k entries against another dataset of ~1.4 million entries, name comparison takes ~10 minutes, but address comparison takes hours, both using gammaCKpar with the same cutoffs. Sometimes I just have to kill the address comparison completely.

Do you have any suggestions on why this might be taking so long, or what I could do to make it go faster?
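One plausible driver (an assumption, not a confirmed diagnosis) is that pairwise string comparison cost scales with the number of unique values per field rather than raw row counts, and addresses tend to be far closer to unique than names. A quick diagnostic in base R:

```r
## Compare the effective comparison workload of two fields (toy data).
n_pairs <- function(a, b) length(unique(a)) * length(unique(b))

first_a <- c("ann", "ann", "bob")
first_b <- c("ann", "cat")
addr_a  <- c("1 main st", "2 main st", "3 oak ave")
addr_b  <- c("1 main st", "9 elm rd")

n_pairs(first_a, first_b)  # 4 unique-value comparisons
n_pairs(addr_a, addr_b)    # 6: every address is unique, so cost grows fast
```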

Some improvements

  • Faster implementation of stringdist
  • Precalculate string distances of common first and last names
  • Remove as much of duplication between tableCounts and matchesLink as possible: Solved.
  • Add a user-friendly warning for #17: Solved.
  • Add a user-friendly warning for #18: Solved.
  • Add a user-friendly warning for #20: Solved.

Measure distance to nearest group

Hi there,

I want to find items that aren't matched but were just under the threshold for matching with a group. Is there a way to do this?
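One way to approximate this, sketched on mock data, is to keep candidate pairs whose posterior falls just below the matching threshold. The posterior column mirrors the name used in fastLink output (fl.out$posterior), but the values here are made up:

```r
pairs <- data.frame(ind.a = 1:5, ind.b = 6:10,
                    posterior = c(0.99, 0.84, 0.40, 0.82, 0.10))
threshold <- 0.85

## Pairs within 0.1 below the threshold: near misses worth inspecting.
near_miss <- pairs[pairs$posterior < threshold &
                   pairs$posterior >= threshold - 0.1, ]
near_miss$ind.a  # 2 4
```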

Reproducing RLdata500 deduplication results

Hi,

First of all, thank you for your work on fastLink. I've been trying to recreate the results from your paper:

Steorts (2015) reports that the FNR and FDR of her methodology are 0.02 and 0.04, respectively. When applying our algorithm, we use three categories for the string valued variables (first and last names), i.e., exact (or nearly identical) match, partial match, and disagreement, based on the Jaro-Winkler string distance with 0.94 and 0.88 as the cutpoints as recommended by Winkler (1990). For the numeric valued fields (day, month, and year of birth), we use a binary comparison, based on exact matches. Using fastLink, we found both FNR and FDR of our methodology to be exactly zero.

I've included a reproducible example below:

library(dplyr)
library(fastLink)
data("RLdata500", package = "RecordLinkage")

matching_vars <- names(RLdata500)
matches <- fastLink(
  dfA              = RLdata500,
  dfB              = RLdata500,
  varnames         = matching_vars[c(1, 3, 4, 5, 6, 7)],
  stringdist.match = matching_vars[c(1, 3, 4)],
  partial.match    = matching_vars[c(1, 3, 4)]
)
 
matches$matches
# inds.a inds.b
#1     59     59
#2    173    173
#3    219    219
#4    336    336
#5    360    360
#6    419    419
#7    455    455
#8    479    479

dups <- matches$matches %>% 
    filter(inds.a != inds.b) %>% 
    inner_join(rl_data, by = c("inds.a" = "n")) %>% 
    inner_join(rl_data, by = c("inds.b" = "n"))

As you can see, I've made no changes to the default settings of fastLink. The dups data frame contains only exact matches and discovers no duplicates. Do you have any suggestions for recreating your results? I'm assuming I just need to tweak some parameters?

Thanks,
Dan

Running fastLink on several cores/threads

Hi,

first off, thanks for the terrific package, it has helped me a lot in my research.

A quick question out of curiosity: I'm trying to run fastLink() on an ARM cluster computer with 64 cores. However, no matter how many cores I specify through n.cores, the system always tells me:

Parallelizing calculation using OpenMP. 1 threads out of 64 are used.

So it's using only 1 thread of the many more available, which is very inefficient.

Digging through the source code of the package, I found that some of the functions check whether the OS is macOS by including the line if(Sys.info()[['sysname']] == 'Darwin') (e.g. here). I was wondering whether that's what's causing the functions to be limited to only one core, and if so, whether that OS restriction is necessary due to the way the parallel package does things?
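As a first step, it may help to confirm how many cores R itself detects on the cluster, independently of fastLink's OpenMP path:

```r
## Base R (parallel ships with R): report the logical cores visible to R.
library(parallel)
cores <- detectCores()
cores
```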

In any case, thanks again for your terrific software.

Felix

getMatches() warning / deduplication

Hi,

thanks for the great package & the terrific documentation. It's extremely useful and I think many people are going to use it--I definitely will!

I'm having issues when deduplicating a dataset where I have only a few variables to match on. The goal is to generate a common ID for observations with a very similar string identifier. When I follow the deduplication procedure you sketch in the README, but remove all numeric variables and keep only a handful of the string variables, the getMatches() function produces a warning and the common ID is scrambled.

Reproducible Example

Here's a reproducible example to illustrate the problem:

library(tidyverse)
library(fastLink)

## Add duplicates
set.seed(123)

dfA <- fastLink::dfA
dfA <- rbind(dfA, dfA[sample(1:nrow(dfA), 10, replace = FALSE),])

## Run fastLink
fl_out_dedupe <- fastLink(
  dfA = dfA, dfB = dfA,
  varnames = c("firstname", "lastname", "city")
)

## Run getMatches
dfA_dedupe <- getMatches(dfA = dfA, dfB = dfA, fl.out = fl_out_dedupe)

## Look at the IDs of the duplicates
names(table(dfA_dedupe$dedupe.ids)[table(dfA_dedupe$dedupe.ids) > 1])

## Show duplicated observation
dfA_dedupe[dfA_dedupe$dedupe.ids == 4,]

This is basically the code from the "deduplication" section of the README file, but I've reduced the matching variables to only three: firstname, lastname, and city.

Problem description

Running

dfA_dedupe <- getMatches(dfA = dfA, dfB = dfA, fl.out = fl_out_dedupe)

however, results in the following message:

Warning message:
In dfA$dedupe.ids[dfA$dedupe.ids %in% id.original] <- id.duplicated :
  number of items to replace is not a multiple of replacement length

When we look at the resulting data frame, it's clear that the matched IDs are somehow wrongly assigned:

> dfA_dedupe[dfA_dedupe$dedupe.ids == 4,]
   firstname middlename lastname housenum   streetname          city birthyear dedupe.ids
4     joseph      clyde  mcnulty    30436     49th ave Castro Valley      1961          4
38     david       <NA>  johnson     5300  kilkenny pl       Oakland      1960          4

Clearly, joseph and david don't match on any of the chosen variables.

User-written function works

Interestingly, this problem seems connected to #36. In #36, @mbcann01 provides a user-written function to extract matched pairs from the fastLink-object.

Specifically, if we run the fmr_add_unique_id() function provided in #36 and follow the procedure described there, we can retrieve the correct IDs.

# run the code from #36 that generates the "fmr_add_unique_id()" function first

# generate group ID
dfA <- unite(dfA, "group", firstname:birthyear, remove = F) 
   
# extract matches / 'fl_out_dedupe' comes from the code chunk above
dfA_dedupe_user <- fmr_add_unique_id(dfA, fl_out_dedupe)

# join with original data
dfA_w_id <- dfA %>% 
 dplyr::left_join(
   dfA_dedupe_user %>% 
     dplyr::select(id, group), 
   by = "group") %>% 
 dplyr::select(id, dplyr::everything(), -group)

dfA_w_id[dfA_w_id$id == 3, ]

gives us

> dfA_w_id[dfA_w_id$id == 3, ]
  id firstname middlename lastname housenum    streetname          city birthyear
4  3    joseph      aaron   joseph     4547  piedmont ave Castro Valley      1948
5  3    joseph      clyde  mcnulty    30436      49th ave Castro Valley      1961

where id indicates the common ID for duplicated matches (similar to dedupe.ids). Here the correct (i.e. the most similar) josephs are matched. (I know they're not the "correct" match, but the goal was to find the most similar ones.)

Summary

Since the user-written function from #36 correctly retrieves the IDs for the most similar matches, the problem seems to lie in the construction of the deduplicated data.frame in getMatches(), and not in the matching process itself.

Let me know if I can provide any additional info / code to help you fix this issue--if there is indeed an issue and I'm not doing something wrong here.

Thanks again for your work!

Best
Felix

Question - Matching dataset against itself

Hi,

This is a fantastic package. Thanks to you all.

One of the problems I'm trying to solve is determining duplicates within a single dataset. Do you have any ideas on how to use fastLink to ID same-set duplicates?
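For exact duplicates, base R alone can assign a common ID; for probabilistic same-set deduplication, the pattern seen elsewhere in these issues is to pass the same data frame as both dfA and dfB. A sketch of the exact case on toy data:

```r
df <- data.frame(first = c("ewen", "anna", "ewen"),
                 last  = c("ng",   "lee",  "ng"),
                 stringsAsFactors = FALSE)

## Rows sharing a pasted key get the same group ID.
key <- paste(df$first, df$last)
df$dedupe.id <- match(key, unique(key))
df$dedupe.id  # 1 2 1
```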

Cheers,
Ewen

Q: Database size limit for duplicate removal?

Hi,

I am trying to perform deduplication on a database with 1.8M records. The analysis has been running for ~10 days on an 8-core machine with 32 GB RAM. Do you believe this task can be achieved on such a machine, or do I need a bigger server?

My command is as follows:

fl_pac_dedup <- fastLink(
    dfA = pacientes_clean, dfB = pacientes_clean,
    varnames = c("pacient_name", "mother_name", "birth_date"),
    stringdist.match = c("pacient_name", "mother_name", "birth_date"),
    dedupe.matches = FALSE, return.all = FALSE
)
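Back-of-the-envelope arithmetic suggests why this is slow: deduplicating 1.8M records against itself is an all-pairs problem. Blocking cuts the comparison space by roughly the number of blocks (the 80 roughly even blocks below are purely illustrative, e.g. blocking on birth year):

```r
n             <- 1.8e6
pairs_all     <- n^2                      # ~3.24e12 candidate comparisons
blocks        <- 80                       # illustrative blocking variable
pairs_blocked <- blocks * (n / blocks)^2  # ~4.05e10, an 80x reduction
pairs_all
pairs_blocked
```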

Best,
Gui

Using partial.match argument results in different matches

I was attempting a simple record linkage as follows, and I noticed something very strange. When I specify the partial.match argument, the outcome of the match is different from when it is not specified.

In the following made-up data, there are four entities in each dataframe. Entities 2, 3, and 4 are the same people: 2 and 4 are exactly the same records, and for 3 the address changes.

library(tidyverse)
library(fastLink)

dfA_synthetic <- data_frame(
  NameLast = c("Kim", "Lee", "Park", "Choi"), 
  NameFirst = c("Julie", "Joanna", "Jessica", "Jennifer"), 
  Address = c("500 E 6th St",  "400 W 5th Rd", 
              "100 S Main St", "200 N Main St"), 
  City = c("Santa Ana", "Laguna Hills", "Fullerton", "Pasadena"), 
  StreetName = c("6th", "5th", "Main", "Main"), 
  StreetSuffix = c("St", "Rd", "St", "St"), 
  MailAddress1 = c("500 E 6th St",  "400 W 5th Rd", 
                   "100 S Main St", "200 N Main St"), 
  MailAddress2 = c("Santa Ana CA 92701", "Laguna Hills CA 92653", 
                   "Fullerton CA 92831", "Pasadena CA 91106"), 
  Phone = c("", "", "(626)395-4701", "(626)529-3219"), 
  BirthDate = c("01/01/1988", "02/02/1977", "03/03/1999", "04/04/2000")
)

dfB_synthetic <- data_frame(
  NameLast = c("Hong", "Lee", "Park", "Choi"), 
  NameFirst = c("Jean", "Joanna", "Jessica", "Jennifer"), 
  Address = c("600 S Catalina St",  "400 W 5th Rd", 
              "100 S Main St", "200 N Main St"), 
  City = c("Los Angeles", "Laguna Hills", "Fullerton", "Pasadena"), 
  StreetName = c("6th", "5th", "Main", "Main"), 
  StreetSuffix = c("Dr", "Rd", "St", "St"), 
  MailAddress1 = c("600 S Catalina Dr",  "400 W 5th Rd", 
                   "PO Box 3000", "200 N Main St"), 
  MailAddress2 = c("Pasadena 91125", "Laguna Hills CA 92653", 
                   "Anaheim CA 92800", "Pasadena CA 91106"), 
  Phone = c("", "", "(626)395-4701", "(626)529-3219"), 
  BirthDate = c("09/09/1966", "02/02/1977", "03/03/1999", "04/04/2000")
)

m_synthetic_1 <- fastLink(
  dfA = dfA_synthetic, dfB = dfB_synthetic, 
  varnames = names(dfA_synthetic), 
  stringdist.match = names(dfA_synthetic), 
  partial.match = names(dfA_synthetic)
)

m_synthetic_2 <- fastLink(
  dfA = dfA_synthetic, dfB = dfB_synthetic, 
  varnames = names(dfA_synthetic), 
  stringdist.match = names(dfA_synthetic) ## , 
  ## partial.match = names(dfA_synthetic)
)

Then,

m_synthetic_1$matches$inds.a
[1] 3 4
> m_synthetic_2$matches$inds.a
[1] 2 3 4

From what I read from the manual and the GitHub README.md,

partial.match is another vector of variable names present in both stringdist.match and varnames. A variable included in partial.match will have a partial agreement category calculated in addition to disagreement and absolute agreement, as a function of Jaro-Winkler distance.

I understood partial.match to be returning extra summary statistics of some sort to show the degree of agreement on specified variables---something that should not change the match itself. Am I misunderstanding the function?

p.s. This is something separate, and just a small suggestion, but I feel that data(samplematch) might not be the best data to demonstrate the strength of the package for those who first check out the package, because with the samplematch's dfA and dfB, you can simply call inner_join and get the same output much faster. Maybe a revised dataset, as follows:

dfA %<>%
  dplyr::mutate_if(is.factor, as.character) %>%
  dplyr::mutate(middlename = ifelse(lastname == "weatherspoon", NA, middlename))
dfB %<>% dplyr::mutate_if(is.factor, as.character)
matches.out.revised <- fastLink(
  dfA = dfA, dfB = dfB, 
  varnames = c("firstname", "middlename", "lastname", "housenum", "streetname", "city", "birthyear"),
  stringdist.match = c("firstname", "middlename", "lastname", "streetname", "city"),
  partial.match = c("firstname", "lastname", "streetname"), 
  cut.p = 0.8, threshold.match = 0.8
)

question - view posterior for all pairs of rows

Hey guys,

Thank you for developing this package. I think it will be really useful to me. It seems like the default behavior for fastLink::fastLink is to return the indices and posterior probabilities for matching rows only.

Is there a way to view the posterior probabilities for ALL pairs of rows?

I'm interested in knowing if there were any pairs that were NOT matched, but should be.
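For small inputs, one workaround is to enumerate the full pair space yourself and attach whatever per-pair scores you compute; this sketch only builds the index grid (toy data, not fastLink output):

```r
dfA <- data.frame(name = c("brad", "ann", "joe"))
dfB <- data.frame(name = c("bradley", "ann"))

## Every (row of dfA, row of dfB) combination.
all_pairs <- expand.grid(ind.a = seq_len(nrow(dfA)),
                         ind.b = seq_len(nrow(dfB)))
nrow(all_pairs)  # 6
```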

Thank you,
Brad

Enhancement - Exact Matching Option

Hi! Thanks for this wonderful contribution. As soon as I am able, I will attempt a push, but I wanted to get your feedback on an option to assign exact matches even when the probability of matching is too low. I am running into cases where my data is too small to make a probabilistic match and, though exact matches exist, the EM step does not output the indices despite noting them in the summary output.

Thanks in advance for your time and consideration.

New feature request: Matthews correlation coefficient

Please add Matthews correlation coefficient (MCC) as an additional statistic for the confusion table:

      TP * TN - FP * FN
MCC = -----------------------------------------------------
      [(TP + FP) * (FN + TN) * (FP + TN) * (TP + FN)]^(1/2)

The MCC is useful as an overall measure of the linkage quality. The MCC is better than Accuracy and the F1-score for imbalanced data because it adjusts for the balance ratios of the four confusion table categories (TP, TN, FP, and FN). In practice, I find that most linkage data are imbalanced by having mostly TN.
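The formula above translates directly into R (the zero-denominator convention below is a common choice, not part of the original definition):

```r
mcc <- function(TP, TN, FP, FN) {
  num <- TP * TN - FP * FN
  den <- sqrt((TP + FP) * (FN + TN) * (FP + TN) * (TP + FN))
  if (den == 0) return(0)  # convention when any margin of the table is empty
  num / den
}
mcc(45, 45, 5, 5)  # 0.8
mcc(10, 10, 0, 0)  # 1: perfect agreement
```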

Wikipedia: https://en.wikipedia.org/wiki/Matthews_correlation_coefficient
Matthews's article (1975): https://doi.org/10.1016/0005-2795(75)90109-9

Matthews, page 445:

"A correlation of:
   C =  1 indicates perfect agreement,
   C =  0 is expected for a prediction no better than random, and
   C = -1 indicates total disagreement between prediction and observation".

Mentioned in Tharwat's article (2018): https://doi.org/10.1016/j.aci.2018.08.003
Recommended by Luque et al (2019): https://doi.org/10.1016/j.patcog.2019.02.023

Anders

Small sample errors

Hello all,

Thank you for the amazing package again. fastLink works very well with moderate to large datasets---when faced with extremely small samples, it sometimes breaks down. For instance, with the commit on Sep 5, 2018 on gamma functions, the following works:

library(fastLink)
data(samplematch)
## One underlying true match
matches.out <- fastLink(
  dfA = dfA[c(1, 3), ], dfB = dfA[c(1, 2), ], 
  varnames = c("firstname", "middlename", "lastname", "housenum", "streetname", "city", "birthyear"),
  stringdist.match = c("firstname", "middlename", "lastname", "streetname", "city"),
  partial.match = c("firstname", "lastname", "streetname")
)

The following still gives an indexing error (no underlying true matches):

## No underlying true match
matches.out <- fastLink(
  dfA = dfA[c(1, 3), ], dfB = dfA[c(4, 2), ], 
  varnames = c("firstname", "middlename", "lastname", "housenum", "streetname", "city", "birthyear"),
  stringdist.match = c("firstname", "middlename", "lastname", "streetname", "city"),
  partial.match = c("firstname", "lastname", "streetname")
)

and in the case where there happens to be only a single observation in each of dfA and dfB, ncol(patterns) - 1 is not correctly recognized in the function emlinkMARmov (line 7), and the following also breaks:

## One underlying true match
matches.out <- fastLink(
  dfA = dfA[1, ], dfB = dfA[1, ], 
  varnames = c("firstname", "middlename", "lastname", "housenum", "streetname", "city", "birthyear"),
  stringdist.match = c("firstname", "middlename", "lastname", "streetname", "city"),
  partial.match = c("firstname", "lastname", "streetname") 
) 

As for the last case, it works when dfA = dfA[1, ] and dfB = dfA[c(1, 2), ], so I really don't know what the issue is here---it would be great to get a warning and an empty output with the same structure instead of a failure, since the last setup doesn't make sense for probabilistic matching anyway. Such small samples sometimes come up in a dynamic setting.
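Until the package guards against degenerate inputs, callers can approximate the requested behaviour with a wrapper (safe_link is hypothetical, not part of fastLink) that degrades to an empty match set instead of aborting:

```r
## Hypothetical guard: run a linkage call, return an empty result on error.
safe_link <- function(expr) {
  tryCatch(expr, error = function(e) {
    warning(conditionMessage(e))
    list(matches = data.frame(inds.a = integer(0), inds.b = integer(0)))
  })
}

out <- safe_link(stop("subscript out of bounds"))  # simulate the failure
nrow(out$matches)  # 0
```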

Sincerely,
Silvia

Running time

I had a question about the expected running time and the computing capacity I need to plan for when using fastLink. I am trying to run it on a database of 1.7M observations, matching on only two variables. So far (the code has been running for 12h) it has not gotten past the first task of calculating matches for each variable. I was wondering whether this is to be expected and I should move to a cluster, or whether this sounds odd and I am doing something wrong.
Thank you!

question / documentation

How do I get a table with the following information:

rownumber dfA: rownumbers of dfA
rownumber dfB: rownumbers of dfB that can be linked to the corresponding rownumber of dfA
similarity measure: how well can the rownumber of dfB be linked to the corresponding rownumber of dfA?

E.g.: dfA has 10 rows; dfB has 5 rows; These dataframes can be linked as follows:

rownumber dfA  rownumber dfB  similarity measure
 1             2              0.94
 2             NA             0
 3             NA             0
 4             1              0.93
 5             4              0.92
 6             5              0.92
 7             NA             0
 8             NA             0
 9             NA             0
10             3              0.98

Legend: rownumber 2 of dfB can be linked to rownumber 1 of dfA. The similarity measure of this link is 0.94.
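A sketch of building such a table from match output (the column names inds.a/inds.b/posterior mirror fastLink's, but the values are mocked to match the example above):

```r
matches <- data.frame(inds.a = c(1, 4, 5, 6, 10),
                      inds.b = c(2, 1, 4, 5, 3),
                      posterior = c(0.94, 0.93, 0.92, 0.92, 0.98))

## Left-join onto every row index of dfA; unmatched rows get NA / similarity 0.
tab <- merge(data.frame(inds.a = 1:10), matches, all.x = TRUE)
tab$posterior[is.na(tab$posterior)] <- 0
head(tab, 3)
```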

replacement has 2 rows, data has 0

I'm getting an error when merging on two columns of names, and can't quite figure out why.

Here is a toy example that replicates the error. In this example, increasing the number of observations solves the problem, but not in my real case. However, adding a third column seems to solve the problem.

Would anyone know what's going on?

library(fastLink)

## setup toy data
nobs.a <- 30
set.seed(66455) # needs to be a particular draw to replicate error
dfA.0 <- data.frame(firstname = sample(c("JOHN", "GEORGE"), nobs.a, TRUE, c(0.7, 0.3)),
                    lastname = sample(c("MILLER", "HILL"), nobs.a, TRUE, c(0.7, 0.3)))

dfB.0 <- data.frame(firstname = rep(c("JOHN", "OLIVER", "CHARLES", "FRANCIS", "JOHN")),
                    lastname = rep(c("HILL", "YOUNG", "KEEL", "MCNEAL", "KOONS")))
## throws error 
fL.0 <- fastLink(dfA.0, dfB.0, varnames = c("firstname", "lastname"))
#> 
#> ==================== 
#> fastLink(): Fast Probabilistic Record Linkage
#> ==================== 
#> 
#> Calculating matches for each variable.
#> Getting counts for zeta parameters.
#> Running the EM algorithm.
#> Getting the indices of estimated matches.
#> Warning in min(em.obj$weights[em.obj$zeta.j >= l.t]): no non-missing
#> arguments to min; returning Inf
#> Deduping the estimated matches.
#> Error in `$<-.data.frame`(`*tmp*`, roworder, value = c(1L, 0L)): replacement has 2 rows, data has 0


## larger N solves the problem here, though it doesn't in my real data
dfA.1 <- data.frame(firstname = sample(c("JOHN", "GEORGE"), nobs.a*100, TRUE, c(0.7, 0.3)),
                    lastname = sample(c("MILLER", "HILL"), nobs.a*100, TRUE, c(0.7, 0.3)))
fL.1 <- fastLink(dfA.1, dfB.0, varnames = c("firstname", "lastname"))
#> 
#> ==================== 
#> fastLink(): Fast Probabilistic Record Linkage
#> ==================== 
#> 
#> Calculating matches for each variable.
#> Getting counts for zeta parameters.
#> Running the EM algorithm.
#> Getting the indices of estimated matches.
#> Deduping the estimated matches.

## adding a superfluous third column solves the problem, even when a larger N does not in my case
dfA.2 <- cbind(dfA.0, noise = c("noiseA", rep("noiseB", nobs.a - 1)))
dfB.2 <- cbind(dfB.0, noise = c("noiseA", rep("noiseB", nrow(dfB.0) - 1)))
fL.2 <- fastLink(dfA.2, dfB.2, varnames = c("firstname", "lastname", "noise"))
#> 
#> ==================== 
#> fastLink(): Fast Probabilistic Record Linkage
#> ==================== 
#> 
#> Calculating matches for each variable.
#> Getting counts for zeta parameters.
#> Running the EM algorithm.
#> Getting the indices of estimated matches.
#> Deduping the estimated matches.

[Accidentally Opened]

Sorry, this issue was opened by an errant key press. Even though I opened it, since it's not my repo, I can't delete it. Sorry!

Seemingly odd partial matching behavior

Hi there,

Thank you for writing, and especially maintaining, such a great package. I worry this is going to be a silly question -- and I apologize if that is the case.

I am trying to assign unique IDs to a roster of names. It seems, however, that some of the matches are too inclusive and include, as in the example below, strings that seem too distinct from one another to be considered matches. At the bottom I am including a screenshot of an example of this on the full dataset so you can get a better sense of the range of different names that are considered matches.

I am guessing this behavior can be fixed by tweaking some parameters (even though the threshold matching level is > .9?), but I wanted to bring it up here too in case something else is going awry.

Thanks again,
Ben

library(tidyverse)
library(fastLink)

sample_data <- structure(list(fiscal_year = c(2011, 2020, 2021, 2012, 2016, 
2015, 2017, 2017, 2016, 2019, 2019, 2020, 2010, 2014, 2017, 2019, 
2020, 2016, 2015, 2013, 2010, 2010, 2009, 2011, 2009, 2013, 2018, 
2013, 2009, 2017, 2010, 2021, 2021, 2021, 2015, 2014, 2014, 2013, 
2013, 2017, 2013, 2014, 2016, 2012, 2011, 2019, 2016, 2013, 2017, 
2010, 2011, 2011, 2020, 2021, 2015, 2012, 2012, 2009, 2017, 2019, 
2009, 2014, 2021, 2015, 2011, 2021, 2019, 2017, 2012, 2014, 2009, 
2013, 2010, 2021, 2012, 2021, 2013, 2015, 2015, 2015, 2015, 2013, 
2018, 2010, 2011, 2014, 2011, 2015, 2014, 2013, 2016, 2012, 2012, 
2014, 2018, 2016, 2016, 2009, 2014, 2021, 2015, 2010, 2014, 2018, 
2021, 2019, 2010, 2020, 2017, 2009, 2010, 2014, 2018, 2009, 2020, 
2009, 2019, 2019, 2016, 2012, 2018, 2020, 2019, 2014, 2014, 2016, 
2019, 2010, 2015, 2021, 2012, 2013, 2014, 2009, 2015, 2016, 2016, 
2020, 2015, 2012, 2016, 2015, 2011, 2016, 2009, 2019, 2014, 2013, 
2021, 2019, 2020, 2016, 2019, 2010, 2014, 2020, 2021, 2013, 2016, 
2015, 2010, 2018, 2020, 2017, 2016, 2011, 2016, 2017, 2018, 2015, 
2010, 2012, 2019, 2020, 2021, 2016, 2020, 2014, 2016, 2016, 2009, 
2016, 2018, 2016, 2015, 2021, 2017, 2011, 2021, 2018, 2010, 2015, 
2017, 2021, 2012, 2014, 2013, 2010, 2015, 2011, 2015, 2019, 2012, 
2010, 2010, 2020, 2021, 2016, 2012, 2016, 2011, 2014, 2016, 2009, 
2019, 2015, 2017, 2018, 2014, 2021, 2017, 2010, 2013, 2016, 2020, 
2014, 2017, 2013, 2018, 2019, 2013, 2011, 2019, 2011, 2013, 2013, 
2014, 2009, 2018, 2018, 2009, 2021, 2015, 2015, 2018, 2014, 2015, 
2012, 2018, 2014, 2017, 2015, 2010, 2016, 2013, 2019, 2016, 2014, 
2009, 2019, 2009, 2018, 2013, 2011, 2020, 2020, 2009, 2012, 2011, 
2010, 2010, 2017, 2012, 2012, 2009, 2014, 2009, 2016, 2019, 2009, 
2010, 2019, 2014, 2010, 2009, 2018, 2018, 2014, 2016, 2009, 2013, 
2020, 2012, 2019, 2021, 2016, 2021, 2009, 2016, 2014, 2015, 2018, 
2010, 2016, 2016, 2016, 2010, 2011, 2015, 2014, 2009, 2009, 2012, 
2011, 2012, 2018, 2015, 2019, 2018, 2021, 2016, 2019, 2019, 2014, 
2021, 2021, 2018, 2014, 2021, 2010, 2020, 2010, 2014, 2011, 2012, 
2021, 2020, 2009, 2016, 2018, 2011, 2018, 2014, 2019, 2014, 2020, 
2014, 2014, 2019, 2014, 2020, 2021, 2012, 2015, 2010, 2011, 2009, 
2009, 2010, 2016, 2016, 2011, 2019, 2021, 2010, 2020, 2009, 2016, 
2016, 2017, 2015, 2013, 2019, 2012, 2012, 2014, 2011, 2013, 2011, 
2015, 2015, 2009, 2016, 2016, 2021, 2009, 2021, 2009, 2019, 2013, 
2019, 2012, 2019, 2011, 2020, 2012, 2015, 2009, 2012, 2020, 2011, 
2010, 2018, 2010, 2012, 2009, 2015, 2014, 2021, 2019, 2009, 2018, 
2018, 2021, 2019, 2013, 2015, 2010, 2017, 2014, 2019, 2011, 2018, 
2011, 2017, 2012, 2014, 2014, 2014, 2011, 2014, 2016, 2016, 2020, 
2019, 2019, 2017, 2017, 2016, 2011, 2009, 2012, 2012, 2017, 2010, 
2015, 2013, 2016, 2019, 2014, 2019, 2018, 2018, 2016, 2020, 2009, 
2012, 2014, 2015, 2017, 2020, 2013, 2010, 2012, 2009, 2012, 2009, 
2017, 2020, 2017, 2021, 2015, 2020, 2017, 2019, 2018, 2016, 2011, 
2020, 2016, 2019, 2018, 2020, 2014, 2012, 2010, 2016, 2012, 2014, 
2009, 2012, 2015, 2009, 2021, 2013, 2011, 2009, 2014, 2017, 2019
), first_name = c("alice", "sarah", "judith", "mary", "brooke", 
"lisa", "michelle", "marian", "frederick", "ryan", "meaghan", 
"nicolaas", "peter", "tara", "sharon", "neira", "keith", "laura", 
"seth", "daniel", "richard", "david", "linda", "michael", "kristi", 
"timothy", "janet", "amy", "sharon", "erin", "suzanne", "timothy", 
"bradley", "matthew", "cathy", "kathleen", "monica", "john", 
"anita", "kristen", "eddie", "michael", "david", "sean", "gregory", 
"shana", "christopher", "christopher", "elizabeth", "sheila", 
"derick", "susan", "craig", "samuel", "margaret", "stephen", 
"william", "joshua", "tami", "courtney", "hedy", "sara", "francis", 
"kristopher", "cortland", "william", "sharon", "james", "robert", 
"barbara", "michael", "kevin", "leona", "sylvie", "dan", "casey", 
"helen", "keeley", "carla", "laura", "theresa", "isaac", "shelley", 
"kevin", "samantha", "tamatha", "james", "michael", "merideth", 
"carl", "anthony", "david", "susan", "laurie", "donald", "christopher", 
"rhonda", "jennifer", "thomas", "thea", "margaret", "melvin", 
"lucy", "samuel", "ashley", "deett", "naida", "nicole", "aaron", 
"marsha", "william", "nicolaas", "eloise", "mary", "bryan", "kevin", 
"nicole", "david", "karen", "dean", "elmer", "stacy", "sarah", 
"connie", "barbara", "ann", "julie", "harold", "anthony", "kevin", 
"jason", "karen", "donna", "christine", "philip", "richard", 
"alex", "garrett", "brad", "steven", "christopher", "stephen", 
"peter", "samantha", "thomas", "stuart", "daniel", "cassandra", 
"james", "arthur", "manuel", "daniel", "amanda", "robert", "ellen", 
"ryan", "matthew", "alyssa", "alexis", "jeffrey", "gloria", "stephanie", 
"christopher", "shannon", "john", "thomas", "brian", "ahmet", 
"tiffany", "alex", "lucy", "ashley", "hannah", "robert", "stuart", 
"kathleen", "gabriel", "vinny", "brandy", "shelia", "andrew", 
"cindy", "neil", "stephen", "nicole", "elaine", "joshua", "christopher", 
"kelly", "jessica", "kira", "gary", "renee", "scott", "jeremy", 
"brenda", "howard", "w", "winston", "albert", "kris", "calvin", 
"patricia", "ross", "kristin", "ray", "skylar", "caitlin", "amanda", 
"joseph", "brian", "sandra", "ryan", "ben", "anitalouise", "joan", 
"paul", "tammy", "lisa", "brian", "mary", "nicole", "michael", 
"alyssa", "sandra", "kristen", "samantha", "betty", "laurie", 
"richard", "angela", "john", "donald", "neil", "sally", "margery", 
"katelyn", "george", "caroline", "wendy", "april", "daisy", "kristina", 
"lisa", "jessica", "cory", "barbara", "rodney", "elexandra", 
"paul", "robert", "cathy", "phyllis", "richard", "clancy", "sabine", 
"neil", "tammy", "linda", "paul", "megan", "tim", "sara", "kristen", 
"julie", "amanda", "milan", "kathleen", "john", "ronald", "james", 
"scott", "christine", "susan", "brenda", "janet", "richard", 
"stephen", "james", "jay", "scott", "nicole", "amanda", "cecile", 
"michael", "logan", "conrad", "kenneth", "samantha", "steven", 
"barbara", "laura", "jennifer", "jessica", "cierra", "tammie", 
"teodoro", "terrance", "michelle", "nicole", "heather", "adam", 
"david", "matthew", "gregory", "scarlett", "christopher", "elizabeth", 
"linda", "wendy", "maureen", "robert", "keri", "june", "bruce", 
"brett", "kristin", "christopher", "christopher", "rachel", "tammy", 
"sueann", "garold", "samuel", "marian", "kim", "ashley", "robert", 
"brandi", "richard", "james", "brenda", "michael", "craig", "teresa", 
"brian", "andrea", "james", "patricia", "daniel", "john", "clare", 
"judith", "casey", "brock", "robert", "susan", "stephen", "devon", 
"pamela", "marlowe", "lawrence", "mary", "andrew", "denise", 
"evan", "christopher", "brian", "tracy", "kimberly", "jace", 
"julia", "alysha", "robert", "johnathan", "jason", "deborah", 
"jamie", "laurie", "stefan", "robert", "peter", "david", "hollis", 
"cynthia", "samara", "sean", "elizabeth", "kevin", "aaron", "margaret", 
"daniel", "amy", "steven", "dylan", "cynthia", "ellen", "tammy", 
"daniel", "dragica", "julie", "stephen", "nicholas", "heather", 
"johnathan", "arthur", "john", "thomas", "karen", "michael", 
"william", "michael", "sharon", "alena", "monica", "amy", "thomas", 
"lisbeth", "nicole", "april", "lara", "hazel", "jessie", "brigham", 
"hamed", "christopher", "daniel", "kimberli", "carlton", "james", 
"yasin", "tom", "justin", "ean", "joshua", "rizardo", "katie", 
"zachary", "gregory", "jennifer", "charles", "erik", "amanda", 
"chaveli", "emanuel", "erin", "elizabeth", "crystal", "timothy", 
"christopher", "grace", "dylan", "edith", "mark", "beth", "wendy", 
"natalie", "margaret", "jacob", "suzanne", "chandler", "nyima", 
"robert", "bernadette", "katherinlynn", "timothy", "gary", "jessica", 
"andrew", "kristen", "robert", "sandra", "julie", "richard", 
"guy", "tammy", "ernest", "heidi", "ethan", "john", "hera", "scott", 
"cheryle", "brian", "michele", "edward", "thomas", "philip", 
"dawn", "tommy", "eleanor", "sille", "lori", "lucinda", "ashley", 
"david", "john", "karen", "daniel", "phillip", "leslie", "jeffrey", 
"jamie", "wendy", "anjel", "julie", "allison", "heidi", "chad", 
"jennifer"), middle_name_initial = c("m", "j", "w", "e", "a", 
"a", NA, NA, "w iii", "c", "f", "j", "d", NA, NA, NA, "m", "ann", 
"e", "allen", NA, "e", "l", "j", "l", "b", NA, "l", "r", "e", 
NA, "j", NA, "michael", "lee", "m", "l", "p", NA, NA, "p", "d", 
"s", "m", "r", "l", NA, "w", "a", "m", "a", "a", NA, "p", "b", 
"a", "john", "p", NA, "t", "a", "anne", "x", "j", "t", NA, "k", 
"jerrett", "f", "l", "s", "r", "m", "m", "m", "allen", "e", "b", 
"m", "anne", "a", NA, "s", NA, "j", "j", "d", "a", NA, "b", NA, 
"r", "m", NA, "j", "d", "f", NA, "ian", "j", "l", "p", "m", NA, 
"s", NA, "a", NA, "t", "l", "c", "j", NA, "c", NA, "m", "a", 
NA, "e", "william", "j", "lynn", NA, "m", "m", "m", "leann", 
"k", "p", "j", "h", "k", "marie", "h", "b", "s", "p", "m", "c", 
"a", "w", "m", "david", "c", "j", "nils", "h", "s", "b", NA, 
"paul", NA, "jo", "p", "grace", "a", NA, "m", "m", "r", "k", 
"d", NA, "a", "joseph", "e", "gregory", NA, "j", "m", NA, NA, 
"r", "e", "g", "m", NA, "m", "ann", "r", NA, "a", NA, "alan", 
"a", "l", "m", "w", NA, NA, "lindsey", "richard", "l", "robert", 
"k", "j", "a", "b", NA, "m", "a", NA, "a", "l", "charmaine", 
NA, "e", NA, NA, "arthur", "k", "s", "charles", "d", NA, "m", 
NA, "jean", "dulsky", "c", "b", "m", "j", "r", NA, NA, "a", "a", 
"d", NA, "m", "t", "l", "w", "j", "c", NA, NA, "a", "lee", "e", 
NA, NA, "marie", "m", "r", "j", "w", "m", "h", "donald", NA, 
"a", "s", "i", NA, "d", "j", "l", "e", "a", NA, "e", "a", "a", 
"m", NA, "m", "m", "l", "h", "t", "m", "h", "jean", "k", "f", 
"andrew", NA, "e", "t", "m", "r", "j", "scott", "s", "m", "d", 
"j", "p", "l", "v", "a", "martinez", "a", "l", "a", "b", "d", 
NA, "a", NA, "c", "c", "a", "p", NA, "m", NA, "j", NA, "e", "l", 
"b", "l", "a", "m", NA, NA, NA, NA, "christie", "f", "charles", 
"m", "c", "m", "h", "nicole", "milton", "r", NA, "d", "a", NA, 
"e", "elizabeth", "p", "c", "scott", "t", "l", NA, "steven", 
"n", "t", "e", "d", "ms", "j", NA, "p", "ellen", NA, "despina", 
"m", "l", NA, NA, "marie", NA, "a", "n", NA, "f", "c", "l", NA, 
"e", "w", "e", "d", "s", NA, "l", "s", NA, "b", "a", "a", "m", 
"c", "marie kravetz", "j", NA, "l", "e", "m", "d", NA, "m", "w", 
"lars", NA, "james", "j", "m", "j", "a", "t", "j", "l", "dwyer", 
"m", "a", "marie", "j", "a", "r", "j", NA, NA, "laine", NA, NA, 
"p", "f", "a", "w", "e", "m", "i", "r", NA, "a", "a", NA, "edward", 
"p", "l", "j", "m", NA, NA, NA, NA, "s", "mae", "c", NA, "f", 
NA, "l", "t", "b", "a", "c", "susan", NA, NA, "alexander", NA, 
"a", "m", "e", "p", "t", "mary", "a", "m", "lukas", NA, NA, NA, 
"s", "l", "h", "j", NA, "ross", "morgan", "f", "a", "c", "m", 
"g", "e", "s", "m", "j", NA, "lyster", "ann", "a", "m", "t", 
"edward", "a", "c", "j", "g", "l", "a", "l", "brittany", "j", 
"r", "lynn", "lee", NA), last_name_join = c("emmons", "copen", 
"ehrlich", "bizzari", "brittell", "bruce", "hastry", "petrides", 
"ross", "knox", "kelley", "garbacik", "davenport", "lombardi", 
"mallory", "valentic", "gallant", "nicolai", "hisman", "thompson", 
"boulanger", "tremblay", "bates", "wilson", "wheeler", "clear", 
"carpenter", "harrington", "holland", "hodges", "santarcangelo", 
"pricer", "pilette", "conte", "hartshorn", "hill", "light", "pellegrini", 
"chadderton", "vrancken", "earle", "walker  ii", "bailey", "hilpl", 
"schlueter", "blanchard", "chadwick", "olson", "riley", "merchant", 
"lind", "zeller", "digiammarino", "truex", "schwartz", "brooks", 
"orosz  iii", "hulett", "walker", "sanford", "harris", "molino", 
"aumand  iv", "cronin", "corsones", "pendlebury", "batdorff", 
"braid", "gunn  jr", "giffin", "bertrand", "klamm", "goebel", 
"hebert", "fraysier", "laplante", "suntag", "weening", "frappier", 
"conway", "wood", "sponem", "jerman", "aremburg", "baigelman", 
"green", "reed  jr", "carlisle", "plumpton", "mallette  jr", 
"egizi", "lambert", "teske", "dahlin", "einhorn", "billado", 
"sheffield", "royer", "mcmurdo", "schwartz", "burke", "quesnel", 
"boyden", "winship", "godzik", "cross", "beutel", "hersey", "connor", 
"rowell", "deveneau", "garbacik", "harris", "spicer", "scrubb", 
"lacross", "lyford", "hosford", "nelson", "webb", "deforge  iii", 
"carpenter", "trombly", "laplant", "prentice", "gosselin", "gilpin", 
"rock", "manfredi", "mullin", "maxham", "lamorder", "amiot", 
"howe", "scott", "donahey  iii", "emerson", "gonzalez", "james", 
"chadwick", "baird", "riendeau", "dufault", "tullar", "trudeau", 
"johnson", "raddock", "edson", "euber", "blackhawk", "sainz", 
"jarvis", "baslow", "kenney", "livingston", "quenneville", "cetin", 
"mullan", "mclean", "merrell", "naughton", "lanphear", "chadwick", 
"huntington", "savasta", "unkles", "irish", "mujkanovic", "mason", 
"dees", "leriche", "whitehill", "phelps", "ryan", "schurr", "pickens", 
"cameron", "barbiero", "robillard", "martin", "bernier", "laraway", 
"herrick", "dixon", "corrao", "duke", "harless", "boucher", "ireland", 
"hill", "mclenithan", "mcginnis", "spaulding", "carpenter", "thompson", 
"persons", "boutwell", "young", "weston  jr", "metayer", "rowley", 
"kelley", "damery", "nagy", "wackley", "allen", "guidadailey", 
"stanton", "mable", "laporte", "ingalls", "holt", "mcpartland", 
"katz", "thomason", "rock", "badger", "mellish", "watkins", "nowak", 
"munger", "tousignant", "mcardle", "chaffee", "lorette", "rajewski", 
"devenger", "nuovo", "sabens", "spiese", "owen", "maccallum", 
"shaw", "monteith", "adams", "reurink", "homeyer", "buzzell", 
"keller", "dubois", "schwendler", "berbeco", "kiarsis", "vautrain", 
"pomainville", "cheever", "morway", "pratt", "arthers", "erlbaum", 
"gallipo", "cappetta", "girard", "rowden  ii", "desmet", "baldwin", 
"desmond", "kennison", "riddell", "stagner", "grenier", "brick", 
"teachout", "goldstein", "young", "preston", "hladik", "peyerl", 
"becker", "taylor", "greenwood", "crowley", "dorer", "tarshis", 
"colburn", "dunigan", "riviezzo", "marcoux  jr", "smith", "dudley", 
"cookingham", "twamley", "morrison", "manley", "furgat", "riley", 
"cote", "mason", "fontaine", "bullard", "johnston", "lapierre", 
"hart", "berry", "carter", "beauregard", "remolador", "richardson", 
"salvador", "rogers", "hannan", "silverman", "willey", "langham", 
"rainville", "burgess", "barker", "lawrence", "dilena", "edwards", 
"carr", "ryan", "starr", "sundberg", "whipple", "perry", "martin", 
"berg", "adams", "perryhannam", "pidgeon", "clark", "davis", 
"jensen", "crandall", "winslow", "berliner", "finnegan", "crisante", 
"norway  jr", "jones", "caforiaweeber", "osborne", "dusablon", 
"dimas", "turner", "dumas", "deeghan", "halloran", "zorzi", "mandeville", 
"oshaughnessy", "henkin", "leach", "rutter", "letourneau", "meyer", 
"mccarthy", "coleman", "skriletz", "galbraith", "cupoli", "west", 
"willette", "keating", "wojtkowiak", "corbin", "alimena", "harrington", 
"humphrey", "curtis", "paradiso", "kane", "melcher", "croft", 
"deforge", "mercy", "christian", "brown", "strohmaier", "suckert", 
"danles", "laclair", "schwannoble", "boron", "fitzgerald", "krevetski", 
"mcdonald", "pisani", "moore", "jansch", "marcellus", "saunders", 
"yannacone", "lawrence", "davis", "parr doering", "saltus", "roy", 
"gagulic", "arms", "simoes", "olson", "fazekas", "bixby", "hamlin", 
"federico", "donovan  jr", "brooks", "collins", "soule", "ryan", 
"stepp", "farrell", "steel", "mercier", "malinowski", "kokx", 
"marabella", "green", "sobel", "fay", "mackinnon", "reese", "kone", 
"cadorette", "nicasio", "ward", "fuller", "altman", "ibrahim", 
"evslin", "burkewitz", "briere", "larose", "reynoso", "kinter", 
"manchester", "kalinoski", "isham", "keefer", "fischer", "ciecior", 
"miles", "betz", "singer", "white", "zada", "kilby", "barker", 
"winters", "newton", "sullivan", "ferguson", "irvine", "walker", 
"santamore", "becker", "bovee", "bushee", "corbett", "tsamchoe", 
"farr", "vermette", "fox", "hewitt", "nowak", "burrill", "adams", 
"jensen", "wohland", "maisonet", "campbell", "powell  ii", "tapper", 
"wilt", "patnoe", "ibey", "mclaughlin", "woodward", "bosley", 
"williams", "weeden", "donaldson", "betit", "domey", "fields", 
"daye", "fleury", "walz", "smith", "larsen", "potvin", "holtrop", 
"gregory", "minot", "adams  iv", "lafond", "riley", "edgerley", 
"russell", "martin", "sheltra", "houstonanderson", "robins", 
"green", "keelty", "austin", "gibney", "quero"), suffix = c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, " ii", NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, " iii", NA, NA, NA, NA, NA, " iv", 
NA, NA, NA, NA, NA, " jr", NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, " jr", NA, NA, " jr", NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, " iii", NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, " iii", NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, " jr", NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, " ii", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, " jr", NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, " jr", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, " jr", NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, " ii", NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, " iv", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA)), row.names = c(NA, -500L), class = c("tbl_df", "tbl", "data.frame"
))

fl_out <- fastLink(
  dfA = sample_data, 
  dfB = sample_data,
  varnames = c("first_name", 
               "middle_name_initial",
               "last_name_join"),
  stringdist.match = c("first_name", 
                       "middle_name_initial",
                       "last_name_join"),
  partial.match = c("first_name", 
                    "middle_name_initial",
                    "last_name_join"),
  threshold.match = .90,
  n.cores = 5)


matches_out <- getMatches(
  dfA = sample_data, 
  dfB = sample_data,
  fl.out = fl_out,
  threshold.match = .90)

matches_out %>% 
  filter(dedupe.ids == 325)


Question - Recommendation for geocoding

Hi!

How would you suggest implementing links using geocoded records? I've considered numeric comparisons on latitude and longitude, or perhaps reweighting the posterior probability by the geodesic distance between records. Just spit-balling. Thanks!
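To make the first option concrete, here is a rough sketch of what I mean. This assumes fastLink's numeric.match and cut.a.num arguments handle numeric fields this way; recordsA, recordsB, lat, and lon are hypothetical data and column names:

```r
## Sketch: treat latitude/longitude as numeric matching fields
## (hypothetical data frames and column names; numeric.match /
## cut.a.num usage is my assumption about the API)
fl_geo <- fastLink(
  dfA = recordsA, dfB = recordsB,
  varnames = c("last_name", "lat", "lon"),
  stringdist.match = c("last_name"),
  numeric.match = c("lat", "lon"),
  cut.a.num = 0.01)  # roughly 1 km in degrees of latitude
```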

Stewart

nameReweight NA issue

Hello everyone,

Nice work with the package. It works well for me.

I just have a few questions about reweighting posterior probabilities. After using nameReweight, or fastLink with nameReweight and firstname.field, I get only NA in zeta.name, and I don't understand why. Looking at the function, the cause seems to be this line: 'matches.names.A$zeta.j.names[matches.names.A[,ind] != 2] <- NA'. But I don't understand what it does.

Also, I need to reweight using more than one field. I have already made some modifications to do this, but I wanted to know whether there was a reason you did not support it.

In fact, I realize I'm not really sure how to use the nameReweight function. Could you explain it to me?

Best,

Emeric

How to return multiple matches

I have two datasets:

  1. test_pool is a panel of names across years. But sometimes the names change slightly, as they are hand-coded.
  2. test_key contains a sample of the names in test_pool, but only from the most recent year.

Here is simplified example data:

test_pool <- structure(list(year = c(2010, 2012, 2014, 2014),
                              first_name = c("Jose", "Joe", "Jose", "Todd")),
                         row.names = c(NA, -4L),
                              class = c("tbl_df", "tbl", "data.frame"))
test_key <- structure(list(year = c(2014),
                         first_name = c("Jose")),
                    row.names = c(NA, -1L),
                    class = c("tbl_df", "tbl", "data.frame"))

I would like to use fastLink to find every occurrence of "Jose" in test_pool, regardless of the year. I would also like to capture the cases where "Jose" is accidentally written as "Joe". To do so, I have written the following:

fl_out <- fastLink(
  dfA = test_key, 
  dfB = test_pool,
  varnames = c("first_name"),
  stringdist.match = c("first_name"),
  dedupe.matches = F)

matches_out <- getMatches(
  dfA = test_key, 
  dfB = test_pool,
  fl.out = fl_out)

However, this code only returns the records for "Jose" from 2014:

> matches_out
    year first_name gamma.1          posterior
1   2014       Jose       2 0.9694313510032212
1.1 2014       Jose       2 0.9694313510032212

How can I use fastLink to find the records for c("Jose", 2010) and c("Joe", 2012)? Any help at all would be greatly appreciated.
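For reference, one direction I have been exploring is reading the raw index pairs off the fastLink output instead of going through getMatches(), since getMatches() seems to collapse the dfB rows. This assumes fl_out$matches holds the matched row indices as inds.a and inds.b:

```r
## Sketch: recover every matched (dfA, dfB) pair directly from the
## fastLink output (assumes fl_out$matches has inds.a / inds.b)
pairs <- fl_out$matches
cbind(test_key[pairs$inds.a, , drop = FALSE],
      test_pool[pairs$inds.b, , drop = FALSE])
```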

pre-processing dplyr/data.table columns

Datasets that are dplyr (tibble) or data.table objects, rather than plain data.frame objects, are not valid input for the gammaCKpar() call inside fastLink.

It might not be worth changing the code, since data.frame is the standard input. I thought I'd post anyway because there was no warning, and it took me a while to figure this out when using fastLink. The code below should be reproducible.

library(fastLink)

## data frame management packages some people use
library(dplyr)
library(data.table) 

## example data
data(samplematch) 
class(dfA) # data.frame object

## Suppose these were dplyr objects, not data frames
## e.g. dplyr
dfA.dp <- tbl_df(dfA)
dfB.dp <- tbl_df(dfB)

## e.g. data.table
dfA.dt <- as.data.table(dfA)
dfB.dt <- as.data.table(dfB)

class(dfA.dp) 
head(dfA.dp[, "firstname"]) # not quite a vector, data will be ignored in gammaKpar
class(dfA.dt) 
head(dfA.dt[, "firstname"]) # same for data.table


## Run gammaCK for a given variable using syntax from fastLink()
varname.i <- "firstname"

agr    <- gammaCKpar(dfA[, varname.i], dfB[, varname.i]) 
agr.dp <- gammaCKpar(dfA.dp[, varname.i], dfB.dp[, varname.i]) # no error message
agr.dt <- gammaCKpar(dfA.dt[, varname.i], dfB.dt[, varname.i]) # no error message

length(agr$matches1)    # well-populated
length(agr.dp$matches1) # empty
length(agr.dt$matches1) # empty
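A simple defensive workaround on the user side is to coerce back to a plain data.frame before subsetting, so that dfX[, var] drops to an atomic vector as gammaCKpar() expects:

```r
## Workaround sketch: coerce tibbles/data.tables to plain data frames
## so single-column subsetting returns a vector again
agr.fix <- gammaCKpar(as.data.frame(dfA.dp)[, varname.i],
                      as.data.frame(dfB.dp)[, varname.i])
length(agr.fix$matches1)  # populated again
```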

Extracting matches when using blocking

Hi there, I am trying to identify duplicates in a large dataset. I am blocking on several variables, aggregating with aggregateEM() and then trying to extract the matches with getMatches(). It looks like getMatches() won't work with the fastLink.aggregate class. Is there some other way to get the same functionality?

Reprex:

library(fastLink)
library(foreach)

data <- data.frame(gender = c(1,2,1,1,1,1,2,2,1,2),
                   age = c(18, 25, 18, 35, 45, 55, 65, 76, 87, 98))

blocks <- blockData(data, data, varnames = c("gender"))

tmp_clus <- parallel::makeCluster(spec = parallel::detectCores()-2, 
                                  type = 'PSOCK')  
doParallel::registerDoParallel(tmp_clus)

em_list <- foreach::foreach(i = 1:length(blocks), .verbose = F) %dopar%
  {
    library(fastLink)
    data_block <- data[blocks[[i]]$dfA.inds,]
    
    fastLink(
      dfA = data_block, dfB = data_block, 
      varnames = c("gender", "age")
      )
  }
parallel::stopCluster(tmp_clus)

em_aggregated <- aggregateEM(em_list)

data_dedupe <- getMatches(dfA = data, dfB = data,
                          fl.out = em_aggregated)

# Error in getMatches(dfA = data, dfB = data, fl.out = em_aggregated) : 
#   dfA and dfB are identical, but fl.out is not of class 'fastLink.dedupe.' Please check your inputs.
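One possible workaround I am considering, sketched below: call getMatches() inside the loop, while each per-block result is still of class 'fastLink.dedupe', and stack the per-block matches instead of aggregating first. This assumes rbind-ing the per-block outputs is a valid way to combine them:

```r
## Sketch: dedupe within each block, then stack the results
matches_list <- lapply(seq_along(blocks), function(i) {
  data_block <- data[blocks[[i]]$dfA.inds, ]
  fl_i <- fastLink(dfA = data_block, dfB = data_block,
                   varnames = c("gender", "age"))
  getMatches(dfA = data_block, dfB = data_block, fl.out = fl_i)
})
data_dedupe <- do.call(rbind, matches_list)
```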

Warning in emlinkMARmov in case of a poor match

I've been looking quite extensively into the fastLink algorithm, and notably its edge cases. In doing so, I think I've encountered a bug in emlinkMARmov in the case of a poor match. If you unzip the attached file, you may run

library(fastLink)
load("fastlink.Rdata")
emlinkMARmov(patterns, nobs.a, nobs.b)

If you run the snippet above, in most cases you will get a warning that p.old is of a different length than p.new. It does not happen every time, probably due to randomization. The reason is that after a few iterations p.m = 0 and consequently num.prod = 0. Due to the division by zero, p.gamma.k.m becomes NaN, and sorting on NaN yields an empty vector; hence the difference in length between p.old and p.new. Since p.old and p.new are of different lengths, it does not really make sense to subtract one from the other, as R is then recycling.

I was thinking that p.m = 0 essentially means there is no match, so it doesn't really matter if we discontinue the EM algorithm and exit the loop, i.e. if (p.m == 0) break. I was wondering if you have other ideas about it.

By the way, fastLink is very nice work. Thanks a lot for it.

fastlink.zip

Error in mclapply() on Windows

I am using fastLink on confidential data and get an error in mclapply(). I am using fastLink version 0.1.1 on Windows 7 with 4 cores.

This is the problematic R command:

library(fastLink)
> fl.out <- fastLink(rpatient7, racs7, 
>     varnames = c("bstate", "sex", "nysf", "nysl", "ssn", "dob"),
>     stringdist.match = c("nysf", "dob"), n.cores = 2)

This is the problematic R output:

==================== 
fastLink(): Fast Probabilistic Record Linkage
==================== 

Calculating matches for each variable.
Error in mclapply(matches.2, function(s) { : 
  'mc.cores' > 1 is not supported on Windows

Immediately after the error, I typed traceback() and this is the result:

> traceback()
4: stop("'mc.cores' > 1 is not supported on Windows")
3: mclapply(matches.2, function(s) {
       ht1 <- which(matrix.1 == s[1])
       ht2 <- which(matrix.2 == s[2])
       list(ht1, ht2)
   }, mc.cores = getOption("mc.cores", no_cores))
2: gammaCK2par(dfA[, varnames[i]], dfB[, varnames[i]], cut.a = cut.a, 
       method = stringdist.method, w = jw.weight, n.cores = n.cores)
1: fastLink(rpatient7, racs7, varnames = c("bstate", "sex", "nysf", 
       "nysl", "ssn", "dob"), stringdist.match = c("nysf", "dob"), 
       n.cores = 2)

If I change the syntax from n.cores = 2 to n.cores = 1 (or if I omit the option) then the R output is fine.
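For now I am using a small portability guard, since fork-based mclapply() parallelism is simply unavailable on Windows (a sketch of my own workaround, not an official fix):

```r
## Fall back to a single core on Windows, where mclapply()
## cannot use more than one core
n.cores.safe <- if (.Platform$OS.type == "windows") 1 else 2
fl.out <- fastLink(rpatient7, racs7,
                   varnames = c("bstate", "sex", "nysf", "nysl", "ssn", "dob"),
                   stringdist.match = c("nysf", "dob"),
                   n.cores = n.cores.safe)
```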

I could not reproduce the error on datasets dfA and dfB. The problem with mclapply() on Windows is discussed further at https://www.r-bloggers.com/implementing-mclapply-on-windows-a-primer-on-embarrassingly-parallel-computation-on-multicore-systems-with-r/

Please advise.

Question: window blocking conditional on another variable

Background: We would like to use fastLink to link data on road crashes from police and hospital records. In the blocking phase, we would like to set a window for the time between the crash as registered by the police and the time the patient arrived in the hospital. However, this window is likely to depend on the severity of the injuries: patients with milder injuries can take more time to arrive in the hospital than patients with severe injuries.

Question: Would it be possible to include dependencies between variables while blocking using fastLink? Can the window size be dependent on the value of a second variable? (in our case, we would like to have a smaller window size for more severe injuries)
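To make this concrete, something like the following is what we have in mind. This is only a sketch: it assumes blockData()'s window.block and window.size arguments, and police, hospital, severity, and hours are hypothetical data and column names:

```r
## Sketch: block each severity stratum with its own time window,
## then pool the resulting block lists (hypothetical names throughout)
split_block <- function(sev, win) {
  blockData(police[police$severity == sev, ],
            hospital[hospital$severity == sev, ],
            varnames = "hours",
            window.block = "hours",
            window.size = win)
}
blocks_severe <- split_block("severe", win = 1)  # tight window
blocks_mild   <- split_block("mild",   win = 6)  # looser window
```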

Many thanks!

"undefined columns selected" in dedupeMatched

When dfA contains an irrelevant-for-matching variable that is not in dfB, dedupeMatches() breaks here:

matchesB <- subset(matchesB, select = colnames.df)

It looks like defining two versions of colnames.df earlier in the code (instead of a single one based on matchesA) would prevent this from happening?

MWE:

library(fastLink)
data(samplematch)

dfAextra <- data.frame(dfA, extra = 1:nrow(dfA)) # this var not used for matching
out <- fastLink(dfAextra, dfB, varnames = c("firstname", "middlename", "lastname"))

# ==================== 
# fastLink(): Fast Probabilistic Record Linkage
# ==================== 
# 
# Calculating matches for each variable.
# Getting counts for zeta parameters.
# Parallelizing gamma calculation using 1 cores.
# Running the EM algorithm.
# Getting the indices of estimated matches.
# Parallelizing gamma calculation using 1 cores.
# Deduping the estimated matches.
# Error in `[.data.frame`(x, r, vars, drop = drop) : 
#   undefined columns selected
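A user-side workaround until this is patched, sketched below: match on the shared columns only, then re-attach the dfA-only variable by row index afterwards. This assumes out$matches$inds.a holds the matched dfA row numbers:

```r
## Sketch: drop the dfA-only column for matching, restore it after
shared <- intersect(colnames(dfAextra), colnames(dfB))
out <- fastLink(dfAextra[, shared], dfB,
                varnames = c("firstname", "middlename", "lastname"))
matchesA.full <- dfAextra[out$matches$inds.a, ]  # 'extra' restored here
```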

How to avoid the error "You have no exact matches ... Please drop this variable"?

I tried to reproduce a linkage I had previously done with the RecordLinkage package (R) in order to get a realistic test, but with fastLink on Windows I get the error "You have no exact matches ... Please drop this variable". Because I do not want to drop the variable from my analysis, how can I avoid the error?

The error message is because of lines 225-227 in fastLink.r

        if(sum(dfA[,varnames[i]] %in% dfB[,varnames[i]]) == 0){
            stop(paste0("You have no exact matches for ", varnames[i], ". Please drop this variable from your analysis."))
        }

I have an average PC with 8 GB of RAM. However, my datasets have about 75,000 and 3 million observations, which is about 100 times more than in your example on pages 24-25 of the paper. To reduce the datasets by a factor of about 100, I therefore used clusterMatch() with 100 clusters. If I use "only" 50 clusters, the error disappears, but the run time is much slower.

The problem does not occur in RecordLinkage because RecordLinkage allows more blocking than just gender. The error in fastLink occurred for the 9-digit Social Security Number (SSN) which overall is the best matching variable, so I do not want to drop the variable.

I know you are working hard on other things, such as the confusion matrix, but this error is problematic. One possible quick fix might be to change the error to a warning only. Another possibility is for me to drop the offending cluster(s), but that seems like a very crude workaround. A third possible solution, which I suggest is better but harder to implement, would be to allow more restrictive blocking than gender only, in order to enable smaller blocking pairs, less need for many clusters, and faster run times.
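The "drop the offending cluster(s)" option, crude as it is, can at least be automated, sketched below; dfA.clus, dfB.clus, and vars stand in for my per-cluster inputs:

```r
## Sketch: trap the "no exact matches" error so one bad cluster
## does not abort the whole run (hypothetical per-cluster objects)
fl.clus <- tryCatch(
  fastLink(dfA.clus, dfB.clus, varnames = vars),
  error = function(e) {
    message("Skipping cluster: ", conditionMessage(e))
    NULL
  })
```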

--Anders

Confusing $addition.info from new function confusion()

Thank you for adding the confusion table in version 0.3.0. However, I think I found a bug due to a logical error for the additional statistics ($addition.info). These additional statistics from https://github.com/kosukeimai/fastLink/blob/master/R/confusion.R do not match up with Stata's classtabi command unless you reverse cells A and D.

Based on prior conversation, I assume the reason is that you use the same formulas but, as noted in the help file of classtabi under Remarks, in classtabi cell A = True Negative (TN) and cell D = True Positive (TP) whereas in confusion.R cell A = TP and cell D = TN. This is best illustrated with a simple reproducible example.

Example of confusion.R in fastLink:

# run fastLink() and get the confusion table
> library(fastLink)
> data(samplematch)
> 
> out <- fastLink(
+   dfA = dfA, dfB = dfB, 
+   varnames = c("firstname", "middlename", "lastname"),
+   stringdist.match = c("firstname", "middlename", "lastname"),
+   return.all=TRUE)
> ct <- confusion(out)

# display summary results
> summary(out)
                  95%     85%     75%   Exact
1 Match Count      50      50      50      43
2  Match Rate 14.225% 14.225% 14.225% 12.286%
3         FDR  0.426%  0.426%  0.426%        
4         FNR  1.378%  1.378%  1.378% 

> ct
$confusion.table
                     'True' Matches 'True' Non-Matches
Declared Matches               49.8                0.2
Declared Non-Matches            0.9              331.3

$addition.info
                                           results
Max Number of Obs to be Matched 382.19999999999999
Sensitivity (%)                  99.70000000000000
Specificity (%)                  99.59999999999999
Positive Predicted Value (%)     99.90000000000001
Negative Predicted Value (%)     98.09999999999999
False Positive Rate (%)           0.40000000000000
False Negative Rate (%)           0.30000000000000
Correctly Clasified (%)          99.70000000000000

Reproduce example using classtabi in Stata:

# Multiply all cells by 10 because classtabi requires integers
 classtabi 498 2 9 3313

           |          col
       row |         0          1 |     Total
-----------+----------------------+----------
         0 |       498          2 |       500 
         1 |         9      3,313 |     3,322 
-----------+----------------------+----------
     Total |       507      3,315 |     3,822 



-------------------------------------------------
Sensitivity                     D/(C+D)   99.73%      
Specificity                     A/(A+B)   99.60%      
Positive predictive value       D/(B+D)   99.94%      
Negative predictive value       A/(A+C)   98.22%      
-------------------------------------------------
False positive rate             B/(A+B)    0.40%      
False negative rate             C/(C+D)    0.27%      
-------------------------------------------------
Correctly classified      A+D/(A+B+C+D)   99.71%      
-------------------------------------------------
Effect strength for sensitivity           99.33%      
-------------------------------------------------
ROC area                                  0.9966      
-------------------------------------------------

To grasp what I think is a bug due to a logical error, we can compare with Wikipedia. For example, the first statistic mentioned, "Sensitivity", is defined on Wikipedia and in Stata as TP / (TP + FN), whereas you seem to define it as TN / (FN + TN). A more academic reference is Methodological Developments in Data Linkage by Harron, Goldstein, and Dibben; there (in chapter 4, page 81), using "Sensitivity" as the example again, the definition is the same as in Stata and Wikipedia. The Stata example below reverses cells A and D to illustrate the difference and what I think are the correct results under standard terminology.

Show in Stata what the $addition.info should be:

 
# Multiply all cells by 10 because classtabi requires integers
# Required syntax:     classtabi #a #b #c #d
# Helpfile states: #a = TN, #b = FP, #c = FN, #d = TP
. classtabi 3313 2 9 498

           |          col
       row |         0          1 |     Total
-----------+----------------------+----------
         0 |     3,313          2 |     3,315 
         1 |         9        498 |       507 
-----------+----------------------+----------
     Total |     3,322        500 |     3,822 



-------------------------------------------------
Sensitivity                     D/(C+D)   98.22%      
Specificity                     A/(A+B)   99.94%      
Positive predictive value       D/(B+D)   99.60%      
Negative predictive value       A/(A+C)   99.73%      
-------------------------------------------------
False positive rate             B/(A+B)    0.06%      
False negative rate             C/(C+D)    1.78%      
-------------------------------------------------
Correctly classified      A+D/(A+B+C+D)   99.71%      
-------------------------------------------------
Effect strength for sensitivity           98.16%      
-------------------------------------------------
ROC area                                  0.9908      
-------------------------------------------------
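To make the comparison concrete, the corrected statistics can be recomputed by hand from the four cells (a minimal base R sketch; the variable names are mine):

```r
# Cells from the classtabi call above: A = TN, B = FP, C = FN, D = TP
A <- 3313; B <- 2; C <- 9; D <- 498

sensitivity <- D / (C + D)               # TP / (TP + FN)
specificity <- A / (A + B)               # TN / (TN + FP)
ppv         <- D / (B + D)               # TP / (TP + FP)
npv         <- A / (A + C)               # TN / (TN + FN)
accuracy    <- (A + D) / (A + B + C + D)

round(100 * c(sensitivity, specificity, ppv, npv, accuracy), 2)
# 98.22 99.94 99.60 99.73 99.71 -- matching the Stata output above
```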

Matching Dates

I have a column converted to date format (without time) using `as.Date()`, `lubridate::as_date()`, and even `as.Date.character()`.

> df$date_of_contact[1]
[1] "2020-11-15"
> class(df$date_of_contact)
[1] "Date"

When I try to use that as a variable in fastLink, throws the error:

Error in charToDate(x) : 
  character string is not in a standard unambiguous format

How do I include dates as part of the match? I'm trying to do an exact match to deduplicate a dataframe.
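One workaround (a sketch, assuming the comparison functions expect character or numeric columns rather than Date objects) is to format the Date column as a string before passing it to fastLink:

```r
# Hypothetical example data; date_of_contact is a Date column as in the issue
df <- data.frame(date_of_contact = as.Date("2020-11-15"))

# Convert to an ISO-formatted string so exact matching is well defined
df$date_of_contact <- format(df$date_of_contact, "%Y-%m-%d")
class(df$date_of_contact)  # "character"
```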

Col::subvec() error with some data

I can run fastLink() stringdist.match on two variables on my datasets with up to 20k rows but with more rows, I get the following error (and a crash in RStudio) during the "Getting counts for parameter estimation" phase:

error: Col::subvec(): indices out of bounds or incorrectly used
terminate called after throwing an instance of 'std::logic_error'
  what():  Col::subvec(): indices out of bounds or incorrectly used

question / supporting other languages

Hi Ted and fastLink developers,

Thank you for creating this amazing package and answering my previous questions.

My team is wondering if you have written - or have plans to write - fastLink in other languages, especially Python, Java, or Scala. Would you shed some light?

Thanks,
Katie

dedupeMatches() fails on single-variable matches

Thank you for the package! It was released very timely for my work.

The issue I faced when trying on my datasets was that comparison on only one string variable would fail. The same error can be replicated with the test datasets:

> library(fastLink)
> data(samplematch)
> matches.out <- fastLink(
+   dfA = dfA, dfB = dfB, 
+   varnames = c("firstname"),
+   stringdist.match = c("firstname"),
+   partial.match = c("firstname")
+ )

==================== 
fastLink(): Fast Probabilistic Record Linkage
==================== 

Calculating matches for each variable.
Getting counts for zeta parameters.
(Using OpenMP to parallelize calculation. 1 threads out of 4 are used.)
Running the EM algorithm.
Getting the indices of estimated matches.
(Using OpenMP to parallelize calculation. 1 threads out of 4 are used.)
Deduping the estimated matches.

 Error in fix.by(by.y, y) : 'by' must specify a uniquely valid column 

The culprit is line 158 in tableCounts.R, which coerces the data.frame into a vector when only one column is left after dropping the counts column.

It should be:

na.data.new <- data.new.1[, -c(nc), drop = FALSE]

Maybe some warning for single-variable matches would also be useful.
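The coercion is standard base R behavior, which a quick sketch makes clear and which the `drop = FALSE` argument prevents:

```r
# Toy data.frame standing in for data.new.1 in tableCounts.R
df <- data.frame(gamma.1 = 1:3, gamma.2 = 4:6, counts = 7:9)

class(df[, -3])                      # "data.frame": two columns remain
class(df[, -c(2, 3)])                # "integer": a lone column is dropped to a vector
class(df[, -c(2, 3), drop = FALSE])  # "data.frame": drop = FALSE keeps the shape
```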

Long runtime on sampled data

I'm trying to get fastLink to merge two copies of the California voter file that are four years apart--one from 2012 and one from 2016. My strategy is to use the method from the APSR paper (at least as I understand it). But I'm getting stuck on what I thought would be the fast part.

I'm running fastLink first with 5% samples of each file. I then plan to block the full files on gender plus as many bins of first name as it takes to get down to about 250K cases in each bin (again, copying the APSR paper). I was assuming that it was best to run the sampled stage without blocking, because the blocks from the sampling needed to match up to the blocks from the full file which would leave too few units to match in each block (because the samples are so much smaller).

Bottom line is that I'm just brute-forcing the sampling stage. Here's the code (each file has been deduplicated beforehand):

d12.sub <- sample_frac(d12, size=0.05)
d16.sub <- sample_frac(d16, size=0.05)

rs.out <- fastLink(
dfA = d12.sub, dfB = d16.sub,
varnames = c("lname", "fname", "mname", "latlong", "bdate"),
stringdist.match = c("lname", "fname"),
partial.match = c("lname", "fname"),
estimate.only = TRUE
)

Now that I look at this, I realize that "latlong" is a string but wasn't identified that way. "mname" is also a string but with length one (i.e., just a middle initial). Not sure if that creates problems. At any rate, this has been running for the last 6 days, and is stuck here:

====================
fastLink(): Fast Probabilistic Record Linkage

If you set return.all to FALSE, you will not be able to calculate a confusion table as a summary statistic.
Calculating matches for each variable.
Getting counts for parameter estimation.
Parallelizing calculation using OpenMP. 54 threads out of 55 are used.

As you can see, I'm running this on servers that have a lot of parallelization capacity. Am I doing something wrong? If not, are there any recommendations for how to speed this up? Does it make sense for something like this to run so long? It has been running so long now that I'm now scared to stop it and play around, for fear it's just about to finish. Thanks in advance!

question - matching multiple datasets

Hi developers! Thank you so much for this package it is grreaattt!

I was wondering if there's a way to match multiple datasets dfA, dfB, dfC etc?

How scalable is fastLink with million-row tables?

I searched for this question but could not find any performance-related answers. Could anyone say whether fastLink is scalable enough for tables that exceed a million rows? Thank you.

Blocking strategy with millions of rows

I’m trying to link files with millions of lines (10M, up to 50M), using blocking to reduce the number of pairs.

The fastLink tutorial handles blocking by applying the algorithm independently on each block. However, in my case that's not possible because my blocking variables have a lot of different values and some of the blocks are small. Moreover, if I choose to use several blocking variables and consider the union of all remaining pairs, the blocks will not necessarily be disjoint.

The other strategy shown in the tutorial consists of estimating the parameters on a sample of the whole Cartesian product and then applying these estimates to each block. But the literature highlights that parameter estimation via the EM algorithm is biased when the proportion of true matches becomes very small, so I'm not comfortable with doing this either.

I would like to block and then apply the algorithm on all the remaining pairs, but that doesn’t seem compatible with the way blocking is implemented in fastLink.

Do you think of any other blocking strategy here?
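One generic option (sketched below in base R with made-up blocking keys, not fastLink's own blocking helpers) is to build candidate row-index pairs per blocking variable and take their union, which handles non-disjoint blocks naturally; scoring then runs over the deduplicated pair list:

```r
# Hypothetical data; "zip" and "byear" are illustrative blocking keys
dfA <- data.frame(zip = c("10001", "10002", "10001"), byear = c(1980, 1990, 1975))
dfB <- data.frame(zip = c("10001", "10003"),          byear = c(1990, 1980))

# All (row of dfA, row of dfB) pairs that agree on one blocking key
pairs_on <- function(dfA, dfB, key) {
  a <- data.frame(idA = seq_len(nrow(dfA)), key = dfA[[key]])
  b <- data.frame(idB = seq_len(nrow(dfB)), key = dfB[[key]])
  merge(a, b, by = "key")[, c("idA", "idB")]
}

# Union over several keys; unique() removes pairs that fall into more
# than one (overlapping) block, so blocks need not be disjoint
candidates <- unique(rbind(pairs_on(dfA, dfB, "zip"),
                           pairs_on(dfA, dfB, "byear")))
nrow(candidates)  # 4 candidate pairs instead of 3 * 2 = 6
```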

not all patterns with NA counted?

It seems that if a variable has missing values, not all patterns are counted. Is this intended?

g1 = gammaCKpar(dfA$firstname, dfB$firstname)
g2 = gammaCKpar(dfA$lastname, dfB$lastname)
tc = tableCounts(list(g1, g2), nrow(dfA), nrow(dfB))
Parallelizing calculation using OpenMP. 1 threads out of 8 are used.
tc
     gamma.1 gamma.2 counts
[1,]       0       0 172338
[2,]       1       0    271
[3,]       2       0   2170
[4,]       0       1     50
[5,]       0       2    120
[6,]       1       2      1
[7,]       2       2     50
attr(,"class")
[1] "fastLink"    "tableCounts"

No missing values in these two variables. Counts sum to 175000 (== 500 * 350), and pattern (2, 2) has count of 50.

Add middlename, which has missing values:

> g3 = gammaCKpar(dfA$middlename, dfB$middlename)
> t = tableCounts(list(g1, g2, g3), nrow(dfA), nrow(dfB))
Parallelizing calculation using OpenMP. 1 threads out of 8 are used.
> t
      gamma.1 gamma.2 gamma.3 counts
 [1,]       0       0       0 115305
 [2,]       1       0       0    193
 [3,]       2       0       0   1477
 [4,]       0       1       0     39
 [5,]       0       2       0     79
 [6,]       1       2       0      1
 [7,]       0       0       1     24
 [8,]       0       0       2    816
 [9,]       1       0       2      2
[10,]       2       0       2     10
[11,]       0       2       2      1
[12,]       2       2       2     43
[13,]       0       0      NA  50690
[14,]       1       0      NA     68
[15,]       2       0      NA    615
[16,]       0       1      NA     10
[17,]       0       2      NA     37
attr(,"class")
[1] "fastLink"    "tableCounts"

Counts now sum to 169410 so it appears 5590 pairs have not been counted. Pattern (2, 2, 2) has count of 43, but there are no other patterns starting (2, 2, ...) so 7 pairs that match on both firstname and lastname do not seem to appear in this table.

When I made my own code (in Julia) to count patterns, I got the following result:
0 0 0       115305
1 0 0          193
2 0 0         1477
0 1 0           39
0 2 0           79
1 2 0            1
0 0 1           24
0 0 2          816
1 0 2            2
2 0 2           10
0 2 2            1
2 2 2           43
0 0 missing  56193
1 0 missing     76
2 0 missing    683
0 1 missing     11
0 2 missing     40
2 2 missing      7

Differences from the fastLink results are all in the patterns containing missing values.
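The expected bookkeeping is easy to check on a toy example: every one of the nrow(dfA) * nrow(dfB) pairs should land in exactly one pattern, with missing-value patterns tallied separately (a base R sketch, not fastLink's internals; 0 = disagree, 2 = agree, NA = missing):

```r
a <- c("ann", "bob", NA)
b <- c("ann", "carl")

# Pairwise agreement codes over the full cross product
gamma <- outer(a, b, function(x, y)
  ifelse(is.na(x) | is.na(y), NA, ifelse(x == y, 2, 0)))

counts <- table(gamma, useNA = "ifany")
counts       # 0: 3, 2: 1, NA: 2
sum(counts)  # 6 == length(a) * length(b): no pair is lost
```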

Uninformative error in fastLink during imputation

Hi,

While running fastLink I have gotten the same error several times with different parameters and I don't understand how to debug it.

My code is this:

fs.out <- fastLink(
dfA = dfA, dfB = dfB,
varnames = c('FirstName', 'LastName', 'DOB_str', 'street_num', 'street_name', 'Gender'),
stringdist.match = c('FirstName', 'LastName', 'DOB_str', 'street_num', 'street_name'),
partial.match = c('FirstName', 'LastName', 'street_name'),
gender.field = 'Gender',
cut.a = 0.95, cut.p = 0.85,
em.obj = fs.out.10$EM
)

The EM object is the output from a 10% sample of the data, which ran fine and is adding to my confusion. I've tried adjusting the cut.a and cut.p parameters several times (as suggested in a closed issue here that was related) and it hasn't helped.

The error I receive is this:
If you set return.all to FALSE, you will not be able to calculate a confusion table as a summary statistic.
Calculating matches for each variable.
Getting counts for parameter estimation.
Parallelizing calculation using OpenMP. 7 threads out of 8 are used.
Imputing matching probabilities using provided EM object.
Error in p.gamma.k.m[[i]] : subscript out of bounds

Any assistance would be appreciated. Thank you!

Using reweight.names in fastlink() returns only completely NA rows

I've run the fastLink function both with and without the reweight.names option to ensure the data is matched without issue otherwise.

Code:

fastLink(dfA = dfA, dfB = dfB, varnames = c("first", "last", "company"), stringdist.match = c("first", "last", "company"), stringdist.method = "lv", return.df = TRUE, reweight.names = TRUE, firstname.field = "first", dedupe.matches = FALSE, verbose = TRUE)

The matched data output includes NA cases; each field for each case is "NA":


Any idea what's gone wrong here? Thank you for looking into this.

Guidance on improving chances EM algorithm will converge?

@tedenamorado and @kosukeimai thanks so much for making all of this hard work available in this R package! I'm wondering if you had published some guidance or suggestions on what situations lead the EM algorithm to fail to converge.

Unfortunately, my data is not shareable so I'm having trouble giving you a reprex but, broadly, I'm linking birth data with hospitalization data for many different years and I'm having trouble pinpointing what is causing a failure to converge. Sometimes it does, sometimes it doesn't converge.

It does seem that if I exclude any record with any NA value I get convergence more often. But I'd really like to keep these records, and the proportion of NAs in the variables (at most 4.5%) does not "seem" too high. In any case, excluding records with NA values is rarely a workable solution.

I'm running the linkage, in many cases, on a 200k subsample in my efforts to figure out where the issue is. Some facts:

  1. In most cases, I'm using DOB, last name, first name, race and municipal code
  2. None of these variables is more than 4.5% missing

Any guidance on what I might do to improve the chances the EM algorithm will converge?

lnk <- fastLink::fastLink(
  dfA = dfA,
  dfB = dfB,
  varnames =  c("lk_dob", "lk_last", "lk_first", "lk_race", "lk_muni_res"),
  # dob as string match with cut of 0.95 will give a match for a one-digit difference in last few numbers
  stringdist.match = c("lk_dob", "lk_last", "lk_first"),
  cut.a = 0.95,
  dedupe.matches = FALSE,
  threshold.match =  0.975,
  verbose = TRUE
)

Vignette: Missing "gender" variable in the example datasets

Hey,

I have tried to replicate the examples from the vignette and got stuck at the blocking part.
I cannot find the "gender" variable in the dfA and dfB datasets provided by the library.

Could you add the variable in the data?

Best regards,
Mateusz Najsztub
