Comments (11)

tedenamorado commented on August 18, 2024

Hi Dan,

Thanks for using fastLink! Believe me, it is through the feedback of the users that we have been able to improve the package in very meaningful ways.

There is a typo in the paper: it should read "first and last name". In the RLdata500 dataset, there are two sources for last names. However, one of them (lname_c2) has missing values almost everywhere: there are only 8 observed values out of 500. We do not use that variable in our replication code. In the simulation settings, we have shown that a large amount of missing information can lead to problems; basically, the estimated parameters of the model can be far from the truth.

However, even if you were to exclude that variable from your code, I just found out that the wrapper has a bug when performing a deduplication exercise. We will fix that soon and let you know when the wrapper has been fixed.

Wrapper aside, the following lines of code should reproduce the exercise we did in the paper. The code below follows the step-by-step procedure we describe here.

library('RecordLinkage')
data(RLdata500)  # makes RLdata500 and identity.RLdata500 available
RLdata500$id <- identity.RLdata500

library('fastLink')
## Create Agreement Vectors
g1 <- gammaCKpar(RLdata500$fname_c1, RLdata500$fname_c1, cut.a = 0.94, cut.p = 0.88)
g2 <- gammaCKpar(RLdata500$lname_c1, RLdata500$lname_c1, cut.a = 0.94, cut.p = 0.88)
g3 <- gammaKpar(RLdata500$by, RLdata500$by)
g4 <- gammaKpar(RLdata500$bm, RLdata500$bm)
g5 <- gammaKpar(RLdata500$bd, RLdata500$bd)
nr <- nrow(RLdata500)

## Count Patterns + EM
counts <- tableCounts(list(g1, g2, g3, g4, g5), nobs.a = nr, nobs.b = nr)
resEM <- emlinkMARmov(counts, nobs.a = nr, nobs.b = nr)

## Matches
matches <- matchesLink(list(g1, g2, g3, g4, g5), nobs.a = nr, nobs.b = nr, em = resEM, thresh = 0.85)

## Duplicates: there should be 600 matched pairs,
## 500 perfect matches (each row matched to itself) + 100 duplicates,
## while there are only 50 duplicates in the data. Each one is counted twice:
## finding that row 1 in A is a duplicate of row 2 in B
## is equivalent to finding that row 2 in A is a duplicate of row 1 in B
matches.1 <- RLdata500[matches$inds.a, ]
matches.2 <- RLdata500[matches$inds.b, ]
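
One way to reduce those 600 pairs to the distinct duplicate pairs is to drop the self-matches and keep only one ordering of each pair. The lines below are a minimal sketch of that idea, assuming inds.a and inds.b are parallel vectors of row indices as used above. The toy `matches` list is made up for illustration and is not actual fastLink output:

```r
# Toy stand-in for the matchesLink() result (hypothetical values):
# three self-matches plus one duplicate pair reported in both orders.
matches <- list(inds.a = c(1, 2, 3, 1, 2),
                inds.b = c(1, 2, 3, 2, 1))

# Keep only pairs with a < b. This drops self-matches (a == b)
# and one of the two mirrored copies of each duplicate pair.
keep <- matches$inds.a < matches$inds.b
dups <- data.frame(row.a = matches$inds.a[keep],
                   row.b = matches$inds.b[keep])

nrow(dups)  # number of distinct duplicate pairs in the toy example
```

Applied to the real result, the same filter should leave one row per duplicate pair, provided every duplicate is indeed reported in both orders.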

I hope the code above helps! If you have further questions, just let us know.

Ted

from fastlink.

1danjordan commented on August 18, 2024

Hi Ted,

Thanks a million for your quick response! After a good bit of fiddling and reading, I realised that I wasn't using the fastLink wrapper correctly because I wasn't passing the birth date variables into the numeric.match argument. Doing this resulted in an error; here's the traceback:

library(dplyr)     # for %>%, as_tibble, mutate_if, mutate, row_number
library(fastLink)

data("RLdata500", package = "RecordLinkage")

# prep data 
rl_data <- RLdata500 %>% 
    as_tibble %>% 
    mutate_if(is.factor, as.character) %>% 
    mutate(n = row_number())

matching_vars     <- c("fname_c1", "lname_c1", "by", "bm", "bd")

rl_matches <- fastLink(
  dfA                = rl_data, 
  dfB                = rl_data,
  varnames           = c("fname_c1", "lname_c1", "by", "bm", "bd"),
  stringdist.match   = c("fname_c1", "lname_c1"),
  numeric.match      = c("by", "bm", "bd")
  )
Error in calcPWDcpp(matchesA[, varnames[i]], matchesB[, varnames[i]]) : 
  Not a matrix.
4. stop(structure(list(message = "Not a matrix.", call = calcPWDcpp(matchesA[, varnames[i]], matchesB[, varnames[i]]), cppstack = NULL), .Names = c("message", "call", "cppstack"), class = c("Rcpp::not_a_matrix", "C++Error", "error", "condition")))
3. calcPWDcpp(matchesA[, varnames[i]], matchesB[, varnames[i]])
2. dedupeMatches(matchesA = dfA[matches$inds.a, ], matchesB = dfB[matches$inds.b, ], EM = resultsEM, matchesLink = matches, varnames = varnames, stringdist.match = stringdist.match, numeric.match = numeric.match, partial.match = partial.match, linprog = linprog.dedupe, ...
1. fastLink(dfA = rl_data, dfB = rl_data, varnames = c("fname_c1", "lname_c1", "by", "bm", "bd"), stringdist.match = c("fname_c1", "lname_c1"), numeric.match = c("by", "bm", "bd"))

I'm assuming that this is the error that you have run into?

Also, I'll go ahead and use the functions directly from the package like you've suggested. Thank you for the example above.

Cheers,
Dan

tedenamorado commented on August 18, 2024

Exactly, Dan! We are working on fixing that issue. However, the code I posted does what we describe in the paper. When we wrote the paper, we did not have a function to compare distances for numeric variables; now we have one.

We are constantly trying to incorporate new functions that help with record linkage projects, so if you have any suggestions, do not hesitate to let us know.

Cheers!

Ted

tedenamorado commented on August 18, 2024

Hi Dan,

We have pushed a fix that solves the issue. Please install fastLink again from GitHub and try the lines you wrote above.

If anything, please let us know.

Ted

katharinax commented on August 18, 2024

Hi,

I am a new user of fastLink. @tedenamorado, thank you for developing this very useful package, and @dandermotj, thank you for starting this active thread.

As listed below, I am still experiencing the two issues you mentioned. The fastLink version I am using is 0.3.1 published on 2018-02-01. Running on R version 3.4.4. I wasn't able to find a package version newer than this. Is it still under development?

  1. Passing in any numeric.match arguments will result in Error in calcPWDcpp(matchesA[, varnames[i]], matchesB[, varnames[i]]) : Not a matrix.

  2. Unable to return all matches without de-duplication. More specifically, when I specify return.all = FALSE, the step "Calculating the posterior for each pair of matched observations" fails with Error in fix.by(by.x, x) : 'by' must match number of columns. However, when return.all is set to TRUE, I believe dedupe.matches gets overridden to TRUE as well.

Appreciate it.

Best,
Katie

aalexandersson commented on August 18, 2024

@katharinax Katie, I assure you as an active user that fastLink is very much under active development by the developers. Use the latest development version on GitHub if you need something newer than the stable version on CRAN. If you still have a problem, please make sure to provide a reproducible example.

Anders

katharinax commented on August 18, 2024

@aalexandersson I see; will try the development version. And I'll provide reproducible examples if I have further questions. Thanks!

tedenamorado commented on August 18, 2024

@katharinax thanks for using fastLink. As noted above by @aalexandersson, we have fixed the issue on GitHub - we are planning to push a new version to CRAN soon with that fix included.

If you are using a PC or Linux machine, then installing from GitHub via devtools should be straightforward. If you are using a Mac, then installing from GitHub requires an additional step (happy to help if that is the case).
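
For reference, a GitHub install via devtools typically looks like the sketch below. The repository path "kosukeimai/fastLink" is an assumption, since it is not stated in this thread, and install_fastlink_dev is a hypothetical helper name, used only so that nothing is installed until the function is called:

```r
# Hypothetical helper: installs the development version of fastLink.
# The default repo path is an assumption; adjust it if the package
# lives elsewhere.
install_fastlink_dev <- function(repo = "kosukeimai/fastLink") {
  # Install devtools first if it is not already available.
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  devtools::install_github(repo)
}
```

Calling install_fastlink_dev() and then restarting R should pick up the development version, assuming the compiler toolchain is set up.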

Please, keep us posted!

All the best,

Ted

katharinax commented on August 18, 2024

Hi @tedenamorado,

I am using a Mac and I'm actually not that familiar with R.

My current plan is to git clone your repo, and then link the source code to my own project by doing source("localGitRepoRootFolder/R/fastLink.R"). Not sure if this sounds stupid, ha ;) I'd love to hear your suggestion!

Thanks,
Katie

tedenamorado commented on August 18, 2024

Hi Katie,

The problem is that you need OpenMP to work on your Mac. The following is a fantastic explanation of how to make that happen:

http://thecoatlessprofessor.com/programming/openmp-in-r-on-os-x/

The other thing that you might need to do is update the command line tools:

xcode-select --install

Hope this helps! If anything, let us know.

All the best,

Ted

katharinax commented on August 18, 2024

@tedenamorado This is very enlightening. Will look into this now. I am also downloading the devtools package. Thanks for all the pointers! Much appreciated!

Katie
