cleanzr / dblink
Distributed Bayesian Entity Resolution in Apache Spark
License: Other
Hi,
Thanks for your paper!
I'm looking for datasets for the task of Entity Resolution. Can you share the preprocessed data used in "d-blink: Distributed End-to-End Bayesian Entity Resolution"?
Best regards,
Mariia
Describe the format of the linkage structure samples, diagnostics, and point estimates.
We reran the ABSEmployee file earlier this week, and I'm not seeing a "bad partition" error either. I must have made it up. Sorry about that.
We see something related to the TransportRequestHandler and then a garbage collection memory error. We'll try to troubleshoot tomorrow and next week. Let us know if you have any ideas. We're running with 20 GB of driver memory.
ABSEmployee_numlevel6_log_redact.txt
ABSEmployee_numlevel6_log_redact2.txt
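If the GC/out-of-memory error is happening on the executors rather than the driver, one thing worth trying is raising executor memory and overhead in the spark-submit invocation. These are standard Spark options, not dblink-specific ones, and the values below are illustrative only; keep the rest of the spark-submit arguments as in the existing invocation:

```shell
# Standard Spark memory options (values illustrative).
spark-submit \
  --driver-memory 20g \
  --executor-memory 16g \
  --conf spark.executor.memoryOverhead=2g \
  <your existing dblink arguments>
```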
We are working to have our tech support create an EMR cluster for us to test your public data, and we want to ensure that our versions are correct so we can replicate your findings.
Your guide recommends the use of OpenJDK. The current version of OpenJDK is Java SE 13. Daniel had an error while running d-blink with Java 13, which went away when he switched to version 8. Have you tested d-blink with Java 13? Can you indicate the version of OpenJDK / Java that you used for testing? Should I assume version 8?
Which version of JVM should we be using?
Which version of Hadoop did you use?
The dblink application ran well on RLdata10000.csv with sample sizes of 100 and 1,000. A sample size of 1,000 reproduced the results for RLdata10000.csv in the published paper.
We ran the dblink application with a 10K sample size on two data files (RLdata10000.csv, EDMSimulationv2.mac.csv); both ran for a long time (more than 6 hours) and failed. The spark-submit log files are attached here (after removing IP addresses and server names), along with the config files used to run the application.
After certain stages, the application runs on only a few cores even though spark-submit was given 345 cores (a sample screenshot is attached below).
ABSEmployee_log.txt
RLdata10000_log.txt
ABSEmployee_64partitions_PCG-I.txt.conf.txt
RLdata10000.conf.txt
When running in YARN mode, we get the warning message below.
WARN SparkContext: Spark is not running in local mode, therefore the checkpoint directory must not be on the local filesystem. Directory '/tmp/spark_checkpoint/' appears to be on the local filesystem.
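For reference, on YARN the checkpoint directory must live on a cluster-visible filesystem such as HDFS, not on a worker's local disk. A minimal sketch of the fix, assuming the checkpoint location is set through the dblink `.conf` file (the `checkpointPath` key name here is hypothetical; check the actual key used in your config):

```hocon
dblink {
  // Hypothetical key name: on YARN, point checkpointing at HDFS, not local disk
  checkpointPath : "hdfs:///tmp/spark_checkpoint/"
}
```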
Hi ,
Is it possible to share your development environment information in the documentation? What kind of Spark cluster are you using: is it on AWS or a standalone cluster? Are you using Eclipse, IntelliJ IDEA, or some other IDE? Is the Scala IDE or the Scala IDE plugin used?
These questions are more about understanding the application better than reporting an issue.
Thank you,
Yathish
We're limited to the software that our IT will allow. We are trying to get IntelliJ IDEA installed as indicated in the guide.
Can we bypass this issue by using the prebuilt JAR file, which the guide says is recommended? If so, can you fix the link in the guide? It looks like the link is a placeholder that hasn't been filled in.
d-blink issues an error on the SHIW0810 data set with a sample size of 10K, and the application crashes.
On the same data set with a sample size of 1,000, it reports a soft error, but the application still produces results.
Both driver logs are attached here.
hard_error_driver_log.txt
soft_error_driver_log.txt
Thank you,
Yathish
Hi Neil, I have a few questions about the numLevels parameter and dblink's partitioning.
Does numLevels control the mathematical partitions, the Spark partitions, or both? We've gotten mixed answers to this question in the past. Our experience with dblink suggests that the numLevels parameter controls the Spark partitions, as we haven't been able to run a job with more partitions than those specified by numLevels. I think Beka previously indicated that numLevels controls the mathematical partitions.
To get the ABSEmployee data (around 600k records) to run, we had to change the config file that you provided to lower numLevels from 6 to 4. While running with numLevels=6, we got a "bad partition" error; we lowered the parameter on Beka's suggestion and everything ran correctly. With the three datasets we've worked with, it seems that as the number of observations increases, the numLevels parameter needs to be reduced. Is this accurate? If so, it will pose a big challenge for processing datasets with millions of records, because we won't be able to farm out the extra work to more executors.
If numLevels is not related to the dataset size, what are the data characteristics / factors that determine how high the numLevels parameter can be set?
Are you or Beka working on a change that would let the mathematical and Spark partitions be controlled separately, so that the number of physical partitions can be increased? If I recall correctly, Beka mentioned last month that you were working on an "indexing" issue; is that related to the partitioning issue?
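For context on what numLevels means mathematically: in the d-blink paper, the space of entities is partitioned with a k-d tree, so numLevels levels of binary splitting yield 2^numLevels leaf partitions. A minimal pure-Scala sketch of that counting (the median split rule below is a hypothetical stand-in for d-blink's actual attribute-based splits, and `PartitionSketch` is an illustrative name, not part of the dblink codebase):

```scala
// Hedged sketch (not the actual d-blink implementation): numLevels levels of
// binary splitting produce 2^numLevels leaf partitions of the entity space.
object PartitionSketch {
  // Number of leaf partitions produced by numLevels levels of binary splitting.
  def leafCount(numLevels: Int): Int = 1 << numLevels

  // Recursively split a collection of entity ids into 2^numLevels buckets.
  // Median split on sorted ids is a placeholder for the real splitting rule.
  def partition(ids: Seq[Int], numLevels: Int): Seq[Seq[Int]] =
    if (numLevels == 0) Seq(ids)
    else {
      val sorted = ids.sorted
      val (left, right) = sorted.splitAt(sorted.length / 2)
      partition(left, numLevels - 1) ++ partition(right, numLevels - 1)
    }
}
```

Under this reading, numLevels = 6 implies 2^6 = 64 partitions, which is consistent with the "64partitions"/"numlevel6" naming of the attached files above.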
log4j.properties controls the logging generated on the Spark driver and worker nodes.
Can we change
log4j.rootCategory=INFO, console, file
to
log4j.rootCategory=WARN, console
in the log4j.properties file?
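For reference, a minimal log4j.properties along those lines, in the standard Log4j 1.x syntax that Spark's conf/log4j.properties.template uses (the console-appender lines are the usual boilerplate, shown here only so the fragment is self-contained):

```properties
# Log only warnings and above, to the console appender only
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```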
Hi Neil,
We've successfully run dblink on our system, but the evaluation metrics for RLdata10000 don't match those in your paper.
Can you provide the config files for RLdata10000 that you used in your paper? We were testing the config file you included in the examples on the repo. We're also planning to test the ABSEmployee and SHIW0810 data that you provided. Can you provide config files for those as well?
How do you want us to provide the results to you? We can zip the output and post it here if that doesn't violate any of your data agreements.
Thanks,
Casey
Hello ,
It looks like the JAR file was not generated from the same source code on Git. The JAR file produces an additional step in the output run file; the source code in project.scala does not have this step.