cleanzr / dblink
Distributed Bayesian Entity Resolution in Apache Spark
License: Other
Hi,
Thanks for your paper!
I'm looking for datasets for the task of Entity Resolution. Can you share the preprocessed data used in "d-blink: Distributed End-to-End Bayesian Entity Resolution"?
Best regards,
Mariia
Describe the format of the linkage structure samples, diagnostics, and point estimates.
We reran the ABSEmployee file earlier this week, and I'm not seeing a "bad partition" error either. I must have made it up. Sorry about that.
We see something related to the TransportRequestHandler and then a garbage collection memory error. We'll try to troubleshoot tomorrow and next week. Let us know if you have any ideas. We're running with 20 GB of driver memory.
ABSEmployee_numlevel6_log_redact.txt
ABSEmployee_numlevel6_log_redact2.txt
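If the GC/out-of-memory error is happening on the executors rather than the driver, one thing worth trying is raising executor memory and overhead in the spark-submit invocation. These are standard Spark options, not dblink-specific ones, and the values below are illustrative only; keep the rest of the spark-submit arguments as in the existing invocation:

```shell
# Standard Spark memory options (values illustrative).
spark-submit \
  --driver-memory 20g \
  --executor-memory 16g \
  --conf spark.executor.memoryOverhead=2g \
  <your existing dblink arguments>
```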
We are working to have our tech support create an EMR cluster for us to test your public data, and we want to ensure that our versions are correct so we can replicate your findings.
Your guide recommends the use of OpenJDK. The current version of OpenJDK is Java SE 13. Daniel had an error while running d-blink with Java 13, which went away when he switched to version 8. Have you tested d-blink with Java 13? Can you indicate the version of OpenJDK / Java that you used for testing? Should I assume version 8?
Which version of JVM should we be using?
Which version of Hadoop did you use?
The dblink application ran well on RLdata10000.csv with sample sizes of 100 and 1,000. A sample size of 1,000 reproduced the results for RLdata10000.csv in the published paper.
We ran the dblink application with a 10K sample size on two data files (RLdata10000.csv, EDMSimulationv2.mac.csv); both ran for a long time (more than 6 hours) and failed. The spark-submit log files are attached here (after removing IP addresses and server names), along with the config files used to run the application.
After certain stages, the application runs on only a few cores even though spark-submit was given 345 cores (a sample screenshot is attached below).
ABSEmployee_log.txt
RLdata10000_log.txt
ABSEmployee_64partitions_PCG-I.txt.conf.txt
RLdata10000.conf.txt
When running in YARN mode, we get the warning message below.
WARN SparkContext: Spark is not running in local mode, therefore the checkpoint directory must not be on the local filesystem. Directory '/tmp/spark_checkpoint/' appears to be on the local filesystem.
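For reference, on YARN the checkpoint directory must live on a cluster-visible filesystem such as HDFS, not on a worker's local disk. A minimal sketch of the fix, assuming the checkpoint location is set through the dblink `.conf` file (the `checkpointPath` key name here is hypothetical; check the actual key used in your config):

```hocon
dblink {
  // Hypothetical key name: on YARN, point checkpointing at HDFS, not local disk
  checkpointPath : "hdfs:///tmp/spark_checkpoint/"
}
```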
Hi ,
Is it possible to share your development environment information in the documentation? What kind of Spark cluster are you using: is it on AWS or a standalone cluster? Are you using Eclipse, IntelliJ IDEA, or some other IDE? Is the Scala IDE or the Scala IDE plugin used?
These questions are more about understanding the application better than reporting an issue.
Thank you,
Yathish
We're limited to the software that our IT will allow. We are trying to get IntelliJ IDEA installed as indicated in the guide.
Can we bypass this issue by using the prebuilt JAR file, which the guide says is recommended? If so, can you fix the link in the guide? It looks like the link is a placeholder that hasn't been filled in.
d-blink issues an error on the SHIW0810 data set with a sample size of 10K, and the application crashes.
On the same data set with a sample size of 1,000, it reports a soft error, but the application still produces results.
Both driver logs are attached here.
hard_error_driver_log.txt
soft_error_driver_log.txt
Thank you,
Yathish
Hi Neil, I have a few questions about the numLevels parameter and dblink's partitioning.
Does numLevels control the mathematical partitions, the Spark partitions, or both? We've gotten mixed answers to this question in the past. Our experience with dblink suggests that the numLevels parameter controls the Spark partitions, as we haven't been able to run a job with more partitions than those specified by numLevels. I think Beka previously indicated that numLevels controls the mathematical partitions.
To get the ABSEmployee data (around 600k records) to run, we had to change the config file that you provided to lower numLevels from 6 to 4. While running with numLevels=6, we got a "bad partition" error; we lowered the parameter on Beka's suggestion and everything ran correctly. With the three datasets we've worked with, it seems that as the number of observations increases, the numLevels parameter needs to be reduced. Is this accurate? If so, it will pose a big challenge for processing datasets with millions of records, because we won't be able to farm out the extra work to more executors.
If numLevels is not related to the dataset size, what are the data characteristics / factors that determine how high the numLevels parameter can be set?
Are you or Beka working on a change that would let the mathematical and Spark partitions be controlled separately, so that the number of physical partitions can be increased? If I recall correctly, Beka mentioned last month that you were working on an "indexing" issue; is that related to the partitioning issue?
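For context on what numLevels means mathematically: in the d-blink paper, the space of entities is partitioned with a k-d tree, so numLevels levels of binary splitting yield 2^numLevels leaf partitions. A minimal pure-Scala sketch of that counting (the median split rule below is a hypothetical stand-in for d-blink's actual attribute-based splits, and `PartitionSketch` is an illustrative name, not part of the dblink codebase):

```scala
// Hedged sketch (not the actual d-blink implementation): numLevels levels of
// binary splitting produce 2^numLevels leaf partitions of the entity space.
object PartitionSketch {
  // Number of leaf partitions produced by numLevels levels of binary splitting.
  def leafCount(numLevels: Int): Int = 1 << numLevels

  // Recursively split a collection of entity ids into 2^numLevels buckets.
  // Median split on sorted ids is a placeholder for the real splitting rule.
  def partition(ids: Seq[Int], numLevels: Int): Seq[Seq[Int]] =
    if (numLevels == 0) Seq(ids)
    else {
      val sorted = ids.sorted
      val (left, right) = sorted.splitAt(sorted.length / 2)
      partition(left, numLevels - 1) ++ partition(right, numLevels - 1)
    }
}
```

Under this reading, numLevels = 6 implies 2^6 = 64 partitions, which is consistent with the "64partitions"/"numlevel6" naming of the attached files above.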
log4j.properties controls the logging generated on the Spark driver and worker nodes.
Can we change
log4j.rootCategory=INFO, console, file
to
log4j.rootCategory=WARN, console
in the log4j.properties file?
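For reference, a minimal log4j.properties along those lines, in the standard Log4j 1.x syntax that Spark's conf/log4j.properties.template uses (the console-appender lines are the usual boilerplate, shown here only so the fragment is self-contained):

```properties
# Log only warnings and above, to the console appender only
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```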
Hi Neil,
We've successfully run dblink on our system, but the evaluation metrics for RLdata10000 don't match those in your paper.
Can you provide the config files for RLdata10000 that you used in your paper? We were testing the config file you included in the examples on the repo. We're also planning to test the ABSEmployee and SHIW0810 data that you provided. Can you provide config files for those as well?
How do you want us to provide the results to you? We can zip the output and post it here if that doesn't violate any of your data agreements.
Thanks,
Casey
Hello ,
It looks like the JAR file was not generated from the same source code on Git. The JAR file produces an additional step in the output run file; the source code in project.scala does not have this step.