
dblink's People

Contributors

ngmarchant, resteorts

dblink's Issues

Data from d-blink paper

Hi,
Thanks for your paper!
I'm looking for datasets for the task of Entity Resolution. Can you share the preprocessed data used in "d-blink: Distributed End-to-End Bayesian Entity Resolution"?
Best regards,
Mariia

Java / Hadoop versions

We are working to have our tech support create an EMR cluster for us to test your public data, and we want to ensure that our versions are correct so we can replicate your findings.

  1. Your guide recommends the use of OpenJDK. The current version of OpenJDK is Java SE 13. Daniel had an error while running d-blink with Java 13, which went away when he switched to version 8. Have you tested d-blink with Java 13? Can you indicate the version of OpenJDK / Java that you used for testing? Should I assume version 8? (The snippet after this list shows how we check the runtime version on our end.)

  2. Which version of the JVM should we be using?

  3. Which version of Hadoop did you use?
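
For what it's worth, here is how we confirm which JVM and Scala versions the Spark driver actually runs on (a quick spark-shell check, not part of dblink):

    // Run inside spark-shell: print the JVM and Scala versions in use.
    println(System.getProperty("java.version"))   // e.g. "1.8.0_252" on Java 8
    println(scala.util.Properties.versionString)  // e.g. "version 2.11.12"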

10K sample runs

The dblink application ran well on RLdata10000.csv with sample sizes of 100 and 1,000. A sample size of 1,000 produced the same results as in the published paper for RLdata10000.csv.

We then ran the dblink application with a 10K sample size on two data files (RLdata10000.csv and EDMSimulationv2.mac.csv); both ran for a long time (more than 6 hours) and failed. The spark-submit log files are attached here (after removing IP addresses and server names), along with the config files used to run the application.

After certain stages, the application runs on only a few cores, even though spark-submit provided 345 cores (a sample screenshot is attached below; see our working theory after the attachment list).

  • ABSEmployee_log.txt
  • RLdata10000_log.txt
  • ABSEmployee_64partitions_PCG-I.txt.conf.txt
  • RLdata10000.conf.txt
  • Capture (screenshot)
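
Our working theory on the idle cores (not confirmed): Spark runs at most one task per partition in a stage, so if the sampler's RDD has, say, 64 partitions (numLevels = 6 in the attached config), at most 64 of the 345 cores can be busy at once. A minimal spark-shell illustration of the cap:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("parallelism-cap").getOrCreate()
    // A stage over an RDD with 64 partitions runs at most 64 concurrent tasks,
    // regardless of how many cores spark-submit provides.
    val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 64)
    println(rdd.getNumPartitions)    // 64
    println(rdd.map(_ * 2L).count()) // executes as waves of at most 64 tasks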

Path for Spark checkpoints

When running in YARN mode, the following warning message appears:

WARN SparkContext: Spark is not running in local mode, therefore the checkpoint directory must not be on the local filesystem. Directory '/tmp/spark_checkpoint/' appears to be on the local filesystem.
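
A possible fix (our sketch, not from the dblink docs): point the checkpoint directory at a shared filesystem such as HDFS. At the Spark level this is done with SparkContext.setCheckpointDir; the HDFS path below is an example, not dblink's default:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("dblink-on-yarn").getOrCreate()
    // On YARN the checkpoint directory must live on a shared filesystem
    // (e.g. HDFS), not on the driver's local disk. The path is an example.
    spark.sparkContext.setCheckpointDir("hdfs:///user/dblink/spark_checkpoint")

If dblink takes the checkpoint path from its .conf file, substituting an hdfs:// path there should have the same effect.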

Development environment

Hi,

Is it possible to share your development environment information in the documentation? What kind of Spark cluster are you using: is it on AWS, or a standalone cluster? Are you using Eclipse, IntelliJ IDEA, or some other IDE? Is the Scala IDE, or a Scala IDE plugin, used?

These questions are more about understanding the application than reporting an issue.
Thank you,
Yathish

Java compiler

We're limited to the software that our IT will allow. We are trying to get IntelliJ IDEA installed as indicated in the guide.

Can we bypass this issue by using the prebuilt JAR file, which the guide says is recommended? If so, can you fix the link in the guide? It looks like the link is a placeholder that hasn't been filled in.

SHIW0810

d-blink is raising an error on the SHIW0810 data set with a sample size of 10K; the application crashes.

On the same data set with a sample size of 1,000, it raises a soft error, but the application still produces results.

Both driver logs are attached here.

hard_error_driver_log.txt
soft_error_driver_log.txt

Thank you,
Yathish

Relationship between numLevels and Spark executors

Hi Neil, I have a few questions about the numLevels parameter and dblink's partitioning.

  1. Does numLevels control the mathematical partitions, the Spark partitions, or both? We've gotten mixed answers to this question in the past. Our experience with dblink suggests that the numLevels parameter controls the Spark partitions, as we haven't been able to run a job with more partitions than those specified by numLevels. I think previously, Beka indicated that numLevels controls the mathematical partitions.

  2. In order to run the ABSEmployee data, around 600k records, we had to change the config file that you provided to lower numLevels from 6 to 4. While running with numLevels=6, we got a "bad partition" error; we lowered the parameter on Beka's suggestion and everything ran correctly. With the 3 datasets that we've worked with, it seems that as the number of observations increases, the numLevels parameter needs to be reduced. Is this accurate? If so, it will pose a big challenge for processing datasets with millions of records, because we won't be able to farm out the extra work to more executors. (See the partition-count sketch after this list.)

  3. If numLevels is not related to the dataset size, what are the data characteristics / factors that determine how high the numLevels parameter can be set?

  4. Are you or Beka working on a change that would let the mathematical and Spark partitions be controlled separately, so that the number of physical partitions can be increased? If I recall correctly, Beka mentioned last month that you were working on an "indexing" issue; is that related to this partitioning issue?
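
For what it's worth, our reading of the example configs is that numLevels sets the depth of a binary splitting tree, so the partition count is 2^numLevels; the attached ABSEmployee_64partitions config uses numLevels = 6, and 2^6 = 64. A tiny Scala sketch of that assumed relationship:

    // Assumed relationship, inferred from the example configs rather than
    // the dblink source: numLevels is the depth of a binary partition tree.
    def numPartitions(numLevels: Int): Int = 1 << numLevels

    println(numPartitions(6)) // 64, matching the "ABSEmployee_64partitions" config
    println(numPartitions(4)) // 16, after lowering numLevels as Beka suggested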

Controlling logging

log4j.properties controls the logging generated on the Spark driver and worker nodes. Can we change

log4j.rootCategory=INFO, console, file

to

log4j.rootCategory=WARN, console

in the log4j.properties file?
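
For reference, a minimal log4j.properties along the lines proposed above (the WARN threshold and console-only root category are the change in question; the appender settings are standard log4j 1.x boilerplate, essentially Spark's bundled template):

    # Show only WARN and above, and log to the console only (no file appender).
    log4j.rootCategory=WARN, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

The modified file can be shipped to the driver and executors with spark-submit's --files option.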

Config files for reproducing experiments

Hi Neil,

We've successfully run dblink on our system, but the evaluation metrics for RLdata10000 don't match those in your paper.

Can you provide the config files for RLdata10000 that you used in your paper? We were testing the config file you included in the examples on the repo. We're also planning to test the ABSEmployee and SHIW0810 data that you provided. Can you provide config files for those as well?

How do you want us to provide the results to you? We can zip the output and post it here if that doesn't violate any of your data agreements.

Thanks,
Casey

Source code and Jar file

Hello,

It looks like the jar file was not generated from the same source code that is on Git. The jar file adds an additional step to the run output, and the source file project.scala does not have this step.

Scheduled steps

  • SampleStep: Evolving the chain from new initial state with sampleSize=100, burninInterval=0, thinningInterval=10 and sampler=PCG-I
  • SummarizeStep: Calculating summary quantities {'cluster-size-distribution', 'partition-sizes'} along the chain for iterations >= 0
  • EvaluateStep: Evaluating sMPC clusters (computed from the chain for iterations >= 100) using {'pairwise', 'cluster'} metrics
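
For anyone comparing against their own config, here is a guess at how these three steps would appear in the .conf file (a hypothetical sketch only: the key names are inferred from the scheduled-steps log output above and have not been verified against dblink's config schema):

    // Hypothetical sketch: key names inferred from the log output above,
    // not verified against dblink's actual config schema.
    steps = [
      {name = "sample", parameters = {
        sampleSize = 100, burninInterval = 0, thinningInterval = 10, sampler = "PCG-I"
      }},
      {name = "summarize", parameters = {
        lowerIterationCutoff = 0, quantities = ["cluster-size-distribution", "partition-sizes"]
      }},
      {name = "evaluate", parameters = {
        lowerIterationCutoff = 100, metrics = ["pairwise", "cluster"]
      }}
    ]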
