wbsg-uni-mannheim / wdc-lspc-v2

This repository contains code and data download scripts for the paper "Using schema.org annotations for training and maintaining product matchers" by Ralph Peeters, Anna Primpeli, Benedikt Wichtlhuber and Christian Bizer.

License: BSD 3-Clause "New" or "Revised" License

Languages: Jupyter Notebook 93.65%, Python 6.35%
Topics: schema-org, deepmatcher, entity-resolution, product-matching

Prerequisites

  1. anaconda (or similar for standard packages)
  2. py_entitymatching
  3. xgboost
  4. deepmatcher

Update: An environment file (wdc-lspc-v2.yml) is now included. It can be used to create a conda environment similar to the one used for the experiments: run conda env create -f wdc-lspc-v2.yml.
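
Before opening the notebooks, it can help to confirm that the listed packages are importable in the active environment. The following is a minimal sketch of such a check. The helper name missing_packages is illustrative and not part of the repository, and the import names are assumed to match the package names listed above.

```python
import importlib.util

# Packages named in the prerequisites list (anaconda itself is not importable).
REQUIRED = ["py_entitymatching", "xgboost", "deepmatcher"]


def missing_packages(names):
    """Return the subset of top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed prerequisites are importable.")
```

The check only looks the packages up and does not import them, so it stays fast even for heavy dependencies.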

Data Preparation

Update: Scripts are available to download either the normalized or the non-normalized versions of the training, validation and gold standard sets. Use only one of them: navigate to the src/data/ folder and run e.g. python download_datasets_normalized.py to download the files into the correct locations automatically. The data is then available under data/raw/. The download does not include the corresponding corpus file; if you need it, download it from the project website yourself. Note that the non-normalized data may need additional pre-processing; the experiments were done using the normalized data.

(If you do not want to use the download scripts, download and unzip the WDC LSPC v2 normalized data files into the corresponding folder under data/raw/wdc-lspc/.)
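
The download step can also be driven from Python, for example from a setup script. This is a minimal sketch that assumes it is called with the repository root as repo_root. The function names run_download and list_downloaded_files are hypothetical helpers; only the script name and the folder layout are taken from the description above.

```python
import subprocess
import sys
from pathlib import Path


def run_download(repo_root, script="download_datasets_normalized.py"):
    """Run one download script from inside src/data/, as described above."""
    script_dir = Path(repo_root) / "src" / "data"
    subprocess.run([sys.executable, script], cwd=script_dir, check=True)


def list_downloaded_files(repo_root):
    """Return all files below data/raw/, sorted, as paths relative to it."""
    raw_dir = Path(repo_root) / "data" / "raw"
    return sorted(
        p.relative_to(raw_dir).as_posix()
        for p in raw_dir.rglob("*")
        if p.is_file()
    )
```

Calling run_download(".") from the repository root and then list_downloaded_files(".") gives a quick view of what ended up under data/raw/.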

  1. Run the noise-training-sets notebook (creates the noised training sets)
  2. Run the process-to-magellan and process-to-wordcooc notebooks (prepares the input data for the experiments)
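
To illustrate what a noised training set can look like, the sketch below flips a fixed fraction of binary match labels. This is one common way to simulate label noise and is not necessarily the procedure implemented in the noise-training-sets notebook. The function name flip_labels and its parameters are assumptions made for this example.

```python
import random


def flip_labels(labels, noise_rate, seed=42):
    """Return a copy of binary labels (0/1) with a fraction of them flipped."""
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the noised set reproducible
    n_flip = round(len(labels) * noise_rate)
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]
```

With a fixed seed, repeated runs produce the same noised labels, which keeps experiments at different noise rates comparable.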

Model Learning

Run the run-wordcooc, run-magellan or run-deepmatcher notebooks to replicate the learning curve and label-noise experiments.
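
The notebooks can also be executed without opening them, using jupyter nbconvert. The sketch below builds the commands in the order given above. It assumes that the notebooks carry the .ipynb extension and live in a folder called notebooks; neither is stated here, so adjust the path to the actual layout. The function names are illustrative.

```python
import subprocess
from pathlib import Path

# Notebook names as given in the text; the .ipynb extension is an assumption.
PREPARATION = ["noise-training-sets", "process-to-magellan", "process-to-wordcooc"]
EXPERIMENTS = ["run-wordcooc", "run-magellan", "run-deepmatcher"]


def nbconvert_commands(notebook_dir, names):
    """Build one 'jupyter nbconvert --execute' command per notebook name."""
    commands = []
    for name in names:
        path = Path(notebook_dir) / f"{name}.ipynb"
        commands.append(
            ["jupyter", "nbconvert", "--to", "notebook",
             "--execute", "--inplace", str(path)]
        )
    return commands


def run_all(notebook_dir="notebooks"):
    """Execute the preparation notebooks first, then the experiment notebooks."""
    for cmd in nbconvert_commands(notebook_dir, PREPARATION + EXPERIMENTS):
        subprocess.run(cmd, check=True)  # stop at the first failing notebook
```

Passing only one of the experiment names to nbconvert_commands runs a single matcher instead of all three.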

Best parameters found for deepmatcher optimization on the computers xlarge training set

The best parameter combinations are listed in the file optimized-parameters.txt.

Deepmatcher end-to-end training

To allow gradient updates of the embedding layer, change the line embed.weight.requires_grad = False to embed.weight.requires_grad = True in models/core.py of the deepmatcher package.
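
The same change can be scripted so that it is repeatable across environments. This is a minimal sketch, assuming the installed deepmatcher version contains exactly the line quoted above. The function names are hypothetical, and the script edits the installed package in place after writing a backup copy.

```python
import importlib.util
from pathlib import Path

FROZEN = "embed.weight.requires_grad = False"
TRAINABLE = "embed.weight.requires_grad = True"


def unfreeze_embeddings(source):
    """Return the source text with the embedding layer set to trainable."""
    if FROZEN not in source:
        raise ValueError("expected line not found; the installed version may differ")
    return source.replace(FROZEN, TRAINABLE)


def patch_installed_deepmatcher():
    """Patch models/core.py of the installed deepmatcher package, with backup."""
    spec = importlib.util.find_spec("deepmatcher")
    if spec is None or spec.origin is None:
        raise RuntimeError("deepmatcher is not installed in this environment")
    core_py = Path(spec.origin).parent / "models" / "core.py"
    original = core_py.read_text()
    # Keep the untouched file next to the patched one (core.py.bak).
    core_py.with_name(core_py.name + ".bak").write_text(original)
    core_py.write_text(unfreeze_embeddings(original))
    return core_py
```

Restoring core.py.bak reverts to the default behaviour with a frozen embedding layer.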

Code for building the small, medium, large and xlarge training sets

Additional requirement: textdistance

The notebook sample-training-sets contains the code used for building the four training sets for each product category.

Acknowledgements

Project structure based on Cookiecutter Data Science: https://drivendata.github.io/cookiecutter-data-science/

Contributors

rpeeters85

wdc-lspc-v2's Issues

wdc-lspc-v2.yml not working

Hi,

Could you please help with replicating your notebooks? I tried to create a conda env using the suggested .yml file but anaconda could not resolve the packages.

Could you please help?

Thanks
