ub-mannheim / bbw

Entity linking, entity typing and relation extraction: Matching CSV to a Wikibase instance (e.g., Wikidata) via Meta-lookup

Home Page: https://ub-mannheim.github.io/bbw/

License: MIT License

Languages: Python 85.15%, Jupyter Notebook 14.39%, Shell 0.46%
Topics: wikidata, entity-linking, relation-extraction, knowledge-graph, semantic-table-interpretation, tabular-data-annotation, entity-typing, ontology-matching, schema-matching, wikibase

bbw's Introduction

bbw (boosted by wiki)

  • Annotates tabular data with the entities, types and properties in Wikidata.
  • Easy to use: bbw.annotate().
  • Resolves even tricky spelling mistakes via meta-lookup through SearX.
  • Matches to the up-to-date values in Wikidata without the dump files.
  • Ranked in third place at SemTab2020.

How to use

Import library

from bbw import bbw

The easiest way to annotate the dataframe Y is:

[web_table, url_table, label_table, cpa, cea, cta] = bbw.annotate(Y)

It returns a list of six dataframes. The first three contain the annotations as HTML links, URLs and labels of the Wikidata entities, respectively. Each of them has two more rows than Y; these extra rows hold the annotations for types and properties. The last three dataframes contain the annotations in the format required by the SemTab2020 challenge.

The fastest way to annotate the dataframe Y is:

[cpa_list, cea_list, nomatch] = bbw.contextual_matching(bbw.preprocessing(Y))
[cpa, cea, cta] = bbw.postprocessing(cpa_list, cea_list)

The dataframes cpa, cea and cta contain the annotations in SemTab2020-format. The list nomatch contains the labels which are not matched. The unprocessed and possibly non-unique annotations are in the lists cpa_list and cea_list.
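To make the idea of "possibly non-unique annotations" concrete, the sketch below collapses a list of candidate annotations to one per cell by majority vote. It is only an illustration: the (row, column, entity) tuple layout and the name `majority_vote` are assumptions, and bbw.postprocessing may resolve duplicates differently.

```python
from collections import Counter, defaultdict


def majority_vote(candidates):
    """Keep the most frequent annotation for every (row, column) cell.

    `candidates` is a hypothetical list of (row, column, entity) tuples;
    the real layout of cea_list in bbw may differ.
    """
    votes = defaultdict(Counter)
    for row, col, entity in candidates:
        votes[(row, col)][entity] += 1
    # most_common(1) yields the top entry; ties go to the entity seen first.
    return {cell: counter.most_common(1)[0][0] for cell, counter in votes.items()}


# Illustrative Wikidata-style IDs, not real bbw output.
example = [(1, 0, "Q2119"), (1, 0, "Q2119"), (1, 0, "Q584"), (1, 1, "Q584")]
unique = majority_vote(example)
print(unique)
```

A vote like this is cheap and deterministic, but it ignores any confidence scores the matcher may have produced.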

GUI

If you need to annotate only one table, use the simple GUI:

streamlit run bbw_gui.py

Open the browser at http://localhost:8501 and choose a CSV-file. The annotation process starts automatically. It outputs the six tables of the annotate function.

Try it out online (no SearX support) with this binder link.

CLI

If you need to annotate a few tables, use the CLI-tool:

python3 bbw_cli.py --amount 100 --offset 0
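The two flags suggest that the CLI works on a window of tables. The sketch below shows one plausible reading, in which --offset is the index of the first table and --amount is the number of tables to process. This interpretation, the helper `select_tables` and the file names are assumptions, not the documented behavior of bbw_cli.py.

```python
import argparse


def select_tables(tables, amount, offset):
    """Return `amount` tables starting at index `offset` (assumed semantics)."""
    return tables[offset:offset + amount]


parser = argparse.ArgumentParser(description="Hypothetical table-window selection")
parser.add_argument("--amount", type=int, default=100, help="number of tables to annotate")
parser.add_argument("--offset", type=int, default=0, help="index of the first table")

# Parse an explicit argument list so the sketch runs without a real command line.
args = parser.parse_args(["--amount", "2", "--offset", "1"])
tables = ["a.csv", "b.csv", "c.csv", "d.csv"]  # hypothetical file names
selected = select_tables(tables, args.amount, args.offset)
print(selected)
```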

GNU parallel

If you need to annotate hundreds or thousands of tables, use the script with GNU parallel:

./bbw_parallel.py

Installation

You can use pip to install bbw:

pip install bbw

The latest version can be installed directly from GitHub:

pip install git+https://github.com/UB-Mannheim/bbw

You can test bbw in a virtual environment:

pip install virtualenv
virtualenv testing_bbw
source testing_bbw/bin/activate
pip install bbw
python
from bbw import bbw
[web_table, url_table, label_table, cpa, cea, cta] = bbw.annotate(bbw.pd.DataFrame([['0','1'],['Mannheim','Rhine']]))
print(web_table)
exit()
deactivate

Also install SearX, because bbw performs its meta-lookups through it.

export PORT=80
docker pull searx/searx
docker run --rm -d -v ${PWD}/searx:/etc/searx -p $PORT:8080 -e BASE_URL=http://localhost:$PORT/ searx/searx

SearX then runs at http://localhost:80, and bbw sends GET requests to it.
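As a rough illustration of such a GET request, the sketch below builds a search URL for the local instance using SearX's `/search` endpoint with the `q` and `format` parameters. The helper name `searx_url` is hypothetical, and whether bbw itself asks for JSON output is an assumption; the URL is only constructed here, not sent.

```python
from urllib.parse import urlencode


def searx_url(query, base_url="http://localhost:80/", fmt="json"):
    """Build the URL of a GET search request against a local SearX instance."""
    # urlencode escapes the query, e.g. spaces become '+'.
    return base_url.rstrip("/") + "/search?" + urlencode({"q": query, "format": fmt})


# A deliberately misspelled label, the kind of input a meta-lookup can help with.
url = searx_url("Manheim Rhine")
print(url)
```

The resulting URL can be passed to any HTTP client once the container is running.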

Citing

If you find bbw useful in your work, a proper reference would be:

@inproceedings{2020_bbw,
  author    = {Renat Shigapov and Philipp Zumstein and Jan Kamlah and Lars Oberl{\"a}nder and J{\"o}rg Mechnich and Irene Schumm},
  title     = {bbw: {M}atching {CSV} to {W}ikidata via {M}eta-lookup},
  booktitle = {SemTab@ISWC 2020},
  url = {http://ceur-ws.org/Vol-2775/paper2.pdf},
  volume = {2775},
  pages = {17-26},
  publisher = {CEUR-WS.org},
  year = {2020}
}

[paper] [presentation] [BERD@BW]

SemTab2020

The library was designed, implemented and tested during SemTab2020. It achieved its best scores in the final (4th) round, on the automatically generated dataset:

Task  F1-score  Precision  Rank
CPA   0.995     0.996      2
CTA   0.980     0.980      2
CEA   0.978     0.984      4
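The table lists F1-score and precision but not recall. Assuming the usual definition F1 = 2PR / (P + R), the recall implied by each row can be recovered as R = F1 * P / (2P - F1). The sketch below does this for the three tasks; because the published values are rounded, the results are approximations rather than official figures.

```python
def recall_from_f1(f1, precision):
    """Solve F1 = 2PR / (P + R) for the recall R."""
    return f1 * precision / (2 * precision - f1)


# (F1-score, precision) pairs copied from the table above.
scores = {"CPA": (0.995, 0.996), "CTA": (0.980, 0.980), "CEA": (0.978, 0.984)}
implied_recall = {
    task: round(recall_from_f1(f1, p), 3) for task, (f1, p) in scores.items()
}
print(implied_recall)
```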

bbw's People

Contributors

jmechnich, shigapov, stweil, zuphilip


bbw's Issues

Extracting all candidate scores for the CEA task.

Dear authors,

Thank you for developing this system. We wanted to capture all the candidates, together with the scores they were given, when applying it. We have therefore made some changes, which are available at the following link:
https://github.com/Yansera/bbw_scoring_extractor/tree/main/BaselineBBW

These modifications include:

  1. Line 1143: store the scores of all candidate entities.
  2. Line 1228: return all candidate entities directly.

Could you please verify that our changes are correct and that the final output is mapped?

Thanks in advance

Add option for selecting a language to reconcile with

I tried using this tool to reconcile a list of about 100 church denominations (a gist can be found here). Unfortunately, the results were pretty mediocre (only around 5 got matched) because the list is in Dutch while the matching is done using only English labels.

I think it would be a very useful addition to be able to set the language code. This is very easy for both the OpenRefine reconciliation endpoint and the WD query service. Also see my wdreconcile tool for some inspiration on how something like that could be done.
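To illustrate how a language code could be passed through, here is a minimal sketch that builds a Wikidata `wbsearchentities` request with the `language` parameter set to Dutch. It is not taken from bbw: the helper `entity_search_url` and the example label are hypothetical, and whether bbw uses this API action at all is an assumption. The URL is only constructed, not requested.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def entity_search_url(label, language="en"):
    """Build a wbsearchentities URL that searches labels in `language`."""
    params = {
        "action": "wbsearchentities",
        "search": label,
        "language": language,  # language of the labels and aliases to search
        "uselang": language,   # language used for the returned descriptions
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


url = entity_search_url("Protestantse Kerk in Nederland", language="nl")
print(url)
```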
