SURE is an unsupervised system for relationship extraction that relies on sentence-level distributional semantics (i.e., sentence encoding). For more details, please refer to the paper.
You need to have Python 3.6 or above and the following libraries installed:
NLTK: http://www.nltk.org/
scikit-learn: https://scikit-learn.org/stable/
NumPy: https://numpy.org/
Sentence-Transformers: https://www.sbert.net/
Pandas: https://pandas.pydata.org/
PyTorch: https://pytorch.org/
You can install them with the following command:
pip install -r requirements.txt
To run the relation extraction system use the following command:
python main.py corpus entity_type1 entity_type2
between_length: 6 # Maximum number of tokens between two entities
before_after_window: 3 # Maximum number of tokens before the first entity and after the second entity
similiraty: 0.25 # Cosine similarity threshold during the first iteration
top_similar: 15 # Maximum number of top similar sentences to the query term using cosine similarity
query_term: born in # Natural language representation for relationship birthPlace
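To illustrate how a similarity threshold and a top-k limit like the ones above can interact, here is a minimal sketch. It is not SURE's actual code: the function name `rank_by_cosine` is hypothetical, the toy vectors stand in for real sentence embeddings (which would come from Sentence-Transformers), and treating the threshold as inclusive is an assumption.

```python
import numpy as np


def rank_by_cosine(query_vec, sentence_vecs, threshold=0.25, top_similar=15):
    """Hypothetical helper: return (index, score) pairs, best first.

    Keeps only sentences whose cosine similarity to the query is at least
    `threshold`, then caps the result at `top_similar` entries.
    Assumes all vectors are non-zero.
    """
    query = np.asarray(query_vec, dtype=float)
    sents = np.asarray(sentence_vecs, dtype=float)

    # Cosine similarity is the dot product of L2-normalised vectors.
    query = query / np.linalg.norm(query)
    sents = sents / np.linalg.norm(sents, axis=1, keepdims=True)
    scores = sents @ query

    # Sort indices by descending similarity, then apply the threshold.
    order = np.argsort(-scores)
    kept = [(int(i), float(scores[i])) for i in order if scores[i] >= threshold]
    return kept[:top_similar]


# Toy 2-D "embeddings"; in practice these would be encoder outputs.
toy_query = [1.0, 0.0]
toy_sentences = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
ranked = rank_by_cosine(toy_query, toy_sentences, threshold=0.25, top_similar=15)
```

With these toy vectors, only the first and third sentences pass the 0.25 threshold; lowering `top_similar` to 1 would keep just the best match.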
The corpus contains one sentence per line, with tags identifying the type of each named entity, e.g.:
<ORG> Consolidated Edison </ORG>, based in <LOC> New York </LOC>, generated more than $7 billion in annual revenue.
The social media platform <ORG> Facebook, Inc. </ORG> announced it was acquiring <ORG> WhatsApp </ORG>, its largest acquisition to date.
<LOC> Herzogenaurach </LOC> is the home of goods company <ORG> Adidas </ORG>.
Two named entity types are provided initially for a particular relation. For example, for the relation headquarterIn we provide ORG (Organization) and LOC (Location).
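The tagged format and the entity-type pair can be sketched in a few lines of Python. This is only an illustration, not SURE's own parser: `find_entities` and `candidate_pairs` are hypothetical names, tokens are split on whitespace, and the sketch assumes the first entity type appears before the second in the sentence. It also applies the `between_length` limit described above.

```python
import re

# Matches "<TYPE> text </TYPE>"; the closing tag must repeat the opening type.
ENTITY_RE = re.compile(r"<([A-Z]+)>\s*(.*?)\s*</\1>")


def find_entities(sentence):
    """Return (type, text, start, end) for every tagged entity in a sentence."""
    return [(m.group(1), m.group(2), m.start(), m.end())
            for m in ENTITY_RE.finditer(sentence)]


def candidate_pairs(sentence, type1, type2, between_length=6):
    """Hypothetical helper: list (entity1, entity2, tokens_between) triples.

    Only pairs where a `type1` entity precedes a `type2` entity are kept,
    and only if at most `between_length` whitespace-separated tokens
    lie between the two entities.
    """
    entities = find_entities(sentence)
    pairs = []
    for i, (t1, text1, _, end1) in enumerate(entities):
        for t2, text2, start2, _ in entities[i + 1:]:
            if (t1, t2) != (type1, type2):
                continue
            between = sentence[end1:start2].split()
            if len(between) <= between_length:
                pairs.append((text1, text2, between))
    return pairs


sample = ("<ORG> Consolidated Edison </ORG>, based in <LOC> New York </LOC>, "
          "generated more than $7 billion in annual revenue.")
pairs = candidate_pairs(sample, "ORG", "LOC", between_length=6)
```

For the sample sentence, the ORG and LOC entities are separated by three tokens (the comma counts as one under whitespace splitting), so the pair is kept with the default limit of 6.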
Dataset | Download |
---|---|
NYT-FB dataset | Download |
Wikipedia_Wikidata dataset | Download |
English Gigaword | Download |
python main.py "text_corpus" PER LOC 2
- Manzoor Ali (DICE, Paderborn University)
- Muhammad Saleem (AKSW, University of Leipzig)
- Axel-Cyrille Ngonga Ngomo (DICE, Paderborn University)