davidsbatista / breds

"Bootstrapping Relationship Extractors with Distributional Semantics" (Batista et al., 2015) in EMNLP'15 - Python implementation

Home Page: https://www.aclweb.org/anthology/D15-1056

License: GNU Lesser General Public License v3.0

Python 99.40% Makefile 0.60%
semantic-relationship-extraction distributional-semantics bootstrapping emnlp nlp vector-representations natural-language-processing

breds's Introduction

BREDS

BREDS extracts relationships using a bootstrapping/semi-supervised approach. It relies on an initial set of seeds, i.e., pairs of named entities representing the relationship type to be extracted.

The algorithm expands the initial set of seeds using distributional semantics to generalize the relationship while limiting semantic drift.
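
As a mental model (not the actual BREDS implementation), the bootstrapping loop can be sketched in a few lines of self-contained Python; the exact-match "scoring" below is only an illustrative stand-in for the distributional-semantics similarity and confidence computation that BREDS actually uses:

# Toy sketch of the bootstrapping loop: seeds -> patterns -> new tuples -> new seeds.
# Exact context matching stands in for BREDS' similarity-based scoring.
def bootstrap(corpus, seeds, iterations=4, min_confidence=0.6):
    extracted = {}
    for _ in range(iterations):
        # 1. collect the contexts of sentences whose entity pair is a current seed
        patterns = {context for pair, context in corpus if pair in seeds}
        # 2. score every candidate pair against the learned patterns
        for pair, context in corpus:
            confidence = 1.0 if context in patterns else 0.0
            extracted[pair] = max(confidence, extracted.get(pair, 0.0))
        # 3. promote confident extractions to new seeds (the step where
        #    semantic drift has to be kept under control)
        seeds |= {pair for pair, conf in extracted.items() if conf >= min_confidence}
    return extracted

# toy corpus: (entity pair, words between the entities)
corpus = [
    (("Soundcloud", "Berlin"), "is based in"),
    (("Allianz", "Munich"), "is based in"),
    (("Allianz", "Munich"), "based in"),
    (("Pfizer", "New York City"), "based in"),
]
print(bootstrap(corpus, {("Soundcloud", "Berlin")}))

Starting from a single seed, the second iteration already picks up the "based in" context through the newly promoted Allianz/Munich pair, which is the expansion behaviour described above.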

Extracting companies' headquarters:

The input text needs to have the named entities tagged, as shown in the example below:

The tech company <ORG>Soundcloud</ORG> is based in <LOC>Berlin</LOC>, capital of Germany.
<ORG>Pfizer</ORG> says it has hired <ORG>Morgan Stanley</ORG> to conduct the review.
<ORG>Allianz</ORG>, based in <LOC>Munich</LOC>, said net income rose to EUR 1.32 billion.
<LOC>Switzerland</LOC> and <LOC>South Africa</LOC> are co-chairing the meeting.
<LOC>Ireland</LOC> beat <LOC>Italy</LOC> , then lost 43-31 to <LOC>France</LOC>.
<ORG>Pfizer</ORG>, based in <LOC>New York City</LOC> , employs about 90,000 workers.
<PER>Burton</PER> 's engine passed <ORG>NASCAR</ORG> inspection following the qualifying session.

We need to give seeds to bootstrap the extraction process, specifying the type of each named entity and example relationship pairs that should also be present in the input text:

e1:ORG
e2:LOC

Lufthansa;Cologne
Nokia;Espoo
Google;Mountain View
DoubleClick;New York
SAP;Walldorf

To run a simple example, download the following files:

- afp_apw_xin_embeddings.bin
- sentences_short.txt.bz2
- seeds_positive.txt

Install BREDS using pip:

pip install breds

Run the following command:

breds --word2vec=afp_apw_xin_embeddings.bin --sentences=sentences_short.txt --positive_seeds=seeds_positive.txt --similarity=0.6 --confidence=0.6

After the process terminates, an output file relationships.jsonl is generated containing the extracted relationships.

You can pretty-print its content to the terminal with jq '.' < relationships.jsonl:

{
  "entity_1": "Medtronic",
  "entity_2": "Minneapolis",
  "confidence": 0.9982486865148862,
  "sentence": "<ORG>Medtronic</ORG> , based in <LOC>Minneapolis</LOC> , is the nation 's largest independent medical device maker . ",
  "bef_words": "",
  "bet_words": ", based in",
  "aft_words": ", is",
  "passive_voice": false
}

{
  "entity_1": "DynCorp",
  "entity_2": "Reston",
  "confidence": 0.9982486865148862,
  "sentence": "Because <ORG>DynCorp</ORG> , headquartered in <LOC>Reston</LOC> , <LOC>Va.</LOC> , gets 98 percent of its revenue from government work .",
  "bef_words": "Because",
  "bet_words": ", headquartered in",
  "aft_words": ", Va.",
  "passive_voice": false
}

{
  "entity_1": "Handspring",
  "entity_2": "Silicon Valley",
  "confidence": 0.893486865148862,
  "sentence": "There will be more firms like <ORG>Handspring</ORG> , a company based in <LOC>Silicon Valley</LOC> that looks as if it is about to become a force in handheld computers , despite its lack of machinery .",
  "bef_words": "firms like",
  "bet_words": ", a company based in",
  "aft_words": "that looks",
  "passive_voice": false
}
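
Since the output is one JSON object per line, it is easy to post-process with standard tools. For example, a small Python snippet (plain json module, nothing BREDS-specific) that keeps only the highest-confidence extractions from the relationships.jsonl file above:

import json

# read the JSONL output produced by the command above
with open("relationships.jsonl", encoding="utf-8") as f:
    relationships = [json.loads(line) for line in f if line.strip()]

# keep only extractions with confidence >= 0.9 and print them as a small table
for r in (r for r in relationships if r["confidence"] >= 0.9):
    print(f'{r["entity_1"]}\t{r["entity_2"]}\t{r["bet_words"]}\t{r["confidence"]:.3f}')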

BREDS has several parameters to tune the extraction process. The example above uses the default values, but these can be set in the configuration file parameters.cfg:

max_tokens_away=6           # maximum number of tokens between the two entities
min_tokens_away=1           # minimum number of tokens between the two entities
context_window_size=2       # number of tokens to the left and right of each entity

alpha=0.2                   # weight of the BEF context in the similarity function
beta=0.6                    # weight of the BET context in the similarity function
gamma=0.2                   # weight of the AFT context in the similarity function

wUpdt=0.5                   # < 0.5 trusts new examples less on each iteration
number_iterations=4         # number of bootstrap iterations
wUnk=0.1                    # weight given to unknown extracted relationship instances
wNeg=2                      # weight given to extracted relationship instances matching the negative seeds
min_pattern_support=2       # minimum number of instances in a cluster to be considered a pattern

and passed with the argument --config=parameters.cfg.
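
To make the role of alpha, beta and gamma concrete: the similarity between a candidate tuple and a pattern is a weighted sum of the cosine similarities of their BEF, BET and AFT context vectors, as described in the paper. A minimal sketch over plain NumPy arrays (illustrative only, not BREDS' internal API):

import numpy as np

def context_similarity(tuple_vecs, pattern_vecs, alpha=0.2, beta=0.6, gamma=0.2):
    # tuple_vecs / pattern_vecs: (bef, bet, aft) embedding vectors
    def cos(a, b):
        norm = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b) / norm if norm else 0.0
    return sum(w * cos(t, p)
               for w, t, p in zip((alpha, beta, gamma), tuple_vecs, pattern_vecs))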

The full command line parameters are:

  -h, --help            show this help message and exit
  --config CONFIG       file with bootstrapping configuration parameters
  --word2vec WORD2VEC   an embedding model based on word2vec, in the format of a .bin file
  --sentences SENTENCES
                        a text file with a sentence per line, and with at least two entities per sentence
  --positive_seeds POSITIVE_SEEDS
                        a text file with a seed per line, in the format, e.g.: 'Nokia;Espoo'
  --negative_seeds NEGATIVE_SEEDS
                        a text file with a seed per line, in the format, e.g.: 'Microsoft;San Francisco'
  --similarity SIMILARITY
                        the minimum similarity between tuples and patterns to be considered a match
  --confidence CONFIDENCE
                        the minimum confidence score for a match to be considered a true positive
  --number_iterations NUMBER_ITERATIONS
                        the number of bootstrap iterations to run

Please refer to the References and Citations section for details on the parameters.
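
For example, a run that also uses negative seeds and a custom configuration might look like this (seeds_negative.txt is a hypothetical file in the same Entity;Entity format as the positive seeds):

breds --word2vec=afp_apw_xin_embeddings.bin --sentences=sentences_short.txt --positive_seeds=seeds_positive.txt --negative_seeds=seeds_negative.txt --similarity=0.6 --confidence=0.6 --number_iterations=4 --config=parameters.cfg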

In the first step, BREDS pre-processes the input file sentences.txt, generating word vector representations of the relationships (i.e., processed_tuples.pkl).

This is done so that you can experiment with different seed examples without having to repeat the process of generating the word vector representations. Just pass the argument --sentences=processed_tuples.pkl instead to skip this generation step.
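
In practice the two-step workflow looks like this (same files as the quick-start example above; other_seeds.txt is a hypothetical alternative seeds file):

# first run: parses sentences_short.txt and writes processed_tuples.pkl
breds --word2vec=afp_apw_xin_embeddings.bin --sentences=sentences_short.txt --positive_seeds=seeds_positive.txt --similarity=0.6 --confidence=0.6

# later runs: reuse the pre-processed tuples and just try different seeds
breds --word2vec=afp_apw_xin_embeddings.bin --sentences=processed_tuples.pkl --positive_seeds=other_seeds.txt --similarity=0.6 --confidence=0.6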


References and Citations

Semi-Supervised Bootstrapping of Relationship Extractors with Distributional Semantics, EMNLP'15

@inproceedings{batista-etal-2015-semi,
    title = "Semi-Supervised Bootstrapping of Relationship Extractors with Distributional Semantics",
    author = "Batista, David S.  and Martins, Bruno  and Silva, M{\'a}rio J.",
    booktitle = "Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing",
    month = sep,
    year = "2015",
    address = "Lisbon, Portugal",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/D15-1056",
    doi = "10.18653/v1/D15-1056",
    pages = "499--504",
}

"Large-Scale Semantic Relationship Extraction for Information Discovery" - Chapter 5, David S Batista, Ph.D. Thesis

@incollection{phd-dsbatista2016,
    title = {Large-Scale Semantic Relationship Extraction for Information Discovery},
    author = {Batista, David S.},
    school = {Instituto Superior Técnico, Universidade de Lisboa},
    year = {2016}
}

Presenting BREDS at PyData Berlin 2017


Contributing to BREDS

Improvements, new features, and bug fixes are welcome. If you wish to participate in the development of BREDS, please read the following guidelines.

The contribution process at a glance

  1. Preparing the development environment
  2. Code away!
  3. Continuous Integration
  4. Submit your changes by opening a pull request

Small fixes and additions can be submitted directly as pull requests, but larger changes should be discussed in an issue first. You can expect a reply within a few days, but please be patient if it takes a bit longer.

Preparing the development environment

Make sure you have Python 3.9 installed on your system.

macOS

brew install python@3.9
python3.9 -m pip install --user --upgrade pip
python3.9 -m pip install virtualenv

Clone the repository and prepare the development environment:

git clone git@github.com:davidsbatista/BREDS.git
cd BREDS            
python3.9 -m virtualenv venv         # create a new virtual environment for development using python3.9 
source venv/bin/activate             # activate the virtual environment
pip install -r requirements_dev.txt  # install the development requirements
pip install -e .                     # install BREDS in editable mode

Continuous Integration

BREDS runs continuous integration (CI) on all pull requests. This means that if you open a pull request (PR), a full check suite is run on it:

  • The code is formatted using black and isort
  • Unused imports are auto-removed using pycln
  • Linting is done using pylint and flake8
  • Type checking is done using mypy
  • Tests are run using pytest

Nevertheless, if you prefer to run the tests and formatting checks locally, you can do so with:

make all
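
If make is not available, the individual tools can be invoked directly; the exact flags and targets live in the Makefile, so the commands below are only an approximation (pylint additionally needs the package path as an argument):

black .
isort .
pycln .
flake8 .
mypy .
pytest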

Opening a Pull Request

Every PR should be accompanied by a short description of the changes, including:

  • Impact and motivation for the changes
  • Any open issues that are closed by this PR

Give a ⭐️ if this project helped you!

breds's People

Contributors

bpshaver, davidsbatista, hongtao-lin, ngarneau, usrnameu1


breds's Issues

No negative seed update

It seems like the negative seeds aren't updated.

for t in list(self.candidate_tuples.keys()):
    if t.confidence >= self.config.instance_confidence:
        seed = Seed(t.e1, t.e2)
        self.config.positive_seed_tuples.add(seed)

Only the positive seeds are updated.

Clarification of pattern.py line 18

Line 18: if tuple is not None:
Line 19: self.tuples.add(t)

tuple is the name of Python's built-in tuple class, so the condition in Line 18 always evaluates to True.

Should it be:

if t is not None:
self.tuples.add(t)

? Thank you.

Several errors when running /automatic-evaluation/large-scale-evaluation-freebase.py

Hi again !

The errors mentioned in the title are due to the use of Sentence.py (from BREDS.Sentence import Sentence at line 15 of large-scale-evaluation-freebase.py).
At the beginning I assumed that using the module Sentence.py located in /BREDS/ would work. But in fact, I found clues that you may have used a different version of this module when developing large-scale-evaluation-freebase.py.

The clues are :

  • In the /BREDS/Sentence.py file, you don't handle the case where the pos_tagger and config arguments of __init__ are None (which happens to be the case in the call at line 141 of large-scale-evaluation-freebase.py)
  • In large-scale-evaluation-freebase.py, in process_corpus(), at line 144 you wrote tokens = word_tokenize(r.between) even though r.between is already a PoS-tagged string (see line 199 of /BREDS/Sentence.py)

Possible error in selectivity calculation

In the function pattern.update_selectivity, if you hit s.e1 == t.e1 but not s.e2 == t.e2, you will have matched_both = False and matched_e1 = True, meaning that you will increment the negative count of the pattern. BUT you will also hit
if matched_both is False: self.unknown += 1.

This will increment both the negative and the unknown counts.

Question about passive voice

In your paper "Semi-Supervised Bootstrapping of Relationship Extractors with Distributional Semantics", you said "BREDS also tries to identify the passive voice using part-of-speech (PoS) tags ,..., it will swap the order of the entities when producing a relational triple".

But I can't find this process in your code. If BREDS detects the passive voice and swaps the entities, what about the words before the first entity and the words after the second entity? Can you give me some advice? Thank you!

Questions about usage of negative seeds

According to the paper Semi-Supervised Bootstrapping of Relationship Extractors with Distributional Semantics, it looks like only positive seeds are used in the algorithm. Would you mind explaining more about the usage of negative seeds here?

Missing requirements.txt

The README states that there is a requirements.txt file, but it's missing from the repo. A requirements.txt with the right versions of the appropriate libraries would be helpful!

Missing requirements.txt file

Hi!
I can't find the requirements.txt file that you mention in your readme.md.
This causes an error at the call of the method jellyfish.jaro_winkler at lines 787-789 of the script large-scale-evaluation-freebase.py located in /automatic-evaluation/. I think my version of jellyfish is too recent compared to yours.
If you still have it, could you please add it to the repo?

By the way, I fixed this script for Windows users (there is a difference in the behavior of the multiprocessing package between Windows and Linux; see http://rhodesmill.org/brandon/2010/python-multiprocessing-linux-windows/ if you are interested). Tell me if you would like to see what I have changed.

Thank you for your work, I really appreciate it !

Feature Extraction

The process of feature extraction can be parallelized by first loading all the files into a Queue and then launching separate processes/threads to generate Tuples from each sentence.

self.config.min_pattern_support in breds_parallel.py clarification

Hi, I'd like to check on the variable self.config.min_pattern_support.

In line 293 (breds_parallel.py), the check for len(p.tuples) > self.config.min_pattern_support occurs under the if condition of len(self.patterns) == 0

However, under the else case, there is no such check to remove patterns with fewer tuples than the minimum support. Is this deliberate by design? Otherwise, should the check also be done when len(self.patterns) != 0?

Thank you.

Commands to replicate F1-score of BREDS

Hi there,

Can you please include the exact commands with parameters (including running the pre_processing_KB scripts: easy_freebase_clean.py, select_sentences.py and index_whoosh.py) to replicate the evaluation numbers of BREDS reported in the paper?
Thanks for adding some notes to the README for automatic-evaluation, but it would really help us if you could also include the exact commands.
[I would really appreciate a timely reply]

Thank you!

Calculating tuples confidence

Hi,

Thank you for sharing your (clean) code. In the parallel version of BREDS, when computing the tuples' confidence, why are you taking a weighted combination of the confidence scores from the last and the current iteration? I understand self.config.wUpdt is a hyperparameter, but why is it needed only in the parallel version?

Clarification regarding Automatic-Evaluation

Hi,

Sorry for bothering again regarding the automatic-evaluation but this is important for me to know.
In large-scale-evaluation-freebase.py, while loading Freebase, the whole KB is loaded irrespective of the relation type. Now consider string_matching_parallel(): it only checks whether e1 and e2 occur in the KB, but not for which relation. This means that if my model predicts a Founder-Of relation between Marissa Mayer and Google, it will be counted as correct because of this line:
if len(database_1[(r.ent1.decode("utf8"), r.ent2.decode("utf8"))]) > 0:
But the KB actually contained Marissa Mayer Employment-history Google.
Is my understanding correct that it only checks whether the two entities occur in the database, but not for which relation? (Please indicate yes/no for the Marissa Mayer Founder-Of Google example.)

Many Thanks!

Data request

Thank you for your work. My Google account is invalid; could you release the data another way? Thanks!

the system is not working with tags other than "LOC", "ORG" and "PER"

Hello:

When I want to extract a relation between the entity types 'ORG' and 'Date', the system doesn't extract any relations. For example:
<ORG>NOKIA</ORG> founded in <Date>1865</Date>
However, when I change the 'Date' tag to 'LOC' or 'PER' it works perfectly. For example:
<ORG>NOKIA</ORG> founded in <LOC>1865</LOC>

NOTE: I changed the negative and positive seed files (e1 to ORG and e2 to Date), but the system still cannot extract any relation between ORG and Date.

How do I place the downloaded Word2VEc model?

Hi, I'm really sorry to bother you with some simple questions. Where should I place the downloaded word2vec model?

FileNotFoundError: [Errno 2] No such file or directory: 'afp_apw_xin_embeddings.bin'

pattern2vector_sum function in tuple.py retrieving character word vector

In tuple.py, lines 72, 78 and 79 pass bet_filtered, bef_no_tags and aft_no_tags to the function pattern2vector_sum. Note that these variables are only the literal words (without PoS tags).

However, at line 87, the word vector lookup appears to retrieve the vector for the zero-th character of the token passed into this function, instead of the vector for the entire word.

i.e.
line 87 :: vector = config.word2vec[t[0].strip()]

Changing line 87 to
vector = config.word2vec[t.strip()]

works for me.
