
BLINK's Introduction

BLINK logo

BLINK is an Entity Linking python library that uses Wikipedia as the target knowledge base.

The process of linking entities to Wikipedia is also known as Wikification.

News

  • (September 2020) added ELQ - end-to-end entity linking on questions
  • (3 July 2020) added FAISS support in BLINK - efficient exact/approximate retrieval

BLINK architecture

The BLINK architecture is described in the following paper:

@inproceedings{wu2019zero,
 title={Zero-shot Entity Linking with Dense Entity Retrieval},
 author={Wu, Ledell and Petroni, Fabio and Josifoski, Martin and Riedel, Sebastian and Zettlemoyer, Luke},
 booktitle={EMNLP},
 year={2020}
}

https://arxiv.org/pdf/1911.03814.pdf

In a nutshell, BLINK uses a two-stage approach for entity linking, based on fine-tuned BERT architectures. In the first stage, BLINK performs retrieval in a dense space defined by a bi-encoder that independently embeds the mention context and the entity descriptions. Each candidate is then examined more carefully by a cross-encoder that concatenates the mention and entity text. BLINK achieves state-of-the-art results on multiple datasets.
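
As a rough illustration of the two stages (a hypothetical sketch, not BLINK's actual API; all names below are made up):

import numpy as np

def bi_encoder_retrieve(mention_vec, entity_matrix, k=10):
    # Stage 1 (sketch): the bi-encoder embeds mention and entities
    # independently, so retrieval reduces to a dot product against a
    # precomputed matrix of entity encodings, keeping the top-k candidates.
    scores = entity_matrix @ mention_vec          # shape: (num_entities,)
    top_k = np.argsort(-scores)[:k]
    return top_k, scores[top_k]

def cross_encoder_rerank(mention_text, candidate_texts, score_fn):
    # Stage 2 (sketch): the cross-encoder sees mention and entity text
    # together, so each of the few surviving candidates is scored jointly
    # and the list is re-ranked by that finer-grained score.
    return sorted(
        candidate_texts,
        key=lambda entity_text: score_fn(mention_text + " [SEP] " + entity_text),
        reverse=True,
    )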

ELQ architecture

ELQ does end-to-end entity linking on questions. The ELQ architecture is described in the following paper:

@inproceedings{li2020efficient,
 title={Efficient One-Pass End-to-End Entity Linking for Questions},
 author={Li, Belinda Z. and Min, Sewon and Iyer, Srinivasan and Mehdad, Yashar and Yih, Wen-tau},
 booktitle={EMNLP},
 year={2020}
}

https://arxiv.org/pdf/2010.02413.pdf

For more detail on how to run ELQ, refer to the ELQ README.

Use BLINK

1. Create conda environment and install requirements

(optional) It might be a good idea to use a separate conda environment. It can be created by running:

conda create -n blink37 -y python=3.7 && conda activate blink37
pip install -r requirements.txt

2. Download the BLINK models

The BLINK pretrained models can be downloaded using the following script:

chmod +x download_blink_models.sh
./download_blink_models.sh

We additionally provide a FAISS indexer in BLINK, which enables efficient exact/approximate retrieval for the bi-encoder model.

To build and save the FAISS (exact search) index yourself, run:

python blink/build_faiss_index.py --output_path models/faiss_flat_index.pkl
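
For reference, the sketch below shows how a flat inner-product FAISS index over the precomputed entity encodings could be built directly with the faiss library. This is a minimal sketch, not the build_faiss_index.py implementation: the output file name is an assumption, and the resulting raw FAISS index is not necessarily in the pickled format that --index_path expects, so the provided script remains the supported route.

import faiss
import numpy as np
import torch

# Minimal sketch (assumes all_entities_large.t7 holds a single 2-D tensor of
# entity encodings, as downloaded by download_blink_models.sh).
candidate_encoding = torch.load("models/all_entities_large.t7", map_location="cpu")
vectors = np.ascontiguousarray(candidate_encoding.numpy(), dtype=np.float32)

index = faiss.IndexFlatIP(vectors.shape[1])  # exact inner-product search
index.add(vectors)
faiss.write_index(index, "models/faiss_flat_index.faiss")  # file name is illustrative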

3. Use BLINK interactively

A quick way to explore BLINK's linking capabilities is through the main_dense interactive script. BLINK uses Flair for Named Entity Recognition (NER) to obtain entity mentions from input text, and then runs entity linking.

python blink/main_dense.py -i

Fast mode: in fast mode the model only uses the bi-encoder, which is much faster (accuracy drops slightly; see details in the "Benchmarking BLINK" section).

python blink/main_dense.py -i --fast

To run BLINK with a saved FAISS index, run:

python blink/main_dense.py --faiss_index flat --index_path models/faiss_flat_index.pkl

or

python blink/main_dense.py --faiss_index hnsw --index_path models/faiss_hnsw_index.pkl

Example:

Bert and Ernie are two Muppets who appear together in numerous skits on the popular children's television show of the United States, Sesame Street.

Output:

Note: passing the --show_url argument will show the Wikipedia URL of each entity. The id number displayed corresponds to the order of entities in the entity.jsonl file downloaded by ./download_blink_models.sh (starting from 0). The entity.jsonl file contains one entity per row (including its Wikipedia URL, title, text, etc.).
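
For example, here is a minimal sketch of looking up the entity behind a displayed id, assuming (as the note above says) that each line of entity.jsonl is a JSON object with fields such as "url" and "title":

import json

def entity_at(idx, path="models/entity.jsonl"):
    # The displayed id is simply the 0-based line number in entity.jsonl.
    with open(path) as f:
        for i, line in enumerate(f):
            if i == idx:
                return json.loads(line)
    raise IndexError(f"no entity with id {idx}")

record = entity_at(0)
print(record.get("title"), record.get("url"))  # field names assumed from the note above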

4. Use BLINK in your codebase

pip install -e git+git@github.com:facebookresearch/BLINK#egg=BLINK
import blink.main_dense as main_dense
import argparse

models_path = "models/" # the path where you stored the BLINK models

config = {
    "test_entities": None,
    "test_mentions": None,
    "interactive": False,
    "top_k": 10,
    "biencoder_model": models_path+"biencoder_wiki_large.bin",
    "biencoder_config": models_path+"biencoder_wiki_large.json",
    "entity_catalogue": models_path+"entity.jsonl",
    "entity_encoding": models_path+"all_entities_large.t7",
    "crossencoder_model": models_path+"crossencoder_wiki_large.bin",
    "crossencoder_config": models_path+"crossencoder_wiki_large.json",
    "fast": False, # set this to be true if speed is a concern
    "output_path": "logs/" # logging directory
}

args = argparse.Namespace(**config)

models = main_dense.load_models(args, logger=None)

data_to_link = [ {
                    "id": 0,
                    "label": "unknown",
                    "label_id": -1,
                    "context_left": "".lower(),
                    "mention": "Shakespeare".lower(),
                    "context_right": "'s account of the Roman general Julius Caesar's murder by his friend Brutus is a meditation on duty.".lower(),
                },
                {
                    "id": 1,
                    "label": "unknown",
                    "label_id": -1,
                    "context_left": "Shakespeare's account of the Roman general".lower(),
                    "mention": "Julius Caesar".lower(),
                    "context_right": "'s murder by his friend Brutus is a meditation on duty.".lower(),
                }
                ]

_, _, _, _, _, predictions, scores, = main_dense.run(args, None, *models, test_data=data_to_link)
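
A minimal usage sketch of the returned values, assuming predictions holds the top-k entity titles for each mention in data_to_link and scores holds the matching scores:

# Sketch: iterate over mentions and their top-ranked prediction.
for sample, preds, scs in zip(data_to_link, predictions, scores):
    print(f"{sample['mention']} -> {preds[0]} (score {scs[0]:.2f})")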

Benchmarking BLINK

We provide scripts to benchmark BLINK against popular Entity Linking datasets. Note that our scripts evaluate BLINK in a full Wikipedia setting, that is, the BLINK entity library contains all Wikipedia pages.

To benchmark BLINK run the following commands:

./scripts/get_train_and_benchmark_data.sh
python scripts/create_BLINK_benchmark_data.py
python blink/run_benchmark.py

The following table summarizes the performance of BLINK for the considered datasets.

dataset | biencoder accuracy (fast mode) | biencoder recall@10 | biencoder recall@30 | biencoder recall@100 | crossencoder normalized accuracy | overall unnormalized accuracy | support
AIDA-YAGO2 testa | 0.8145 | 0.9425 | 0.9639 | 0.9826 | 0.8700 | 0.8212 | 4766
AIDA-YAGO2 testb | 0.7951 | 0.9238 | 0.9487 | 0.9663 | 0.8669 | 0.8027 | 4446
ACE 2004 | 0.8443 | 0.9795 | 0.9836 | 0.9836 | 0.8870 | 0.8689 | 244
aquaint | 0.8662 | 0.9618 | 0.9765 | 0.9897 | 0.8889 | 0.8588 | 680
clueweb - WNED-CWEB (CWEB) | 0.6747 | 0.8223 | 0.8609 | 0.8868 | 0.826 | 0.6825 | 10491
msnbc | 0.8428 | 0.9303 | 0.9546 | 0.9676 | 0.9031 | 0.8509 | 617
wikipedia - WNED-WIKI (WIKI) | 0.7976 | 0.9347 | 0.9546 | 0.9776 | 0.8609 | 0.8067 | 6383
TAC-KBP 2010 [1] | 0.8898 | 0.9549 | 0.9706 | 0.9843 | 0.9517 | 0.9087 | 1019

[1] Licensed dataset available here.

The BLINK knowledge base

The BLINK knowledge base (entity library) is based on the 2019/08/01 Wikipedia dump, downloadable in its raw format from http://dl.fbaipublicfiles.com/BLINK/enwiki-pages-articles.xml.bz2

BLINK with solr as IR system

The first version of BLINK uses an Apache Solr-based information retrieval system in combination with a BERT-based cross-encoder. This IR-based version is now deprecated, since it is outperformed by the current BLINK architecture. If you are interested in the old version, please refer to this README.

Troubleshooting

If the module cannot be found, preface the python command with PYTHONPATH=.

License

BLINK is MIT licensed. See the LICENSE file for details.

BLINK's People

Contributors

ankushkundaliya, belindal, dependabot[bot], fabiopetroni, kaisugi, krisbukovi, ledw, martinj96, nicola-decao, pandemosth, ruanchaves, scottyih, sriniiyer, svlandeg, tbonza

BLINK's Issues

Making BLINK run from code.

Hi

A couple of things to make BLINK work from code:

  1. In the example code provided in the README, you have to add "faiss_index" and "index_path" to the config, because they are required as args by load_models in main_dense.py. Also add "top_k", as it is required by run.
  2. main_dense.run in the sample code provided in the README won't work as is. It will raise "run got multiple values for argument logger". To fix this, remove the keyword "logger=" and just pass None, like below:
 _, _, _, _, _, predictions, scores, = main_dense.run(args, None, *models, test_data=data_to_link)
  3. If None is passed as the logger, the code in main_dense.py breaks because it cannot handle a None logger. To fix that, add "if logger:" above all the lines that use the logger, e.g. line 369:
if logger:
    logger.info("interactive mode")
  4. crossencoder/train_cross.py also does not handle None values for logger.

It would be great if these could be fixed. Really useful tool!
Thanks
Rishabh Joshi

Will you be releasing the knowledge base training script?

Hi

This is exciting. There's a chance we'll use it at CNN. I'm interested in the knowledge base training, however. We'd want to be able to retrain on occasion, even if it takes a few weeks on 8 GPUs as you mentioned. Can you provide more insight into whether you'll open-source those scripts?

RuntimeError: [enforce fail at CPUAllocator.cpp:56]

I tried the code given under "4. Use BLINK in your codebase", with "fast" = True set in the config. But when I execute main_dense.load_models(args, logger=None), I get the following runtime error: [enforce fail at CPUAllocator.cpp:56] posix_memalign(&data, gAlignment, nbytes) == 0. 12 vs 0. A quick search on the internet suggests it is due to insufficient RAM; I am currently using 32 GB of RAM. The error occurs while loading the entity encodings.

I would like to know if there is a solution for this problem other than using a machine with more RAM.

Results under bert-base model

Hi, I am very interested in this work and in reproducing the results, but I have to use the bert-base model instead of the large one due to my limited resources.
Have you trained and tested BLINK (bert-base) on the TAC-KBP 2010 and WikilinksNED Unseen-Mentions datasets? If so, could you please share the accuracy results on these two datasets so I can check my reproduction? (For example, on WikilinksNED Unseen-Mentions, the 'Wiki (bi-encoder)' accuracy, the 'Wiki and in-domain' accuracy, etc.)
Thank you!

ZeroDivisionError: float division by zero Error

Hi, I set up the environment for testing BLINK interactively.
When I run any of these commands:

PYTHONPATH=. python blink/main_dense.py -i
PYTHONPATH=. python blink/main_dense.py -i --faiss_index hnsw --index_path models/faiss_hnsw_index.pkl
PYTHONPATH=. python blink/main_dense.py -i --faiss_index flat --index_path models/faiss_flat_index.pkl

and enter the text "federer plays tennis", I get the following error:

insert text:federer plays tennis

federer plays tennis

2020-08-17 23:45:57,810 preparing data for biencoder
2020-08-17 23:45:57,811 run biencoder
0it [00:00, ?it/s]

fast (biencoder) predictions:

federer plays tennis


0it [00:00, ?it/s]
Evaluation: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "blink/main_dense.py", line 680, in <module>
    run(args, logger, *models)
  File "blink/main_dense.py", line 506, in run
    context_len=biencoder_params["max_context_length"],
  File "blink/main_dense.py", line 276, in _run_crossencoder
    res = evaluate(crossencoder, dataloader, device, logger, context_len, silent=False)
  File "<BLINKPATH>/BLINK/blink/crossencoder/train_cross.py", line 95, in evaluate
    normalized_eval_accuracy = eval_accuracy / nb_eval_examples
ZeroDivisionError: float division by zero

However, if I run this command:

PYTHONPATH=. python blink/main_dense.py -i --fast

it works perfectly (not able to link "federer" though, but doesn't crash with the error).

Also, for the first set of commands above, if I enter "roger federer plays tennis", it works and is able to link roger federer. However, if I enter other text (e.g. "i like to play tennis"), it crashes with the same ZeroDivisionError. I guess it cannot handle cases where it does not find any match. Surprisingly, it works in the --fast case.
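
One possible local workaround (a sketch, not an official fix) is to guard the failing division in blink/crossencoder/train_cross.py so an empty evaluation batch does not crash:

# In evaluate() of blink/crossencoder/train_cross.py (sketch of a guard):
if nb_eval_examples > 0:
    normalized_eval_accuracy = eval_accuracy / nb_eval_examples
else:
    normalized_eval_accuracy = 0.0  # nothing was retrieved for this input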

Are entity IDs internal or do they match Wikidata or Wikipedia's IDs?

Hi,

I'm considering this library for entity linking, but within my application I need to find the corresponding Wikipedia URL.
Using your example, I tried BLINK's id (97598) as Q97598 and as pageid 97598, but I get completely different entities, so I'm wondering how I could get the KB id.

Wikidata results, using https://www.wikidata.org/w/api.php?action=wbgetentities&format=xml&props=sitelinks&ids=Q97598&sitefilter=frwiki

<api success="1">
<script/>
<entities>
<entity type="item" id="Q97598">
<sitelinks>
<sitelink site="frwiki" title="Else Quecke">
<badges/>
</sitelink>
</sitelinks>
</entity>
</entities>
</api>

Results as page ID using https://en.wikipedia.org/w/api.php?action=query&prop=info&pageids=97598&inprop=url

{
    "batchcomplete": "",
    "query": {
        "pages": {
            "97598": {
                "pageid": 97598,
                "ns": 1,
                "title": "Talk:Geneva County, Alabama",
                "contentmodel": "wikitext",
                "pagelanguage": "en",
                "pagelanguagehtmlcode": "en",
                "pagelanguagedir": "ltr",
                "touched": "2020-05-05T21:27:47Z",
                "lastrevid": 805052689,
                "length": 3123,
                "fullurl": "https://en.wikipedia.org/wiki/Talk:Geneva_County,_Alabama",
                "editurl": "https://en.wikipedia.org/w/index.php?title=Talk:Geneva_County,_Alabama&action=edit",
                "canonicalurl": "https://en.wikipedia.org/wiki/Talk:Geneva_County,_Alabama"
            }
        }
    }
}

Thanks for any help on this

enwiki-20190801-pages-articles not available

Hi, thanks for open sourcing this framework.

When I use the command ./blink/candidate_retrieval/scripts/get_processed_data.sh data, I get a 404 Not Found error when downloading https://dumps.wikimedia.org/enwiki/20190801/enwiki-20190801-pages-articles.xml.bz2.

This seems to be because Wikimedia only provides the six latest dumps (as seen here), and older dumps become unavailable.

I believe simply using the latest one (e.g. https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2) will solve the error, but I am not sure whether there is another way to download old dumps when we want to use the same dump version.

Thanks!

Questions about fine-tuning

In the original paper, for the TAC-KBP 2010 dataset,

We use a subset of all Wikipedia linked mentions as our training data (A total of 9M examples)

My understanding is that fine-tuning of the model which encodes both the mentions and the entities is done as in Gillick et al., 2019.

But here, it is written that fine-tuning was conducted on the AIDA sets.

I can't figure out, and am confused about, how fine-tuning was done, particularly with respect to the datasets used.
If you know the details of the training datasets used for fine-tuning the models, I'd appreciate it.
Thanks.

PS
Maybe I was mistaken.
In the original paper, it is written that BERT base is used for training.
Does this mean the BERT-base "architecture" was used for training? That is, does the original paper say that 9M mentions were used to train the BERT weights, which were then evaluated on the TAC-KBP datasets?
If so, how was fine-tuning conducted?

Error calling load_models from code rather than command line

logger.info("Using faiss index to retrieve entities.")

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-12-5a468ebd62da> in <module>
      1 global logger
----> 2 models = main_dense.load_models(args, logger=logger)

/data/BLINK/blink/main_dense.py in load_models(args, logger)
    313         faiss_indexer,
    314     ) = _load_candidates(
--> 315         args.entity_catalogue, args.entity_encoding, faiss_index=args.faiss_index, index_path=args.index_path,
    316     )
    317 

/data/BLINK/blink/main_dense.py in _load_candidates(entity_catalogue, entity_encoding, faiss_index, index_path)
    104         indexer = None
    105     else:
--> 106         logger.info("Using faiss index to retrieve entities.")
    107         candidate_encoding = None
    108         assert index_path is not None, "Error! Empty indexer path."

NameError: name 'logger' is not defined

This appears to rely on a logger instance in global scope that is only assigned in main.
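
A workaround sketch, assuming the global-logger behaviour described above: assign the module-level logger that _load_candidates reads before calling load_models (args is the argparse.Namespace from the README example):

import logging

import blink.main_dense as main_dense

# Define the module-level `logger` that main_dense._load_candidates refers to,
# since it is otherwise only assigned when main_dense.py runs as a script.
logging.basicConfig(level=logging.INFO)
main_dense.logger = logging.getLogger("blink")

models = main_dense.load_models(args, logger=main_dense.logger)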

When I run python blink/build_faiss_index.py --output_path models/faiss_flat_index.pkl

I get an error Traceback (most recent call last):
File "blink/build_faiss_index.py", line 14, in
from blink.indexer.faiss_indexer import DenseFlatIndexer, DenseHNSWFlatIndexer
ModuleNotFoundError: No module named 'blink.indexer'

Then I changed line 14 to use "index" instead of "indexer".

After that, I get the error below:

python blink/build_faiss_index.py --output_path models/faiss_flat_index.pkl
10/14/2020 03:02:33 - INFO - Blink - Loading candidate encoding from path: /private/home/ledell/BLINK-Internal/models/all_entities_large.t7
Traceback (most recent call last):
File "blink/build_faiss_index.py", line 74, in
main(params)
File "blink/build_faiss_index.py", line 26, in main
candidate_encoding = torch.load(params["candidate_encoding"])
File "/dccstor/sasdana27/el4qa/lib/python3.7/site-packages/torch/serialization.py", line 381, in load
f = open(f, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/private/home/ledell/BLINK-Internal/models/all_entities_large.t7'

This looks like a path issue. I have the file 'all_entities_large.t7' in BLINK/models/.

Flair version is old

It looks like Flair v0.4.3 is not working, due to a broken link to the location where the embeddings are stored.

flairNLP/flair#1831

This problem prevents BLINK from being used interactively. I guess Flair v0.6.1 solves this bug, but upgrading may adversely affect other libraries BLINK depends on...

Training Cross encoder

Hi,
I have trained the bi-encoder and would like to train the cross-encoder.
Could you please give an example input file for training the cross-encoder?
Thanks in advance.

No module named 'XXXXXX'

Executing Python under Linux reports that the module cannot be found, e.g. from blink.indexer.faiss_indexer import DenseFlatIndexer, DenseHNSWFlatIndexer raises "No module named 'blink'".

How can I solve this problem?

how to update the BLINK knowledge base?

when "mention" is "COVID-19", blink can not link to the right wiki title because the BLINK knowledge base (entity library) is based on the 2019/08/01 Wikipedia dump.

BLINK checkpoint

Could you release the checkpoint of the BLINK model that was trained on the Zero-shot Entity Linking dataset?

use own NER results

Hi, thanks for the amazing tool.

We are wondering whether it would be possible to specify our own NER results (i.e., mention spans) instead of using Flair's.

Thanks!

SolrError while running data_ingestion.py

Hi, thanks for open-sourcing this framework. I got an error while running blink/candidate_retrieval/data_ingestion.py for preprocessing. It gives SolrError: Solr responded with an error (HTTP 404): [Reason: Error 404 Not Found] when the data is added to Solr (solr.add(temp_data, commit=True)). I checked whether Solr was running properly, and since bin/solr status gives "Solr process 49321 running on port 8983", it seems to be running fine. Is there any possible reason for the error? Thank you!

Support for more languages

It looks like this architecture would work for non-English languages too. Wikipedia is available in more languages, Flair has embeddings in other languages, and BERT is available for other languages as well.

Is there something stopping this from being applied to, e.g., Swedish?

Problems in the code regarding Zero-shot EL dataset

@ledw I was training the bi-encoder on the Zero-shot EL dataset.
I found out that the "load_entity_dict_zeshel" function in the "zeshel_utils.py" file uses only the first 256 characters of the entity description to create the entity representation, while the paper mentions that the input to the entity representation model is both the entity title and the first ten sentences of the description.
What is the reason for this difference?

Also, I have trained the bi-encoder and cross-encoder with the instructions provided in the repository. I am getting an unnormalized accuracy of 55.01% with BERT-base, while the paper mentions 61.34%.
I think one of the main reasons for the difference in the results is negative sampling, which is currently not implemented in the code.

Is there any plan to release the implementation of negative sampling?

ELQ failing at faiss index search while training on WebQSP dataset

When trying to train the ELQ model on the WebQSP dataset, I encountered the following error.

Traceback (most recent call last):
  File "elq/biencoder/train_biencoder.py", line 602, in <module>
    main(params)
  File "elq/biencoder/train_biencoder.py", line 293, in main
    logger=logger, faiss_index=cand_encs_index,
  File "elq/biencoder/train_biencoder.py", line 101, in evaluate
    top_cand_logits_shape, top_cand_indices_shape = faiss_index.search_knn(embedding_ctxt, 10)
  File "/data/entity-linking/ELQ/elq/index/faiss_indexer.py", line 123, in search_knn
    scores, indexes = self.index.search(query_vectors, top_k)
  File "/data/anaconda3/envs/el4qa2/lib/python3.7/site-packages/faiss/__init__.py", line 132, in replacement_search
    n, d = x.shape
ValueError: too many values to unpack (expected 2)
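
FAISS's search expects a 2-D float32 query array of shape (n, d); a hedged guess at a local workaround is to flatten any extra dimensions before the call (embedding_ctxt and faiss_index are the names from the traceback above):

import numpy as np

# Workaround sketch (an assumption, not a confirmed fix): collapse extra
# leading dimensions and cast to float32 so FAISS sees an (n, d) array.
# If embedding_ctxt is a GPU tensor, move it to CPU first (embedding_ctxt.cpu()).
query_vectors = np.asarray(embedding_ctxt, dtype=np.float32)
query_vectors = query_vectors.reshape(-1, query_vectors.shape[-1])
top_cand_logits_shape, top_cand_indices_shape = faiss_index.search_knn(query_vectors, 10)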

Training the biencoder

I was trying to reproduce the results. However, I am facing issues with training the bi-encoder. Could you please tell me how to fine-tune the bi-encoder on the datasets mentioned in the paper? Also, can you provide the WikilinksNED Unseen-Mentions dataset that you used? @fabiopetroni

Generating .t7 file for inferencing

Hello,
I am trying to generate a .t7 file for a trained model. For that I am running scripts/generate_candidates.py. This Python script needs another input file, saved_candidates_ids. How do I create this candidate-ids file?
Any pointer would help me run the inference code.

Required operating environment

What platform does everyone run on? I ran the program in Baidu's AI Studio, but the platform crashed and the program terminated.

I want to change the environment and platform to run the program. Can you recommend one?

What about RAM, disk, and so on?

index, and not indexer?

When I did

python3 blink/build_faiss_index.py --output_path models/faiss_flat_index.pkl
I got

Traceback (most recent call last):
  File "blink/build_faiss_index.py", line 14, in <module>
    from blink.indexer.faiss_indexer import DenseFlatIndexer, DenseHNSWFlatIndexer
ModuleNotFoundError: No module named 'blink.indexer'

But it looks like the directory is "index" and not "indexer".

Understanding 'use BLINK in your codebase'

Hi, this is exciting. I'm looking at the "use BLINK in your codebase" snippet.

One question I had is about the data_to_link collection.

It appears the extracted entities are input to the model for link prediction. I'm wondering where this collection comes from, whether there is a helper script to generate it using NER, or whether this is designed to work with any NER library for contextual linking. I'm also using Flair, so I could hook this code into an existing pipeline. If you have another script to format the data or extract entities, that would be interesting to know about.

Pre-train on Wikipedia dump: Questions about data

Hello,

Nice paper! 😃
I want to train the bi-encoder as described in section 5.2.2 of your paper and have some questions about the data that you used.

Can you clarify how the subset of the linked mentions is selected?

we pre-train our models on Wikipedia data. We use the May 2019 English Wikipedia dump which includes 5.9M entities, and use the hyperlinks in articles as examples (the anchor text is the mention). We use a subset of all Wikipedia linked mentions as our training data (A total of 9M examples).

What is the format of the input data for training the model?
train_biencoder.py tries to load training data from a train.jsonl. Can you give a few example rows for such a file?

Is get_processed_data.sh used to process the data?
The name would suggest so, lol. However, the README.md of that folder says [deprecated], so I am not sure. (Maybe you could remove the deprecated code from the repository and use a release tag for the old code instead.)

Could you upload the processed training data?

Question about the training of Section 5.2.2 TACKBP datasets

I am still very interested in your paper and would like to ask you a question.

Is the BERT used for the TAC-KBP dataset pre-trained in advance? The paper doesn't seem to make this explicit.
That is, as for "5.2.1 Zero-shot Entity Linking", did you fine-tune the pretrained BERT and train the bi-encoder at the same time? If so, was in-batch negative sampling used for training, as in [Gillick et al., '19] or [Humeau et al., '20]? (And then re-fine-tune with the top-100 retrieved candidates?)

In summary, I'm confused about this sentence in Sec 5.2.2:

we pre-train our models on Wikipedia data

Does this mean you trained BERT from scratch, or used a pretrained BERT and trained the bi-encoder (which fine-tunes the BERT weights during bi-encoder training)?

If you know about this, I'd appreciate it.

Solr error while querying

I am trying to generate the candidates from step 2 with:

python blink/candidate_retrieval/perform_and_evaluate_candidate_retrieval_multithreaded.py \
--num_threads 70 \
--dump_mentions \
--dump_file_id train_and_eval_data \
--include_aida_train

However, while querying the collection, Solr gives an exception:

SolrError("Failed to connect to server at ... Failed to establish a new connection: [Errno 111] 
Connection  refused',))",)

More specifically, the exception occurs in file candidate_generators.py, on line 85:

results = solr.search(query, **self.query_arguments)

If I check the GUI and run the queries there, it works fine.
What could possibly go wrong? Solr seems to be running fine as well.

Thanks!

ModuleNotFoundError: No module named 'blink'

When I run python blink/main_dense.py -i, I get an error which I am unable to fix:

Traceback (most recent call last):
  File "blink/main_dense.py", line 6, in <module>
    import blink.ner as NER
ModuleNotFoundError: No module named 'blink'

My sys.path returns a list which contains <rest-of-the-path>\BLINK\\blink. Please help.
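
Besides prefixing commands with PYTHONPATH=. (see the Troubleshooting section above), a minimal in-code workaround is to put the repository root on sys.path before importing blink; the path below is a placeholder:

import sys

# Placeholder path: point this at the directory that *contains* the blink/
# package (the BLINK repository root), not at the blink/ folder itself.
sys.path.insert(0, r"C:\path\to\BLINK")

import blink.main_dense as main_dense  # noqa: E402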

Reproduce the recall result on Zero-shot EL dataset

Hi, I used the code and hyperparameters you released on GitHub to train bert-base-uncased on the Zero-shot EL dataset, but I cannot reproduce the result reported in the paper. How should I adjust the hyperparameters?
The following are the hyperparameters I used for training:
learning_rate 1e-05
num_train_epochs 5
max_context_length 128
max_cand_length 128
train_batch_size 128
eval_batch_size 64
bert_model bert-base-uncased
type_optimization all_encoder_layers

Doesn't work on Colab

Getting this error when trying to run it: tcmalloc: large alloc 24180850688 bytes

Have followed all instructions.

python blink/main_dense.py -i is the command it fails on.

JSON Web API

Hi, we've been using BLINK for some work at UMD and have been finding it helpful to create web API wrappers around various entity linkers to make it easier for multiple projects to use. Earlier this week I implemented the first pass on one for BLINK, and wanted to check if there is interest in me creating a PR for it. Thanks!

The code is here: https://github.com/EntilZha/BLINK/blob/master/blink/main_api.py

and run with python blink/main_api.py --fast --mode api


The response looks like this:

{
  "samples": [
    {
      "label": "unknown",
      "label_id": -1,
      "context_left": "",
      "context_right": " was a british computer scientist known for creating the turing machine.",
      "mention": "alan turing",
      "start_pos": 0,
      "end_pos": 11,
      "sent_idx": 0
    },
    {
      "label": "unknown",
      "label_id": -1,
      "context_left": "alan turing was a ",
      "context_right": " computer scientist known for creating the turing machine.",
      "mention": "british",
      "start_pos": 18,
      "end_pos": 25,
      "sent_idx": 0
    },
    {
      "label": "unknown",
      "label_id": -1,
      "context_left": "alan turing was a british computer scientist known for creating the ",
      "context_right": ".",
      "mention": "turing machine",
      "start_pos": 68,
      "end_pos": 82,
      "sent_idx": 0
    }
  ],
  "linked_entities": [
    {
      "idx": 0,
      "sample": {
        "label": "unknown",
        "label_id": -1,
        "context_left": "",
        "context_right": " was a british computer scientist known for creating the turing machine.",
        "mention": "alan turing",
        "start_pos": 0,
        "end_pos": 11,
        "sent_idx": 0
      },
      "entity_id": 308,
      "entity_title": "Alan Turing",
      "entity_text": " Alan Mathison Turing (; 23 June 1912 – 7 June 1954) was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist. Turing was highly influential in the development of theoretical computer science, providing a formalisation of the concepts of algorithm and computation with the Turing machine, which can be considered a model of a general-purpose computer. Turing is widely considered to be the father of theoretical computer science and artificial intelligence. Despite these accomplishments, he was never fully recognised in his home country during his lifetime, due to his homosexuality, which was then a crime in the UK, and because his work was covered by the Official Secrets Act.  During the Second World War, Turing worked for the Government Code and Cypher School (GC&CS) at Bletchley Park, Britain's codebreaking centre that produced Ultra intelligence. For a time he led Hut 8, the section that was responsible for German naval cryptanalysis. Here, he devised a number of techniques for speeding the breaking of German ciphers, including improvements to the pre-war Polish bombe method, an electromechanical machine that could find settings for the Enigma machine.  Turing played a pivotal role in cracking intercepted coded messages that enabled the Allies to defeat the Nazis in many crucial engagements, including the Battle of the Atlantic, and in so doing helped win the war. Due to the problems of counterfactual history, it's hard to estimate what effect Ultra intelligence had on the war, but at the upper end it has been estimated that this work shortened the",
      "url": "https://en.wikipedia.org/wiki?curid=1208",
      "crossencoder": false
    },
    {
      "idx": 1,
      "sample": {
        "label": "unknown",
        "label_id": -1,
        "context_left": "alan turing was a ",
        "context_right": " computer scientist known for creating the turing machine.",
        "mention": "british",
        "start_pos": 18,
        "end_pos": 25,
        "sent_idx": 0
      },
      "entity_id": 15734,
      "entity_title": "United Kingdom",
      "entity_text": " The United Kingdom of Great Britain and Northern Ireland, commonly known as the United Kingdom (UK) or Britain, is a sovereign country located off the north-western coast of the European mainland. The United Kingdom includes the island of Great Britain, the north-eastern part of the island of Ireland, and many smaller islands. Northern Ireland is the only part of the United Kingdom that shares a land border with another sovereign state, the Republic of Ireland. Apart from this land border, the United Kingdom is surrounded by the Atlantic Ocean, with the North Sea to the east, the English Channel to the south and the Celtic Sea to the south-west, giving it the 12th-longest coastline in the world. The Irish Sea lies between Great Britain and Ireland. The United Kingdom's were home to an estimated 66.0 million inhabitants in 2017.  The United Kingdom is a unitary parliamentary democracy and constitutional monarchy. The current monarch is Queen Elizabeth II, who has reigned since 1952, making her the world's longest-serving current head of state. The United Kingdom's capital and largest city is London, a global city and financial centre with an urban area population of 10.3 million. Other major cities include Birmingham, Manchester, Glasgow, Leeds and Liverpool.  The United Kingdom consists of four constituent countries: England, Scotland, Wales, and Northern Ireland. Their capitals are London, Edinburgh, Cardiff, and Belfast, respectively. Apart from England, the countries have their own devolved governments, each with varying powers, but such power is delegated by the Parliament of the United Kingdom,",
      "url": "https://en.wikipedia.org/wiki?curid=31717",
      "crossencoder": false
    },
    {
      "idx": 2,
      "sample": {
        "label": "unknown",
        "label_id": -1,
        "context_left": "alan turing was a british computer scientist known for creating the ",
        "context_right": ".",
        "mention": "turing machine",
        "start_pos": 68,
        "end_pos": 82,
        "sent_idx": 0
      },
      "entity_id": 15078,
      "entity_title": "Turing machine",
      "entity_text": " A Turing machine is a mathematical model of computation that defines an abstract machine, which manipulates symbols on a strip of tape according to a table of rules. Despite the model's simplicity, given any computer algorithm, a Turing machine capable of simulating that algorithm's logic can be constructed.  The machine operates on an infinite memory tape divided into discrete \"cells\". The machine positions its \"head\" over a cell and \"reads\" or \"scans\" the symbol there. Then, as per the symbol and its present place in a \"finite table\" of user-specified instructions, the machine (i) writes a symbol (e.g., a digit or a letter from a finite alphabet) in the cell (some models allowing symbol erasure or no writing), then (ii) either moves the tape one cell left or right (some models allow no motion, some models move the head), then (iii) (as determined by the observed symbol and the machine's place in the table) either proceeds to a subsequent instruction or halts the computation.  The Turing machine was invented in 1936 by Alan Turing, who called it an \"a-machine\" (automatic machine). With this model, Turing was able to answer two questions in the negative: (1) Does a machine exist that can determine whether any arbitrary machine on its tape is \"circular\" (e.g., freezes, or fails to continue its computational task); similarly, (2) does a machine exist that can determine whether any arbitrary machine on its tape ever prints a given symbol. Thus by providing a mathematical description of a very simple device capable",
      "url": "https://en.wikipedia.org/wiki?curid=30403",
      "crossencoder": false
    }
  ]
}

How much RAM does data processing consume at peak?

I'm unable to process the data on a machine with 64 GB of RAM; my estimate is that ./blink/candidate_retrieval/scripts/get_processed_data.sh data requires at least 160-190 GB of RAM. My estimate for blink/candidate_retrieval/enrich_data.py is that it will require even more.

How much RAM should be available to fully process the data, load it into Solr, and run the demo (section "Use BLINK" of the README)?

Corrupted file

Hi, I've downloaded all the files and I can run the interactive fast version. However, when I set the fast flag to False, I get the following error:

RuntimeError: unexpected EOF, expected 76565490 more bytes. The file might be corrupted.

I assume the crossencoder_wiki_large.bin file might be the problem. The version I downloaded is 1.2 GB, and I tried re-downloading it several times, but it didn't help. Could you please look into this?

Benchmark Results

Hi,

I would like to ask a question regarding the benchmark results on widely used EL datasets like AIDA-YAGO and msnbc. I noticed that the reported accuracies are much lower than those of older non-transformer-based models, which is surprising considering the high recall@100. Is this purely due to not fine-tuning on the training sets of those datasets?

what threshold for the score to accept the prediction ?

I'm using BLINK for entity mapping.
I want to know the range of values the prediction scores can take, so that I can set an acceptable threshold above which to consider a predicted link as correct. Looking at some examples, I've noticed that the values can be positive or negative, and when the returned Wikipedia label is correct the score is highly positive. Is my assumption true? How should the score values be interpreted?
Thanks in advance
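
One common way to make such scores comparable across mentions, sketched below under the assumption that the returned scores are unnormalized logits over the top-k candidates, is to softmax them and threshold the resulting probability; the 0.5 cutoff is purely illustrative.

import numpy as np

def accept_best(candidate_scores, threshold=0.5):
    # Sketch (assumption: scores are raw logits over the top-k candidates).
    # Softmax the scores and accept the best candidate only if its
    # probability clears an illustrative threshold.
    scores = np.asarray(candidate_scores, dtype=np.float64)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    best = int(np.argmax(probs))
    accepted = probs[best] >= threshold
    return (best if accepted else None), float(probs[best])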

Error in generating candidate encodings for custom KB with eval_biencoder.py

I'm trying to generate and store candidate encodings for a custom KB. For this I am using the eval_biencoder.py script.

I have found two issues in eval_biencoder.py:

  1. Typo at line #42: "degug" instead of "debug" is passed to params.
  2. main() directly calls encode_candidate_zeshel() to generate the candidate encodings without checking the candidate_pool type. candidate_pool for non-zeshel data is not a dictionary but a list, and hence it throws an error at line #131.


Where can I find BLINK's training script?

Hi BLINK's authors,
Thanks for releasing the pretrained model and the code! They are very useful.
I just wonder where I can find the training scripts to train the bi-encoder and cross-encoder models on my own data.

ELQ has the training script available at: https://github.com/facebookresearch/BLINK/tree/master/elq_slurm_scripts
Could anyone point me to the corresponding training script for BLINK? I'm looking at the entity linking task for long texts.

Thanks in advance!
