
Spacy Entity Linker

Introduction

Spacy Entity Linker is a pipeline for spaCy that performs Linked Entity Extraction with Wikidata on a given Document. The entity linking system works by matching potential candidates from each sentence (subject, object, prepositional phrase, compounds, etc.) against aliases from Wikidata. The package makes it easy to find the category behind each entity (e.g. "banana" is type "food" OR "Microsoft" is type "company"). It is therefore useful for information extraction tasks and labeling tasks.

The package was written before a working Linked Entity Solution existed inside spaCy. In comparison to spaCy's linked entity system, it has the following advantages:

  • no extensive training required (entity-matching via database)
  • knowledge base can be dynamically updated without retraining
  • entity categories can be easily resolved
  • grouping entities by category

It also comes along with a number of disadvantages:

  • it is slower than the spaCy implementation due to the use of a database for finding entities
  • no context sensitivity due to the implementation of the "max-prior method" for entity disambiguation (an improved method for this is in progress)

Installation

To install the package, run:

pip install spacy-entity-linker

Afterwards, the knowledge base (Wikidata) must be downloaded. This can be done either by manually calling

python -m spacy_entity_linker "download_knowledge_base"

or automatically the first time you access the entity linker through spaCy. This will download and extract a ~1.3GB file that contains a preprocessed version of Wikidata.

Use

import spacy  # version 3.5

# initialize language model
nlp = spacy.load("en_core_web_md")

# add pipeline (declared through entry_points in setup.py)
nlp.add_pipe("entityLinker", last=True)

doc = nlp("I watched the Pirates of the Caribbean last silvester")

# returns all entities in the whole document
all_linked_entities = doc._.linkedEntities
# iterates over sentences and prints linked entities
for sent in doc.sents:
    sent._.linkedEntities.pretty_print()

# OUTPUT:
# https://www.wikidata.org/wiki/Q194318     Pirates of the Caribbean        Series of fantasy adventure films                                                                   
# https://www.wikidata.org/wiki/Q12525597   Silvester                       the day celebrated on 31 December (Roman Catholic Church) or 2 January (Eastern Orthodox Churches)  

# entities are also directly accessible through spans
doc[3:7]._.linkedEntities.pretty_print()
# OUTPUT:
# https://www.wikidata.org/wiki/Q194318     Pirates of the Caribbean        Series of fantasy adventure films

EntityCollection

contains an array of entity elements. It can be accessed like an array but also implements the following helper functions:

  • pretty_print() prints out information about all contained entities
  • print_super_entities() groups and prints all entities by their super class
doc = nlp("Elon Musk was born in South Africa. Bill Gates and Steve Jobs come from the United States")
doc._.linkedEntities.print_super_entities()
# OUTPUT:
# human (3) : Elon Musk,Bill Gates,Steve Jobs
# country (2) : South Africa,United States of America
# sovereign state (2) : South Africa,United States of America
# federal state (1) : United States of America
# constitutional republic (1) : United States of America
# democratic republic (1) : United States of America

EntityElement

Each linked entity is an object of type EntityElement. Each entity provides the following methods:

  • get_description() returns description from Wikidata
  • get_id() returns Wikidata ID
  • get_label() returns Wikidata label
  • get_span(doc) returns the span from the spaCy document that contains the linked entity. You need to provide the current doc as an argument in order to receive an actual spacy.tokens.Span object; otherwise you will receive a SpanInfo object emulating the behaviour of a Span
  • get_url() returns the url to the corresponding Wikidata item
  • pretty_print() prints out information about the entity element
  • get_sub_entities(limit=10) returns EntityCollection of all entities that derive from the current entityElement (e.g. fruit -> apple, banana, etc.)
  • get_super_entities(limit=10) returns EntityCollection of all entities that the current entityElement derives from (e.g. New England Patriots -> Football Team)

Usage of the get_span method with SpanInfo:

import spacy
nlp = spacy.load('en_core_web_md')
nlp.add_pipe("entityLinker", last=True)
text = 'Apple is competing with Microsoft.'
doc = nlp(text)
sents = list(doc.sents)
ent = doc._.linkedEntities[0]

# using the SpanInfo class
span = ent.get_span()
print(span.start, span.end, span.text) # behaves like a Span

# check equivalence
print(span == doc[0:1]) # True
print(doc[0:1] == span) # TypeError: Argument 'other' has incorrect type (expected spacy.tokens.span.Span, got SpanInfo)

# now get the real span
span = ent.get_span(doc) # passing the doc instance here
print(span.start, span.end, span.text)

print(span == doc[0:1]) # True
print(doc[0:1] == span) # True

Example

In the following example we use Spacy Entity Linker to find the football team mentioned in our text and explore other football teams of the same type:

doc = nlp("I follow the New England Patriots")

patriots_entity = doc._.linkedEntities[0]
patriots_entity.pretty_print()
# OUTPUT:
# https://www.wikidata.org/wiki/Q193390     
# New England Patriots            
# National Football League franchise in Foxborough, Massachusetts                    

football_team_entity = patriots_entity.get_super_entities()[0]
football_team_entity.pretty_print()
# OUTPUT:
# https://www.wikidata.org/wiki/Q17156793 
# American football team          
# organization, in which a group of players are organized to compete as a team in American football   


for child in football_team_entity.get_sub_entities(limit=32):
    print(child)
    # OUTPUT:
    # New Orleans Saints
    # New York Giants
    # Pittsburgh Steelers
    # New England Patriots
    # Indianapolis Colts
    # Miami Seahawks
    # Dallas Cowboys
    # Chicago Bears
    # Washington Redskins
    # Green Bay Packers
    # ...

Entity Linking Policy

Currently, the only method for choosing an entity among multiple possible matches (e.g. Paris the city vs. Paris the first name) is max-prior. This method achieves around 70% accuracy in predicting the correct entities behind link descriptions on Wikipedia.
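In sketch form, max-prior simply selects, among all Wikidata candidates matching a mention's surface form, the one with the highest prior probability, ignoring sentence context entirely. The candidates and prior values below are invented for illustration (only Q90 is a real Wikidata ID):

```python
# Toy illustration of max-prior disambiguation. The prior values are
# made up; the real priors come from the knowledge base.
def max_prior(candidates):
    """candidates: list of (wikidata_id, label, prior) tuples."""
    return max(candidates, key=lambda c: c[2])

paris_candidates = [
    ("Q90",     "Paris, capital of France", 0.87),
    ("Q111111", "Paris, given name",        0.04),  # placeholder ID
    ("Q222222", "Paris, Texas",             0.02),  # placeholder ID
]

best = max_prior(paris_candidates)
print(best[0])  # Q90
```

Because the choice depends only on the prior, "Paris" always resolves to the city here, even in a sentence about first names — which is exactly the context-insensitivity noted among the disadvantages above.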

Note

The Entity Linker at the current state is still experimental and should not be used in production mode.

Performance

The current implementation supports only SQLite. This is advantageous for development because it does not require any special setup or configuration. However, for more performance-critical use cases, a different database with in-memory access (e.g. Redis) should be used. This may be implemented in the future.
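Until such a backend exists, one stopgap (a sketch, not part of the package's API) is to copy the SQLite file into an in-memory connection using sqlite3's backup API, which removes disk I/O from lookups at the cost of roughly the database's size in RAM:

```python
import os
import sqlite3
import tempfile

def load_into_memory(db_path):
    """Copy an on-disk SQLite database into a fresh :memory: connection."""
    disk = sqlite3.connect(db_path)
    mem = sqlite3.connect(":memory:")
    disk.backup(mem)  # Connection.backup is available since Python 3.7
    disk.close()
    return mem

# Demonstrated on a throwaway database; with the package you would point
# db_path at the installed wikidb_filtered.db file instead.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(db_path)
con.execute("CREATE TABLE t (id INTEGER, label TEXT)")
con.execute("INSERT INTO t VALUES (1, 'banana')")
con.commit()
con.close()

mem = load_into_memory(db_path)
print(mem.execute("SELECT label FROM t").fetchone())  # ('banana',)
```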

Data

The knowledge base was derived from this dataset: https://www.kaggle.com/kenshoresearch/kensho-derived-wikimedia-data

It was cleaned and post-processed, which included filtering out entities of "overrepresented" categories such as

  • village in China
  • train stations
  • stars in the Galaxy
  • etc.

The purpose behind the knowledge base cleaning was to reduce the knowledge base size, while keeping the most useful entities for general purpose applications.

Currently, the only way to change the knowledge base is a bit hacky and requires replacing or modifying the underlying SQLite database. You will find it under site_packages/data_spacy_entity_linker/wikidb_filtered.db. The database contains three tables: aliases, joined, and statements.
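Before modifying the database, its schema can be inspected with Python's built-in sqlite3 module. The helper below is ours, not part of the package; point db_path at your installed wikidb_filtered.db (the demo uses a throwaway file):

```python
import os
import sqlite3
import tempfile

def list_tables(db_path):
    """Return the table names of a SQLite database, sorted alphabetically."""
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    ).fetchall()
    con.close()
    return [name for (name,) in rows]

# Demonstrated on a throwaway file; substitute the path to
# site_packages/data_spacy_entity_linker/wikidb_filtered.db.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(db_path)
con.execute("CREATE TABLE aliases (en_alias TEXT)")
con.execute("CREATE TABLE joined (en_label TEXT, en_description TEXT)")
con.commit()
con.close()
print(list_tables(db_path))  # ['aliases', 'joined']
```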

Versions:

  • spacy_entity_linker>=0.0 (requires spacy>=2.2,<3.0)
  • spacy_entity_linker>=1.0 (requires spacy>=3.0)

TODO

  • implement Entity Classifier based on sentence embeddings for improved accuracy
  • implement get_picture_urls() on EntityElement
  • retrieve statements for each EntityElement (inlinks + outlinks)

spacy-entity-linker's People

Contributors

dennlinger, egerber, janskuli, jonwiggins, martinomensio


spacy-entity-linker's Issues

Unexpected keyword argument `max_depth`

The below code snippet is responsible for a bug, causing a TypeError: get_chain() got an unexpected keyword argument 'max_depth', due to an incorrect call to self.get_chain().

def get_chain_ids(self, max_depth=10):
    if self.chain_ids is None:
        self.chain_ids = set([el[0] for el in self.get_chain(max_depth=max_depth)])
    return self.chain_ids

It isn't immediately obvious to me what the correct code should be, but a fix should be relatively trivial if the expected behavior is known.

HTML code in output <EntityElement:

Hello,

Thank you for this great alternative. I am currently starting a new project to create a domain-specific knowledge base for NER.
I have tested all the methods in EntityElement. It's working perfectly. Only one strange thing when I run...

for sent in doc.sents:
    sent._.linkedEntities.pretty_print()

My output on VS Code and Jupyter comes with HTML code:

<EntityElement: https://www.wikidata.org/wiki/Q194318 Pirates of the Caribbean Series of fantasy adventure films >
<EntityElement: https://www.wikidata.org/wiki/Q12525597 Silvester the day celebrated on 31 December (Roman Catholic Church) or 2 January (Eastern Orthodox Churches)>

Any advice?

Best,

IndexError on Strings containing Certain Characters

When running a basic NLP model like en_core_web_lg with the sole addition of an entityLinker pipe, calling nlp() will throw an IndexError on certain strings, particularly those with certain whitespace characters such as newline characters. The error thrown and the line causing the error is:

def get_candidates_in_sent(self, sent, doc):
---->   root = list(filter(lambda token: token.dep_ == "ROOT", sent))[0]
        excluded_children = []
        candidates = []

IndexError: list index out of range
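Until the root lookup is guarded against sentences without a ROOT token, a pragmatic workaround (a sketch, not an official fix) is to normalize whitespace before handing the text to the pipeline, so that no "sentence" consists solely of whitespace characters:

```python
def normalize_whitespace(text):
    """Collapse runs of whitespace, including newlines, into single spaces."""
    return " ".join(text.split())

raw = "First sentence.\n\n\n   Second sentence.\t"
clean = normalize_whitespace(raw)
print(clean)  # First sentence. Second sentence.
# doc = nlp(clean)  # then run the pipeline on the cleaned text
```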

I'm running Python version 3.9, spaCy version 3.2.4, and spaCy-entity-linker version 1.0.1

OperationalError: unable to open database file

Hi, every time I run doc = nlp("I watched the Pirates of the Caribbean last silvester") I get the error OperationalError: unable to open database file. Is anyone getting a similar error?

Reproducing the underlying SQLite database

Hi,
I was just looking through some of the TODOs in the README, and found that a general limitation is the available offline DB for querying relevant information. E.g., to obtain URLs for images associated with entities, this would require access to property P18, which is not currently included in the statements table.

Given that Wikidata is also constantly updating their knowledge base (e.g., "COVID-19" is not currently included), I was wondering if there is any chance @egerber still has the filtering script somewhere, which would allow updates to the database and subsequently allow for optimizations in a more general direction.

Best,
Dennis

License

Thank you for the pipeline! Would you mind adding a license file?

Issue downloading the database

Hi!

I had issues downloading the knowledge database:

Running python -m spacy_entity_linker "download_knowledge_base" in the terminal, and also running the example code provided under the title 'USE' resulted in the following error:

urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:992)>
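A commonly seen cause of this error is that Python cannot locate a CA certificate bundle on the machine (on macOS, running the Install Certificates.command shipped with the python.org installer usually fixes it). As a last resort, and only on a trusted network, certificate verification can be disabled for urllib-based downloads before triggering the knowledge-base download, sketched below:

```python
import ssl

# INSECURE workaround: disable certificate verification for urllib-based
# downloads in this interpreter session. Prefer fixing the CA bundle
# (e.g. via certifi or macOS's Install Certificates.command).
ssl._create_default_https_context = ssl._create_unverified_context

# Then trigger the download in the same session, e.g. by loading the
# pipeline so that the knowledge base is fetched on first access.
```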

I installed version 1.0.3, spacy 3.5.0 and python 3.11.

Allow setting database path

Related to #29.

Currently, there doesn't seem to be a way to set the database path without modifying the source code for the package.

Interested in discussion

Hi Emanuel, very interested in your work here. I am a UK-based business analyst interested in understanding more. I wonder if you had a little bit of time to discuss? Please let me know. Gerald

fail to download knowledge base

Fail to download knowledge base using:
python -m spacy_entity_linker "download_knowledge_base"

Error message displayed: HTTP Error 403: Forbidden

Are categories or span.labels_ retrievable?

Is there a way to retrieve an EntityElement's category from Wikidata, or its span label? When I try, for example:

entity.get_span().label_, it only prints out a blank line.

I'm asking because the documentation says that "the package allows to easily find the category behind each entity (e.g. "banana" is type "food" OR "Microsoft" is type "company")".

Crashes when n_processes > 1

I'm seeing an issue where the pipeline crashes during msgpack serialization when I set n_processes > 1.

I want to leverage all cores on my AWS c5a.2xlarge machine so I have set n_processes = number of cores.

But now my entity linking pipeline is crashing.

Below is a paste of the stacktrace:

Traceback (most recent call last):
  File "/home/ubuntu/workspace/official_benchmark_with_linking.py", line 95, in <module>
    results = process_messages(doc_batch)
  File "/home/ubuntu/workspace/official_benchmark_with_linking.py", line 65, in process_messages
    _ = [d for d in nlp.pipe(input_data,
  File "/home/ubuntu/workspace/official_benchmark_with_linking.py", line 65, in <listcomp>
    _ = [d for d in nlp.pipe(input_data,
  File "/home/ubuntu/workspace/.env/lib/python3.10/site-packages/spacy/language.py", line 1574, in pipe
    for doc in docs:
  File "/home/ubuntu/workspace/.env/lib/python3.10/site-packages/spacy/language.py", line 1657, in _multiprocessing_pipe
    self.default_error_handler(
  File "/home/ubuntu/workspace/.env/lib/python3.10/site-packages/spacy/util.py", line 1672, in raise_error
    raise e
ValueError: [E871] Error encountered in nlp.pipe with multiprocessing:

Traceback (most recent call last):
  File "/home/ubuntu/workspace/.env/lib/python3.10/site-packages/spacy/language.py", line 2273, in _apply_pipes
    byte_docs = [(doc.to_bytes(), doc._context, None) for doc in docs]
  File "/home/ubuntu/workspace/.env/lib/python3.10/site-packages/spacy/language.py", line 2273, in <listcomp>
    byte_docs = [(doc.to_bytes(), doc._context, None) for doc in docs]
  File "spacy/tokens/doc.pyx", line 1316, in spacy.tokens.doc.Doc.to_bytes
  File "spacy/tokens/doc.pyx", line 1375, in spacy.tokens.doc.Doc.to_dict
  File "/home/ubuntu/workspace/.env/lib/python3.10/site-packages/spacy/util.py", line 1312, in to_dict
    serialized[key] = getter()
  File "spacy/tokens/doc.pyx", line 1372, in spacy.tokens.doc.Doc.to_dict.lambda20
  File "/home/ubuntu/workspace/.env/lib/python3.10/site-packages/srsly/_msgpack_api.py", line 14, in msgpack_dumps
    return msgpack.dumps(data, use_bin_type=True)
  File "/home/ubuntu/workspace/.env/lib/python3.10/site-packages/srsly/msgpack/__init__.py", line 55, in packb
    return Packer(**kwargs).pack(o)
  File "srsly/msgpack/_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
  File "srsly/msgpack/_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
  File "srsly/msgpack/_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
  File "srsly/msgpack/_packer.pyx", line 264, in srsly.msgpack._packer.Packer._pack
  File "srsly/msgpack/_packer.pyx", line 282, in srsly.msgpack._packer.Packer._pack
TypeError: can not serialize 'EntityCollection' object

403 Forbidden

Trying to download the knowledge base from https://wikidatafiles.nyc3.digitaloceanspaces.com/Hosting/Hosting/SpacyEntityLinker/datafiles.tar.gz fails with a 403: Forbidden error.

how to use nlp.pipe

Thanks for this useful tool, but I want to use multiple processors to accelerate the NER and ETL. I found that spaCy provides an nlp.pipe method, but it is not compatible with your pipeline. My question is: how do I use nlp.pipe with your pipeline?
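As the serialization traceback in the multiprocessing issue above shows, doc._.linkedEntities holds an EntityCollection that cannot be msgpack-serialized across process boundaries, so n_process > 1 fails. nlp.pipe can still be used for batched single-process throughput; the sketch below (the helper is ours, not part of the package) reduces each doc to plain tuples before anything needs to be serialized:

```python
# Sketch: use nlp.pipe for batched processing with n_process=1, since the
# EntityCollection stored in doc._.linkedEntities is not msgpack-
# serializable and therefore breaks spaCy's multiprocessing. Reducing
# each doc to plain tuples keeps the results picklable if you
# parallelize elsewhere.
def extract_entities(docs):
    """Reduce each doc to serializable (id, label) tuples."""
    results = []
    for doc in docs:
        ents = getattr(doc._, "linkedEntities", [])
        results.append([(e.get_id(), e.get_label()) for e in ents])
    return results

# Usage with spaCy (requires en_core_web_md and the knowledge base):
# import spacy
# nlp = spacy.load("en_core_web_md")
# nlp.add_pipe("entityLinker", last=True)
# results = extract_entities(nlp.pipe(texts, batch_size=64, n_process=1))
```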

"Max prior" method

Could you please elaborate on how the entity linking algorithm works?

example in README doesn't work

Hi - this is a really interesting project - thanks for making it available. However, the basic example in the README doesn't work with a fresh install (i.e., spacy==2.3.5 and spacy-entity-linker==0.0.5):

doc = nlp("I watched the Pirates of the Carribean last silvester")
all_linked_entities=doc._.linkedEntities
for sent in doc.sents:
    sent._.linkedEntities.pretty_print()

returns the following and resolves Pirates of Carribean to Pittsburgh Pirates:

https://www.wikidata.org/wiki/Q653772     653772     Pittsburgh Pirates              baseball team and Major League Baseball franchise in Pittsburgh, Pennsylvania, United States        
https://www.wikidata.org/wiki/Q12525597   12525597   Silvester                       the day celebrated on 31 December (Roman Catholic Church) or 2 January (Eastern Orthodox Churches) 

Changing the dataset

I'm trying to tweak the dataset to use my own data for a use case, but the model keeps on pointing to the original dataset somehow. Do I have to clone the repo and upload the model to pip?

Translating the database?

First of all, thanks for this great library!

As the title suggests, I'm wondering whether it would be possible to port this to other natural languages by translating the database using Wikidata requests. I had a look at the database and, from my very limited understanding of this, I would just translate en_label and en_description (in joined) and rebuild the aliases table based on the "also known as" field in Wikidata.

While this seems technically feasible, it is of course quite time-consuming to make so many requests. Fortunately, however, the Wikidata API returns all the available languages for each request. More importantly, in my particular case I'm only interested in a very limited set of entity types.

My question is: Am I oversimplifying this and missing important details, which would make this more complicated than the idea sketched above?

DatabaseError: database disk image is malformed

Hi there, thank you for offering such a useful tool. However, after downloading the database and running the sample code, I encounter DatabaseError: database disk image is malformed. Can someone help me with this error? Thanks a lot!

downloaded wikidataset not connected or library not found

When I use the following code:
# pip install spacy-entity-linker
# python -m spacy_entity_linker "download_knowledge_base"

import spacy

nlp = spacy.load("en_core_web_md")
nlp.add_pipe("entity_linker", last=True)
doc = nlp("I watched the Pirates of the Caribbean last silvester")
all_linked_entities = doc._.linkedEntities

for sent in doc.sents:
    sent._.linkedEntities.pretty_print()

I get: 'ValueError: [E139] Knowledge base for component 'entity_linker' is empty. Use the methods kb.add_entity and kb.add_alias to add entries.'
I might need to add the downloaded KG somewhere but it is nowhere stated.

The original code states that add_pipe should be:
nlp.add_pipe("entity_linker", last=True)

But then I get the error:
ValueError: [E002] Can't find factory for 'entityLinker' for language English (en). This usually happens when spaCy calls nlp.create_pipe with a custom component name that's not registered on the current language class

Where are things going wrong?
