Comments (4)
If you still have this issue, or for anyone coming after me with the same problem: I was able to fix it "quick and dirty".
In the "get_tokens_with_heads" method of the "StanzaTokenizer" class (tokenizer.py, line 194), I added the following lines to the loop iterating over the tokens of the sentence, before the inner loop over the words of each token:
def get_tokens_with_heads(self, snlp_doc):
    """Flatten the tokens in the Stanza Doc and extract the token indices
    of the sentence start tokens to set is_sent_start.

    snlp_doc (stanza.Document): The processed Stanza doc.
    RETURNS (list): The tokens (words).
    """
    tokens = []
    heads = []
    offset = 0
    for sentence in snlp_doc.sentences:
        for token in sentence.tokens:
            # --- insert this ---
            if len(token.words) > 1:
                # Collapse the multi-word token into a single word:
                # keep the first word's annotation, use the surface
                # text, and join the expanded forms as the lemma.
                heads.append(0)
                word = token.words[0].to_dict()
                word["text"] = token.text
                word["lemma"] = " ".join(w.text for w in token.words)
                tokens.append(Word(word))
                continue
            # --- end of insertion ---
            for word in token.words:
                ...
This generates only one word per multi-word token: the text is the actual surface text of the token, and the lemma is the concatenation of the expanded words. I fixed the type to the type of the first word in the multi-word token. This solution mainly has in mind cases like "am, vom, ins, zum", which are always contractions of a preposition and an article. Since I consider the article to carry less information than the preposition, I "copied" the preposition's annotation and overwrote its "text" and "lemma" attributes.
This is certainly not complete, but it is very quick to implement. My code now runs without errors and collects all the information I needed.
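The collapsing step above can be sketched as a standalone function, independent of spacy-stanza's classes (the function name and the plain-dict word representation here are illustrative, not part of the library):

```python
def collapse_mwt(token_text, word_dicts):
    """Collapse a multi-word token into a single word entry.

    Keeps the annotation (POS etc.) of the first expanded word,
    overwrites the text with the surface token, and joins the
    expanded forms as the lemma -- mirroring the hack above.
    """
    merged = dict(word_dicts[0])  # copy the first word's annotation
    merged["text"] = token_text   # surface form, e.g. "im"
    merged["lemma"] = " ".join(w["text"] for w in word_dicts)
    return merged

# German "im" expands to "in" + "dem"
merged = collapse_mwt("im", [
    {"text": "in", "upos": "ADP"},
    {"text": "dem", "upos": "DET"},
])
# merged: {"text": "im", "upos": "ADP", "lemma": "in dem"}
```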
from spacy-stanza.
The background is that we originally developed spacy-stanza before NER components were added, so we focused on providing access to the morpho-syntactic annotation, which is annotated on the expanded multi-word tokens rather than on the original text tokens. Since a spaCy Doc can only represent one layer of tokenization, we use the expanded multi-word tokens in the returned Doc.
We can't "simply proceed" because the code currently uses the character offsets to add the NER annotation, so if they no longer align with the text, it's not trivial to add the annotation to the doc. I think it should be possible to use information from the Document to align the annotations, but it would require some updates to the alignment algorithm in spacy-stanza. (If this is something you'd like to work on, PRs are welcome!)
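The offset misalignment can be seen with a toy example (illustrative only; this is not spacy-stanza's actual alignment code, and the naive one-space-per-word reconstruction is an assumption for the demo):

```python
# Why character offsets stop lining up once multi-word tokens are
# expanded: the expanded words no longer concatenate back to the text.
text = "Er geht im Park"                        # original: "im" is one token
expanded = ["Er", "geht", "in", "dem", "Park"]  # MWT-expanded words

# Naively rebuilding offsets from the expanded words (one space
# between words) drifts relative to the original text:
offset, spans = 0, []
for w in expanded:
    spans.append((offset, offset + len(w)))
    offset += len(w) + 1

# "Park" starts at 11 in the original text ...
assert text.index("Park") == 11
# ... but the reconstruction places it at (15, 19), because "im"
# (2 chars) became "in dem" (6 chars). NER spans computed on one
# layer can't be copied onto the other without realignment.
assert spans[-1] == (15, 19)
```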
I'm not sure there's currently a good workaround involving preprocessing. If you only need NER annotation, you could try a pipeline with only tokenize and ner, but I'm not sure whether the ner component depends on the mwt output or not. It might fail to run, run with degraded performance, or be totally fine; from a quick look at the docs and the code, I'm not sure which.
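If you want to experiment with that, the restricted pipeline would be configured roughly like this (an untested configuration sketch; whether the ner model runs without the mwt processor is exactly the open question above, and the language code and example sentence are assumptions):

```python
import stanza

# Sketch: restrict the Stanza pipeline to tokenization + NER only.
# This may raise, run with degraded performance, or work fine,
# depending on whether the ner processor needs mwt output.
nlp = stanza.Pipeline(lang="de", processors="tokenize,ner")
doc = nlp("Angela Merkel war im Bundestag.")
for ent in doc.ents:
    print(ent.text, ent.type)
```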
The stanza Document objects do support both layers of annotation, so for now you might consider using stanza directly?
I think it's fine to leave it open. It's not going to be a high priority for us to work on right now, but since I think it should be possible to improve this part of the alignment, this will remind us in the future.