
Comments (4)

Eorg90 commented on June 22, 2024

If you still have an issue with this, or for anyone coming after me with the same issue: I was able to fix this "quick and dirty".
In the "get_tokens_with_heads" method of the "StanzaTokenizer" class (tokenizer.py, line 194) I added the following lines to the for loop that iterates over the tokens of the sentence, before the inner loop over the words of each token:

    def get_tokens_with_heads(self, snlp_doc):
        """Flatten the tokens in the Stanza Doc and extract the token indices
        of the sentence start tokens to set is_sent_start.

        snlp_doc (stanza.Document): The processed Stanza doc.
        RETURNS (list): The tokens (words).
        """
        tokens = []
        heads = []
        offset = 0
        for sentence in snlp_doc.sentences:
            for token in sentence.tokens:
                # insert this
                if len(token.words) > 1:
                    # Emit a single entry for the whole multi-word token,
                    # with a head value of 0 for the merged token.
                    heads.append(0)
                    # Start from the first word's annotation, but keep the
                    # token's surface text and join the expanded words' texts
                    # as the lemma.
                    word = token.words[0].to_dict()
                    word["text"] = token.text
                    word["lemma"] = " ".join([w.text for w in token.words])
                    # Word is stanza's word class (stanza.models.common.doc.Word);
                    # make sure it is imported in tokenizer.py.
                    tokens.append(Word(word))
                    continue
                # end of insertion
                for word in token.words:
                    # ... rest of the original method is unchanged

This generates only one word for each multi-word token. The text is the surface text of the token, the lemma is the concatenation of the words' texts, and the type I fixed to that of the first word in the multi-word token. This solution mainly has the cases of "am, vom, ins, zum" in mind, which are always contractions of a preposition and an article. Since I consider the article to carry less information than the preposition, I "copied" the preposition's annotation and overwrote its "text" and "lemma" attributes.
This is certainly not complete, but it was very quick to implement. My code now runs without errors and collects all the information I need.
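
A quick sanity check of the patched behavior could look like this (just a sketch; it assumes spacy-stanza 1.x, a downloaded German stanza model, and that the patched tokenizer.py is the one being loaded — the exact lemma/POS output depends on the model):

    import stanza
    import spacy_stanza

    stanza.download("de")  # one-time model download
    nlp = spacy_stanza.load_pipeline("de")

    doc = nlp("Er geht zum Arzt.")
    for t in doc:
        # With the patch, "zum" should stay a single token (lemma "zu dem")
        # instead of being expanded into "zu" + "dem".
        print(t.text, t.lemma_, t.pos_)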


adrianeboyd commented on June 22, 2024

The background is that we originally developed spacy-stanza before NER components were added, so we focused on providing access to the morpho-syntactic annotation, which is annotated on the expanded multi-word tokens rather than on the original text tokens. Since a spacy Doc can only represent one layer of tokenization, we use the expanded multi-word tokens in the returned Doc.
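
To make the two layers concrete, here is a small sketch using stanza directly (German example; output is illustrative):

    import stanza

    nlp = stanza.Pipeline("de", processors="tokenize,mwt,pos")
    doc = nlp("Er geht zum Arzt.")
    for token in doc.sentences[0].tokens:
        # token.text is the surface token, token.words are the expanded parts,
        # e.g. "zum" -> ["zu", "dem"]. The Doc returned by spacy-stanza contains
        # only the words layer.
        print(token.text, [w.text for w in token.words])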

We can't "simply proceed" because the code currently uses the character offsets to add the NER annotation, so if they don't align with the text anymore, it's not trivial to add the annotation to the doc. I think it should be possible to use information from the Document to align the annotations, but it would require some updates to the alignment algorithm in spacy-stanza. (If this is something you'd like to work on, PRs are welcome!)

I'm not sure there's currently a good workaround involving preprocessing. If you only need NER annotation, you could try a pipeline with only tokenize and ner, but I'm not sure whether the ner component depends on the mwt output or not. It's possible that it would fail to run, run with degraded performance, or be totally fine. From a quick look at the docs and the code, I'm not sure which one it would be.
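
Setting up that reduced pipeline would look roughly like this (a sketch; whether the ner model actually loads and behaves well without mwt is exactly the question above):

    import stanza

    # Only tokenization + NER, skipping mwt; this may fail or degrade depending
    # on how the ner model for the language was trained.
    nlp = stanza.Pipeline("de", processors="tokenize,ner")
    doc = nlp("Angela Merkel besuchte Berlin.")
    for ent in doc.ents:
        print(ent.text, ent.type)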

The stanza Document objects do support both layers of annotation, so for now you might consider using stanza directly?
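
For reference, a sketch of reading both layers from one stanza Document (attribute names as in the stanza docs; output illustrative):

    import stanza

    nlp = stanza.Pipeline("de")  # default processors, including mwt (and ner if available)
    doc = nlp("Angela Merkel ging zum Brandenburger Tor.")

    # Entities annotated on the original tokens
    for ent in doc.ents:
        print("ENT ", ent.text, ent.type)

    # Expanded words with morpho-syntactic annotation
    for word in doc.sentences[0].words:
        print("WORD", word.text, word.lemma, word.upos, word.deprel)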


flipz357 commented on June 22, 2024


adrianeboyd commented on June 22, 2024

I think it's fine to leave it open. It's not going to be a high priority for us to work on right now, but since I think it should be possible to improve this part of the alignment, this will remind us in the future.

