
spacy-stanza's Introduction

spaCy + Stanza (formerly StanfordNLP)

This package wraps the Stanza (formerly StanfordNLP) library, so you can use Stanford's models in a spaCy pipeline. The Stanford models achieved top accuracy in the CoNLL 2017 and 2018 shared tasks, which involve tokenization, part-of-speech tagging, morphological analysis, lemmatization and labeled dependency parsing in 68 languages. As of v1.0, Stanza also supports named entity recognition for selected languages.

⚠️ Previous versions of this package were available as spacy-stanfordnlp.


Using this wrapper, you'll be able to use the following annotations, computed by your pretrained stanza model:

  • Statistical tokenization (reflected in the Doc and its tokens)
  • Lemmatization (token.lemma and token.lemma_)
  • Part-of-speech tagging (token.tag, token.tag_, token.pos, token.pos_)
  • Morphological analysis (token.morph)
  • Dependency parsing (token.dep, token.dep_, token.head)
  • Named entity recognition (doc.ents, token.ent_type, token.ent_type_, token.ent_iob, token.ent_iob_)
  • Sentence segmentation (doc.sents)

️️️⌛️ Installation

As of v1.0.0, spacy-stanza is only compatible with spaCy v3.x. To install the most recent version:

pip install spacy-stanza

For spaCy v2, install v0.2.x and refer to the v0.2.x usage documentation:

pip install "spacy-stanza<0.3.0"

Make sure to also download one of the pre-trained Stanza models.

📖 Usage & Examples

⚠️ Important note: This package has been refactored to take advantage of spaCy v3.0. Previous versions that were built for spaCy v2.x worked considerably differently. Please see previous tagged versions of this README for documentation on prior versions.

Use spacy_stanza.load_pipeline() to create an nlp object that you can use to process a text with a Stanza pipeline and create a spaCy Doc object. By default, both the spaCy pipeline and the Stanza pipeline will be initialized with the same lang, e.g. "en":

import stanza
import spacy_stanza

# Download the stanza model if necessary
stanza.download("en")

# Initialize the pipeline
nlp = spacy_stanza.load_pipeline("en")

doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_, token.ent_type_)
print(doc.ents)

If language data for the given language is available in spaCy, the respective language class can be used as the base for the nlp object – for example, English(). This lets you use spaCy's lexical attributes like is_stop or like_num. The nlp object follows the same API as any other spaCy Language class – so you can visualize the Doc objects with displaCy, add custom components to the pipeline, use the rule-based matcher and do pretty much anything else you'd normally do in spaCy.

# Access spaCy's lexical attributes
print([token.is_stop for token in doc])
print([token.like_num for token in doc])

# Visualize dependencies
from spacy import displacy
displacy.serve(doc)  # or displacy.render if you're in a Jupyter notebook

# Process texts with nlp.pipe
for doc in nlp.pipe(["Lots of texts", "Even more texts", "..."]):
    print(doc.text)

# Combine with your own custom pipeline components
from spacy import Language
@Language.component("custom_component")
def custom_component(doc):
    # Do something to the doc here
    print(f"Custom component called: {doc.text}")
    return doc

nlp.add_pipe("custom_component")
doc = nlp("Some text")

# Serialize attributes to a numpy array
np_array = doc.to_array(['ORTH', 'LEMMA', 'POS'])

Stanza Pipeline options

Additional options for the Stanza Pipeline can be provided as keyword arguments following the Pipeline API:

  • Provide the Stanza language as lang. For Stanza languages without spaCy support, use "xx" for the spaCy language setting:

    # Initialize a pipeline for Coptic
    nlp = spacy_stanza.load_pipeline("xx", lang="cop")
  • Provide Stanza pipeline settings following the Pipeline API:

    # Initialize a German pipeline with the `hdt` package
    nlp = spacy_stanza.load_pipeline("de", package="hdt")
  • Tokenize with spaCy rather than the statistical tokenizer (only for English):

    nlp = spacy_stanza.load_pipeline("en", processors= {"tokenize": "spacy"})
  • Provide any additional processor settings as additional keyword arguments:

    # Provide pretokenized texts (whitespace tokenization)
    nlp = spacy_stanza.load_pipeline("de", tokenize_pretokenized=True)

The spaCy config specifies all Pipeline options in the [nlp.tokenizer] block. For example, the config for the last example above, a German pipeline with pretokenized texts:

[nlp.tokenizer]
@tokenizers = "spacy_stanza.PipelineAsTokenizer.v1"
lang = "de"
dir = null
package = "default"
logging_level = null
verbose = null
use_gpu = true

[nlp.tokenizer.kwargs]
tokenize_pretokenized = true

[nlp.tokenizer.processors]

Serialization

The full Stanza pipeline configuration is stored in the spaCy pipeline config, so you can save and load the pipeline just like any other nlp pipeline:

# Save to a local directory
nlp.to_disk("./stanza-spacy-model")

# Reload the pipeline
nlp = spacy.load("./stanza-spacy-model")

Note that this does not save any Stanza model data by default. The Stanza models are very large, so for now, this package expects you to download the models separately with stanza.download() and have them available either in the default model directory or in the path specified under [nlp.tokenizer.dir] in the config.
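For example, here is a minimal sketch of keeping the models in a custom directory (the path is just an illustration) and pointing both the download and the pipeline at it, assuming the standard dir keyword arguments of stanza.download() and load_pipeline():

import stanza
import spacy_stanza

# Download the German models into a custom directory (illustrative path)
stanza.download("de", dir="./stanza_resources")

# Use the same directory for the pipeline; this value is stored in [nlp.tokenizer.dir]
nlp = spacy_stanza.load_pipeline("de", dir="./stanza_resources")
nlp.to_disk("./stanza-spacy-model")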

Adding additional spaCy pipeline components

By default, the spaCy pipeline in the nlp object returned by spacy_stanza.load_pipeline() will be empty, because all stanza attributes are computed and set within the custom tokenizer, StanzaTokenizer. But since it's a regular nlp object, you can add your own components to the pipeline. For example, you could add your own custom text classification component with nlp.add_pipe("textcat", source=source_nlp), or augment the named entities with your own rule-based patterns using the EntityRuler component.
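For example, a minimal sketch of augmenting the Stanza entities with an EntityRuler (the pattern below is purely illustrative):

import stanza
import spacy_stanza

stanza.download("en")
nlp = spacy_stanza.load_pipeline("en")

# Add an entity ruler on top of the otherwise empty pipeline; the pattern is illustrative
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Stanford NLP Group"}])

doc = nlp("The Stanford NLP Group maintains Stanza.")
print([(ent.text, ent.label_) for ent in doc.ents])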

spacy-stanza's People

Contributors

adrianeboyd, buhrmann, honnibal, ines, michael-k, sergeyshk, yuhaozhang


spacy-stanza's Issues

Unknown morphological feature: 'ConjType'

When I run nlp(comment) for the Urdu language, I am getting this error:
[E167] Unknown morphological feature: 'ConjType' (9141427322507498425). This can happen if the tagger was trained with a different set of morphological features. If you're using a pretrained model, make sure that your models are up to date: python -m spacy validate
Some of the docs work while some don't.

To Reproduce
The following code gets tokens and POS tags:

snlp = stanza.Pipeline(lang='ur') 
nlp = StanzaLanguage(snlp) 
doc = nlp('یہ سرد اور تلخ تھا')

Windows and CentOs
Python3.8
Stanza version: 1.0.0

Latest tag is not released on PyPI

Hey there,

I noticed that the latest tag v0.2.4 is not yet available on PyPI.

This new release contains a critical bug fix, so I was wondering when it will be available?

Thanks

Error: tokenizer

Versions:
spacy-stanza 0.2.4
stanza 1.1.1

Description: the following string throws an error in the tokenizer: "?\n"
How to reproduce the error:

import stanza
from spacy_stanza import StanzaLanguage
# tried on 5 different languages with same result
snlp = stanza.Pipeline(lang="en", processors='tokenize')
nlp = StanzaLanguage(snlp)
nlp('?\n')

Update:
Any given character followed by a newline '\n' and no other character produces the same error.
eg.:
nlp("example\n") ->error
nlp("example2\n ") -> error
nlp("example\nend") -> runs

Update 2:
A character followed by two spaces also produces the same error; for some reason, special characters work this way.
eg.:
nlp("example ") ->error
nlp("example2 ") -> error
nlp("\n ") -> runs
nlp("\t ") -> runs

Error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-67-48883c9275df> in <module>
----> 1 nlp('?\n')

~/miniconda3/envs/gudgyo/lib/python3.7/site-packages/spacy/language.py in __call__(self, text, disable, component_cfg)
    439                 Errors.E088.format(length=len(text), max_length=self.max_length)
    440             )
--> 441         doc = self.make_doc(text)
    442         if component_cfg is None:
    443             component_cfg = {}

~/miniconda3/envs/gudgyo/lib/python3.7/site-packages/spacy_stanza/language.py in make_doc(self, text)
     65         these will be mapped to token vectors.
     66         """
---> 67         doc = self.tokenizer(text)
     68         if self.svecs is not None:
     69             doc.user_token_hooks["vector"] = self.token_vector

~/miniconda3/envs/gudgyo/lib/python3.7/site-packages/spacy_stanza/language.py in __call__(self, text)
    165         offset = 0
    166         for i, word in enumerate(words):
--> 167             if word.isspace() and word != snlp_tokens[i + offset].text:
    168                 # insert a space token
    169                 pos.append(self.vocab.strings.add("SPACE"))

IndexError: list index out of range

Update 3:
This particular string produces an error (language bulgarian, gpu True, all processors used):
"Думи и срички: Горско училище ......................9 Буквен етап • "

Error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-70-9dbeed0db463> in <module>
----> 1 nlp('Думи и срички: Горско училище ......................9 Буквен етап • ')

~/miniconda3/envs/gudgyo/lib/python3.7/site-packages/spacy/language.py in __call__(self, text, disable, component_cfg)
    439                 Errors.E088.format(length=len(text), max_length=self.max_length)
    440             )
--> 441         doc = self.make_doc(text)
    442         if component_cfg is None:
    443             component_cfg = {}

~/miniconda3/envs/gudgyo/lib/python3.7/site-packages/spacy_stanza/language.py in make_doc(self, text)
     65         these will be mapped to token vectors.
     66         """
---> 67         doc = self.tokenizer(text)
     68         if self.svecs is not None:
     69             doc.user_token_hooks["vector"] = self.token_vector

~/miniconda3/envs/gudgyo/lib/python3.7/site-packages/spacy_stanza/language.py in __call__(self, text)
    139             return Doc(self.vocab, words=[text], spaces=[False])
    140 
--> 141         snlp_doc = self.snlp(text)
    142         text = snlp_doc.text
    143         snlp_tokens, snlp_heads = self.get_tokens_with_heads(snlp_doc)

~/miniconda3/envs/gudgyo/lib/python3.7/site-packages/stanza/pipeline/core.py in __call__(self, doc)
    164         assert any([isinstance(doc, str), isinstance(doc, list),
    165                     isinstance(doc, Document)]), 'input should be either str, list or Document'
--> 166         doc = self.process(doc)
    167         return doc
    168 

~/miniconda3/envs/gudgyo/lib/python3.7/site-packages/stanza/pipeline/core.py in process(self, doc)
    158         for processor_name in PIPELINE_NAMES:
    159             if self.processors.get(processor_name):
--> 160                 doc = self.processors[processor_name].process(doc)
    161         return doc
    162 

~/miniconda3/envs/gudgyo/lib/python3.7/site-packages/stanza/pipeline/depparse_processor.py in process(self, document)
     46         # build dependencies based on predictions
     47         for sentence in batch.doc.sentences:
---> 48             sentence.build_dependencies()
     49         return batch.doc

~/miniconda3/envs/gudgyo/lib/python3.7/site-packages/stanza/models/common/doc.py in build_dependencies(self)
    479                 # id is index in words list + 1
    480                 head = self.words[word.head - 1]
--> 481                 assert(word.head == head.id)
    482             self.dependencies.append((head, word.deprel, word))
    483 

AssertionError: 

However after deleting a single dot from the string, we get the following warning instead of the error:

/home/gudmongyorgy/miniconda3/envs/gudgyo/lib/python3.7/site-packages/ipykernel_launcher.py:1: UserWarning: Due to multiword token expansion or an alignment issue, the original text has been replaced by space-separated expanded tokens.
  """Entry point for launching an IPython kernel.
/home/gudmongyorgy/miniconda3/envs/gudgyo/lib/python3.7/site-packages/ipykernel_launcher.py:1: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['Думи', 'и', 'срички', ':', 'Горско', 'училище', '......', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '...9', 'Буквен', 'етап', '•']
Entities: []
  """Entry point for launching an IPython kernel.

With the tokenized output:
"Думи и срички : Горско училище ...... . . . . . . . . . . ...9 Буквен етап •"

User warnings slow down parsing

I am parsing a big corpus that takes days to index. It is an Arabic corpus, so I need spacy-stanza.
I have noticed that for each sentence I parse it prints UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer
This makes the parsing a lot slower. I suggest removing these warnings.
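One possible workaround until the warnings are configurable, assuming Python's standard warnings module behaves as usual here, is to filter them out before parsing:

import warnings

# Suppress the per-sentence NER alignment warning by matching the start of its message
warnings.filterwarnings(
    "ignore",
    message="Can't set named entities because of multi-word token expansion",
)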

Which model is actually working?

For this example, it seems the pipeline contains NER models from both StanfordNLP and spaCy. How do I know which model is actually producing the results? Does the spaCy model overwrite the StanfordNLP model because of nlp.add_pipe(ner)?

snlp = stanfordnlp.Pipeline(lang="en", models_dir="./models")
nlp = StanfordNLPLanguage(snlp)

# Load spaCy's pre-trained en_core_web_sm model, get the entity recognizer and
# add it to the StanfordNLP model's pipeline
spacy_model = spacy.load("en_core_web_sm")
ner = spacy_model.get_pipe("ner")
nlp.add_pipe(ner)

doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Barack Obama', 'PERSON'), ('Hawaii', 'GPE'), ('2008', 'DATE')]

ImportError: cannot import name 'StanzaLanguage' from partially initialized module 'spacy_stanza'

Traceback (most recent call last):
File "spacer.py", line 2, in
from spacy_stanza import StanzaLanguage
File "/Users/lex/.pyenv/versions/3.8.3/lib/python3.8/site-packages/spacy_stanza/init.py", line 2, in
from .language import StanzaLanguage
File "/Users/lex/.pyenv/versions/3.8.3/lib/python3.8/site-packages/spacy_stanza/language.py", line 2, in
from spacy.symbols import POS, TAG, DEP, LEMMA, HEAD
File "/Users/lex/Desktop/spacy.py", line 2, in
from spacy_stanza import StanzaLanguage
ImportError: cannot import name 'StanzaLanguage' from partially initialized module 'spacy_stanza' (most likely due to a circular import) (/Users/lex/.pyenv/versions/3.8.3/lib/python3.8/site-packages/spacy_stanza/__init__.py)

Tokenization fails in german model for sentences containing contractions

  • Spacy version: 2.1.4
  • spacy-stanfordnlp version: 0.1.1

I downloaded and loaded the German model as described in the docs.

import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage

stanfordnlp.download('de')
snlp = stanfordnlp.Pipeline(lang="de")
stanford_nlp = StanfordNLPLanguage(snlp)

When parsing German contractions such as "im", "am", "zum" etc., I noticed weird behavior. The first time around everything is fine, but on consecutive parses, some tokens are duplicated and some omitted. It's best to look at an example.

stanford_nlp('Das zum Beispiel funktioniert nicht.')
# Das zu dem Beispiel funktioniert nicht .

stanford_nlp('Das zum Beispiel funktioniert nicht.')
# Das zu dem zu dem Beispiel funktioniert nicht 

stanford_nlp('Das zum Beispiel funktioniert nicht.')
# Das zu dem zu dem zu dem Beispiel funktioniert 

stanford_nlp('Das zum Beispiel funktioniert nicht.')
# Das zu dem zu dem zu dem zu dem Beispiel funktioniert 

stanford_nlp('Das zum Beispiel funktioniert nicht.')
# Das zu dem zu dem zu dem zu dem zu dem Beispiel 

stanford_nlp('Das zum Beispiel funktioniert nicht.')
# Das zu dem zu dem zu dem zu dem zu dem zu dem Beispiel

stanford_nlp('Das zum Beispiel funktioniert nicht.')
# Das zu dem zu dem zu dem zu dem zu dem zu dem zu 

stanford_nlp('Das zum Beispiel funktioniert nicht.')
# Das zu dem zu dem zu dem zu dem zu dem zu dem zu 

Parsing other contractions afterwards also doesn't work.

stanford_nlp('Am Anfang wäre das ok.')
# zu dem zu dem zu dem zu dem 

But parsing other sentences with no contractions is alright.

stanford_nlp('Dabei hat man keine Schwierigkeiten.')
# Dabei hat man keine Schwierigkeiten.

Here is a list of German contractions. However, it doesn't break for all of them, as some are not split into separate tokens.

P.S.: It shouldn't really matter, but I'm running this in a jupyter notebook.

Token idx in text with multiple spaces

I'm parsing texts with the "en_ewt" model (the default stanfordnlp English model). My program looks like this:

import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage
pipe = stanfordnlp.Pipeline(lang='en')
sp = StanfordNLPLanguage(pipe)

text = 'In a hole in the ground there lived a hobbit.'
doc = sp(text)
for token in doc:
    print(token.text, token.idx)

So the text is "In a hole in the ground there lived a hobbit.", I want to get each token's text and idx, and the results are:
In 0
a 3
hole 5
in 10
etc.

That's OK. Now, I put several spaces in the text, e.g. "In(5 spaces)a(5 spaces)hole in the ground there lived a hobbit." (5 spaces between "In", "a" & "hole"). I expect the tokens' idx values to be 0, 7, 13, 18, etc., but they are still 0, 3, 5, 10.

So as I understand it, the model counts several spaces as one space no matter how many there really are. Is there any way to tell the Stanford models to count spaces as is?

A native spaCy model (like en_core_web_sm) works well in this case, but I'd like to work with the Stanford models.

Thank you.

Stuck for Greek and Hebrew examples

I am using the following code to get results for Greek and Hebrew. It gives no results; it looks like it gets stuck on these inputs.

Greek

snlp = stanfordnlp.Pipeline(lang="el")
nlp = StanfordNLPLanguage(snlp)
doc = nlp("συνεπεια στο ραντεβου")

Hebrew

snlp = stanfordnlp.Pipeline(lang="he")
nlp = StanfordNLPLanguage(snlp)
doc = nlp("השירות")

Can you tell me why this is happening? How can I avoid it?

Add pretrained word vectors

I think all StanfordNLP models come with pretrained word vectors, and (if I interpret their code correctly), they're available via either the pos model as:

unit_id = snlp.processors['pos'].pretrain.vocab._unit2id['spacy'] 
unit_vec = snlp.processors['pos'].pretrain.emb[unit_id]

or

unit_vec = snlp.processors['depparse'].pretrain.emb[unit_id]

Would it be possible to add those vectors as token attributes?

If you'd like I could try to implement it in a PR...

SPACE is not UPOS

Hey,
First of all, thanks for the great job!
I am currently using Stanza via spaCy for a small annotation projection project. However, while integrating it I realized that spacy-stanza uses a custom Universal POS tag. I guess it's a bit against the idea of Universal POS tags, and it makes my life harder since I need another pass to filter those tags out.
My questions are: Is there any reason why this wrapper does not filter them out? Is there any possible solution/workaround/filter to overcome this?
Thanks for your time!
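One possible workaround on the spaCy side, assuming you only need to drop the whitespace tokens after parsing, is to filter them out via token.pos_ or token.is_space:

# nlp is a loaded spacy-stanza pipeline
doc = nlp("A sentence   with   extra   spaces.")
content_tokens = [t for t in doc if t.pos_ != "SPACE" and not t.is_space]
print([t.text for t in content_tokens])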

Invalid parse tree state

Hi, it seems that in some cases when using StanfordNLP models the result is an invalid parse tree state. When trying to merge certain spans, I get RuntimeError: [E039] Array bounds exceeded while searching for root word. This likely means the parse tree is in an invalid state.

Here is a reproducible example (at least for my installation), failing when trying to merge an emoji:

import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage

# stanfordnlp.download('ca')
snlp = stanfordnlp.Pipeline(lang='ca')
ca = StanfordNLPLanguage(snlp)

txt = "🙅🚫 Els comentaris i els gestos ofensius o els tocaments indesitjats són violència masclista.  💬 Si vius alguna d’aquestes situacions, denuncia-ho a @mossos i informa’ns a través de l’app #BCNantimasclista per prevenir-les. 💪 #JuntesSomMésFortes!  ℹ️ https://t.co/gOnPU9vgdt https://t.co/qtntxX97Ih"

doc = ca(txt)
print(list(doc))
doc[16:17].merge()

This doesn't seem to happen with a regular Spacy language (the tokenization is slightly different, but merging spans including the same emoji works here):

import spacy
en = spacy.load('en')
doc = en(txt)
print(list(doc))
print(list(doc[16:18]))
doc[17:18].merge()
doc[16:18].merge()
print(list(doc))

Can't use the tokenizer only

Hi, thank you for your great work, it's very helpful.

I encountered a problem. When using spaCy, if I just need to tokenize a sentence without other lexical features, I can use nlp.tokenizer to avoid spending time in the other pipes.

When using spacy-stanza, I tried to do it this way, and it seems that the entire pipeline is still running.
(screenshot)

Furthermore, I tried to print nlp.pipeline and it is an empty list, so I can't remove pipes.
(screenshot)

This problem is quite confusing to me; I hope to solve it and look forward to your reply.
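A possible workaround, assuming a tokenize-only Stanza pipeline is acceptable for your use case, is to restrict the Stanza processors so the heavier components never run:

import stanza
from spacy_stanza import StanzaLanguage

# Only load the Stanza tokenizer; tagging, lemmatization and parsing are skipped entirely
snlp = stanza.Pipeline(lang="en", processors="tokenize")
nlp = StanzaLanguage(snlp)

doc = nlp("Just tokenize this sentence.")
print([token.text for token in doc])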

Build Vocab

Great work. What about building the vocabulary, or at least word vectors, for these languages?
Thanks

ImportError: cannot import name 'hash_unicode' from 'murmurhash'

import spacy_stanza gives me this error:


Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "E:\WPy-3710\python-3.7.1.amd64\lib\site-packages\spacy_stanza\__init__.py", line 2, in <module>
    from spacy import blank, Language
  File "E:\WPy-3710\python-3.7.1.amd64\lib\site-packages\spacy\__init__.py", line 10, in <module>
    from thinc.api import prefer_gpu, require_gpu, require_cpu  # noqa: F401
  File "E:\WPy-3710\python-3.7.1.amd64\lib\site-packages\thinc\api.py", line 22, in <module>
    from .layers import Dropout, Embed, expand_window, HashEmbed, LayerNorm, Linear
  File "E:\WPy-3710\python-3.7.1.amd64\lib\site-packages\thinc\layers\__init__.py", line 53, in <module>
    from .strings2arrays import strings2arrays
  File "E:\WPy-3710\python-3.7.1.amd64\lib\site-packages\thinc\layers\strings2arrays.py", line 2, in <module>
    from murmurhash import hash_unicode
ImportError: cannot import name 'hash_unicode' from 'murmurhash' (E:\WPy-3710\python-3.7.1.amd64\lib\site-packages\murmurhash\__init__.py)

I am using Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 14:57:15) [MSC v.1915 64 bit (AMD64)] on win32, which is WinPython on Windows 10 64 bit.

Support for Spacy 3

The current latest release works with spaCy 2.3. Is there a release planned soon that supports spaCy 3?

Error when Tokenizer used

I am seeing the following exception when I use the Tokenizer as shown below.

If I remove the matcher part, there is no exception, but there is also no POS tagging.

I have tried to find info about the exception, but the only thing I could find is this one, and it is unrelated: explosion/spaCy#4100

Is there any other pipeline configuration I have to use related to the Tokenizer?
I could not see any in the documentation.

By the way, I wanted to try add_special_case, but I guess the wrapper does not support it: "AttributeError: 'Tokenizer' object has no attribute 'add_special_case'"

Traceback (most recent call last):
  File "c:/x/_dev/_temp/pro/playground/spacydemo/a.py", line 41, in <module>
    matches = matcher(doc)
  File "matcher.pyx", line 224, in spacy.matcher.matcher.Matcher.__call__
ValueError: [E155] The pipeline needs to include a tagger in order to use Matcher or PhraseMatcher with the attributes POS, TAG, or LEMMA. Try using nlp() instead of nlp.make_doc() or list(nlp.pipe()) instead of list(nlp.tokenizer.pipe()).
import logging
import re
import stanfordnlp
from spacy.matcher import Matcher
from spacy_stanfordnlp import StanfordNLPLanguage
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex

processors = 'tokenize,pos,lemma'
config = {
    #'tokenize_pretokenized': True, #!!!  the text will be interpreted as already tokenized on white space and sentence split by newlines.
    'processors': processors,  # mwt, depparse
    'lang': 'en',  # Language code for the language to build the Pipeline in
}
snlp = stanfordnlp.Pipeline(**config)
nlp = StanfordNLPLanguage(snlp)


def custom_tokenizer(nlp):
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp.tokenizer = custom_tokenizer(nlp)

text = "Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.'"

matcher = Matcher(nlp.vocab)
matcher.add("COUNTRY", None, *[
    [{'LEMMA': 'practice'}],
])

doc = nlp(text)

matches = matcher(doc)
for (match_id, start, end) in matches:
	label = doc.vocab.strings[match_id]
	print(label, start, end, doc[start:end])

print(doc)
for token in doc:
	#print("\t\t", token.text, "\t\t", token.lemma_, "\t\t", token.tag_, "\t\t", token.pos_)
	print(f"\t {token.text:{20}} - {token.lemma_:{15}} - {token.tag_:{5}} - {token.pos_:{5}}")

Speed

spacy-stanza is much slower than Stanza alone.

Use sentencizer with stanfordnlp

Right now spacy-stanfordnlp is taking care of the tokenization too. Would it be possible to use spaCy's sentencizer and keep stanfordnlp just for tagging and parsing?

I can only think of running two pipelines, the first one only using the sentencizer and the second one using stanfordnlp.Pipeline. I would have double tokenization, and probably a performance penalty.

I've gone through the docs and looked at the source code but can't find a proper way to do it.

Including stanza support?

StanfordNLP bumped the version of their repository and renamed it to Stanza. Are there any plans to make a spacy-stanza, or to update this repository to include Stanza support? I don't think their API has changed too much.

ValueError: [E167] Unknown morphological feature: 'Person' for Polish

I've successfully run the spacy-stanza example for English. However, I can't get it working with Polish.

import stanza
from spacy_stanza import StanzaLanguage
stanza.download('pl')
snlp = stanza.Pipeline(lang='pl') 
nlp = StanzaLanguage(snlp) 
doc = nlp('Proste zdanie') # "Simple sentence"

The above works; however, many others fail:

doc = nlp('To jest błąd') # "This is an error"
Traceback (most recent call last):
..s/spacy_stanza/language.py", line 205, in __call__
    doc = Doc(self.vocab, words=words, spaces=spaces).from_array(attrs, array)
  File "doc.pyx", line 830, in spacy.tokens.doc.Doc.from_array
  File "morphology.pyx", line 286, in spacy.morphology.Morphology.assign_tag
  File "morphology.pyx", line 315, in spacy.morphology.Morphology.assign_tag_id
  File "morphology.pyx", line 203, in spacy.morphology.Morphology.add
ValueError: [E167] Unknown morphological feature: 'Person' (2313063860588076218). This can happen if the tagger was trained with a different set of morphological features. If you're using a pretrained model, make sure that your models are up to date:
python -m spacy validate

Is this because there is no "NER" processor for Polish in Stanza? Is there any easy fix to make it work?

Takes too long to parse doc results

Hello,
It takes too long to parse the doc object, i.e. to iterate over sentences and the tokens in them. Is that expected?

snlp = stanfordnlp.Pipeline(processors='tokenize,pos', models_dir=model_dir)
nlp = StanfordNLPLanguage(snlp)

for line in lines:
    doc = nlp.pipe([line])

The above code takes a few milliseconds (apart from initialisation) to run over 500 sentences,

snlp = stanfordnlp.Pipeline(processors='tokenize,pos', models_dir=model_dir)
nlp = StanfordNLPLanguage(snlp)

for line in lines:
    doc = nlp.pipe([line])
    token_details = []

    for sents in doc:
        for tok in sents:
            token_details.append([tok.text, tok.lemma_, tok.pos_])

while this takes almost a minute (apart from initialisation) to run over 500 sentences.

P.S.: I have put nlp.pipe() inside a for loop intentionally to get all tokens for one sentence even though it gets segmented.

token.idx gives an incorrect index depending on other tokens in the doc

Hi, found a bug: when tokens like "2 999" or "4 500" contain whitespace, token.idx gives an incorrect position:

>>> import stanza
>>> from spacy_stanza import StanzaLanguage
>>> stanza.download('ru')
>>> snlp = stanza.Pipeline(lang="ru")
>>> nlp = StanzaLanguage(snlp)
>>> text = "ведь первый месяц это минимум 2 999 рублей за устройство, плюс 4 500 на фурнитуру."
>>> doc = nlp(text)
>>> for token in doc: 
...     print(f"{token.text + ' '*(10-len(token.text))}  token.idx: {token.idx}  text.index(token.text): {text.index(token.text)}")
... 
ведь        token.idx: 0  text.index(token.text): 0
первый      token.idx: 5  text.index(token.text): 5
месяц       token.idx: 12  text.index(token.text): 12
это         token.idx: 18  text.index(token.text): 18
минимум     token.idx: 22  text.index(token.text): 22
2 999       token.idx: 30  text.index(token.text): 30
рублей      token.idx: 36  text.index(token.text): 36
за          token.idx: 43  text.index(token.text): 43
устройство  token.idx: 46  text.index(token.text): 46
,           token.idx: 57  text.index(token.text): 56     <---------------- there
плюс        token.idx: 59  text.index(token.text): 58
4 500       token.idx: 64  text.index(token.text): 63
на          token.idx: 70  text.index(token.text): 69
фурнитуру   token.idx: 73  text.index(token.text): 72
.           token.idx: 83  text.index(token.text): 81

In the case with no spaces:

>>> text = "ведь первый месяц это минимум 2999 рублей за устройство, плюс 4500 на фурнитуру."
>>> doc = nlp(text)
>>> for token in doc: 
...     print(f"{token.text + ' '*(10-len(token.text))}  token.idx: {token.idx}  text.index(token.text): {text.index(token.text)}")
... 
ведь        token.idx: 0  text.index(token.text): 0
первый      token.idx: 5  text.index(token.text): 5
месяц       token.idx: 12  text.index(token.text): 12
это         token.idx: 18  text.index(token.text): 18
минимум     token.idx: 22  text.index(token.text): 22
2999        token.idx: 30  text.index(token.text): 30
рублей      token.idx: 35  text.index(token.text): 35
за          token.idx: 42  text.index(token.text): 42
устройство  token.idx: 45  text.index(token.text): 45
,           token.idx: 55  text.index(token.text): 55
плюс        token.idx: 57  text.index(token.text): 57
4500        token.idx: 62  text.index(token.text): 62
на          token.idx: 67  text.index(token.text): 67
фурнитуру   token.idx: 70  text.index(token.text): 70
.           token.idx: 79  text.index(token.text): 79

Offset misalignment in NER StanzaLanguage Tokenizer

text = """ Tobacco/Smoke Exposure Family members smoke indoors, Daily. Caffeine use Coffee,"""
doc = snlp(text)
print([(e.text, e.label_, text[e.start_char:e.end_char]) for e in doc.ents])

Gives the output:

UserWarning: Can't set named entities because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Ents: [('Caffeine use', 'disease', 61, 73)]
doc = snlp(text)
[]

On printing the two texts, i.e. snlp_doc.text and doc.text, I get the following:

snlp_doc.text = " Tobacco/Smoke Exposure Family members smoke indoors, Daily. Caffeine use Coffee,"
doc.text =   "  Tobacco / Smoke Exposure Family members smoke indoors , Daily . Caffeine use Coffee ,"

Because of this, the above error occurs and we lose the identified entities.
This happens even with the basic config mentioned in the readme:

(screenshot)

Access to named entities index at token level

I was wondering whether it is possible to access named entity indices at the token level, for example:
"Barack Obama was born in Hawaii."
NE = Barack Obama
NE_start : 0
NE_end : 2
I'm working on a project and need the start and end index of each named entity in a given sentence; spaCy provides entity indices at the token level (but does not provide named entity recognition at the sentence level), while Stanza provides named entity recognition at the sentence level (but does not provide entity indices at the token level), so I'm not happy with either of them.
I was able to somehow work my way through with the id attributes of token objects in Stanza, but I'm stuck if named entities are made up of more than one token.
Thank you in advance.
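For reference, a spaCy Span exposes both token and character offsets, so with the spacy-stanza wrapper something like the following should give token-level indices (a sketch assuming an English pipeline with NER available):

import stanza
import spacy_stanza

stanza.download("en")
nlp = spacy_stanza.load_pipeline("en")

doc = nlp("Barack Obama was born in Hawaii.")
for ent in doc.ents:
    # ent.start / ent.end are token indices, ent.start_char / ent.end_char are character offsets
    print(ent.text, ent.start, ent.end, ent.start_char, ent.end_char)
# e.g. "Barack Obama" -> start 0, end 2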

Stanza's sentencizer only works when `processors = 'tokenize,pos,lemma,depparse'`

Hi all,

I started an NLP project where I needed high accuracy sentence segmentation, and therefore decided to use stanza.

I was thrilled to find this library, since spaCy is quite intuitive. However, I found that the sentence segmentation only gets carried into spaCy under certain conditions.

Baseline:

The baseline test is to use the Stanza model alone to see if the sentence segmentation works.

This is the simplest model I could use; I simply turned on the tokenize processor.

Screenshot 2021-02-03 at 18 57 31

Test with Spacy-Stanza:

I then tried the same thing, but this time added the spacy-stanza wrapper.

Screenshot 2021-02-03 at 18 58 00

As shown above, the sentences were not actually tokenized.

Test with spacy-stanza with more processors on Stanza:

Screenshot 2021-02-03 at 18 56 23

It seems that the depparse processor is necessary, but this is rather confusing since the vanilla stanza model does not require it to tokenize.

Any help would be appreciated :)

Morphological features are lost in the Russian model

spaCy version: 2.1.9
spaCy-stanza version: 0.2.1

import stanza
from spacy_stanza import StanzaLanguage

stanza.download('ru')
snlp = stanza.Pipeline(lang="ru")
nlp = StanzaLanguage(snlp)
text = "Мама мыла раму"

Using stanza, I get this:

for sentence in snlp(text).sentences:
	for word in sentence.words:
		print(word.feats)

# Animacy=Anim|Case=Nom|Gender=Fem|Number=Sing
# Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing
# Animacy=Inan|Case=Acc|Gender=Fem|Number=Sing

Using spacy-stanza, I get this:

for token in nlp(text):
	print(token.tag_)

# 
# 
# 

But other annotations (such as lemma, pos, dep etc.) are available.

Is spaCy given priority over Stanford for the same language model?

Using stanfordnlp, the lemma results on an input are:

snlp_en = stanfordnlp.Pipeline(lang="en")
doc = snlp_en("He was a better batsman")
for sentence in doc.sentences:
    for token in sentence.tokens:
        for words in token.words:
            print(words.text, "\t\t", words.lemma)
He 		 he
was 		 be
a 		 a
better 	better
batsman 	batsman

Now, using the latest wrapper provided by spaCy, spacy-stanfordnlp, I get the following results.

snlp_en = stanfordnlp.Pipeline(lang="en")
nlp_en = StanfordNLPLanguage(snlp_en)
doc = nlp_en("He was a better batsman")
for token in doc:
    print(token.text, "\t\t", token.lemma_)
He 		 -PRON-
was 		 be
a 		 a
better 	well
batsman 	batsman

So it looks like spaCy is given priority (you can see this for the words "he" and "better"). When a language has models in both spaCy and Stanford, how will the results be produced? Can you provide full details on how linguistic features are affected in this case?

Entry points for additional languages are required

So far the only language specified in the entry points is English; if another language is required, the entry points have to be modified manually. This is not nice. I suggest that the entry points should include the languages available for Stanza.

The relevant part of the code in setup.py is line 37:

It states:

entry_points={"spacy_languages": ["stanza_en = spacy_stanza:StanzaLanguage"], the list should be like:

entry_points={"spacy_languages": ["stanza_en = spacy_stanza:StanzaLanguage", "stanza_es = spacy_stanza:StanzaLanguage", "stanza_pt"= spacy_stanza:StanzaLanguage",...]

Access CoreNLP parsing tree

Hi,
Is it possible to access the CoreNLP parse tree from the doc object, not only the dependencies?
Thank you,
Lucas Willems

How do I use the stanfordnlp NER annotator?

Using this wrapper, you'll be able to use the following annotations, computed by your pretrained stanfordnlp model:

Statistical tokenization (reflected in the Doc and its tokens)
Lemmatization (token.lemma and token.lemma_)
Part-of-speech tagging (token.tag, token.tag_, token.pos, token.pos_)
Dependency parsing (token.dep, token.dep_, token.head)
Sentence segmentation (doc.sents)

Where is Named Entity Recognition? https://stanfordnlp.github.io/CoreNLP/ner.html

Also, spaCy's own website specifically says that the state of the art comes with CoreNLP, not spaCy:

https://cl.ly/285a4edaf7a5/Image%202019-06-06%20at%207.32.06%20PM.png

Specify GPU to use

I have a system with two GPUs and create pipelines with this code:

pipe = stanza.Pipeline(lang=lang, use_gpu=True)

Is there any way to specify which GPU the pipelines should use? Maybe an option like device="cuda:0"?
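One common workaround at the PyTorch level (not a Stanza option, so treat this as an assumption) is to restrict the visible devices before anything initializes CUDA:

import os

# Make only the second GPU visible to PyTorch/Stanza; must be set before CUDA is initialized
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import stanza

pipe = stanza.Pipeline(lang="en", use_gpu=True)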

Mixing the StanfordNLP model with spaCy NER not working

Hi, I'm using spaCy for extracting entities from documents.
The NER component is very good; however, the sentence splitting on my documents (legal-type documents with long sentences) is quite horrible, while StanfordNLP splits the sentences quite well. I wanted to use the StanfordNLP model along with the NER pipe from spaCy to have the best of both worlds.
However, when I run almost exactly the code shown in the example of how to do it (except that the model is the large model and the text is the text of my document)

snlp = stanfordnlp.Pipeline(lang="en")
nlp = StanfordNLPLanguage(snlp)
spacy_model = en_core_web_lg.load()
ner = spacy_model.get_pipe("ner")
nlp.add_pipe(ner)
doc = nlp(text)

and try to loop over the entities, I'm getting a

../aten/src/ATen/native/LegacyDefinitions.cpp:14: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead.
../aten/src/ATen/native/LegacyDefinitions.cpp:14: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead.
../aten/src/ATen/native/LegacyDefinitions.cpp:14: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead.
Traceback (most recent call last):
  File "/Users/omri/PycharmProjects/NER/main.py", line 58, in <module>
    find_entities(text)
  File "/Users/omri/PycharmProjects/NER/main.py", line 19, in find_entities
    for ent in doc.ents:
  File "doc.pyx", line 512, in spacy.tokens.doc.Doc.ents.__get__
  File "span.pyx", line 118, in spacy.tokens.span.Span.__cinit__
ValueError: [E084] Error assigning label ID 9191306739292312949 to span: not in StringStore.

I think this points to differences between the vocab of the spaCy model and that of the StanfordNLP model.
I'm wondering how it can be fixed?

Thanks!

Offset misalignment in NER using the Stanza tokenizer for French

Hi everyone,

I just found a problem when trying to analyze a French sentence. When I run the following code:

snlp = stanza.Pipeline(lang="fr", verbose=False)
stanzanlp = StanzaLanguage(snlp)

text = "C'est l'un des grands messages passés par Bruno Le Maire, ce matin sur RTL."
doc = stanzanlp(text)

I get this error:

/home/victor/miniconda3/envs/nlp/lib/python3.7/site-packages/ipykernel_launcher.py:4: UserWarning: Can't set named entities because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ["C'", 'est', "l'", 'un', 'de', 'les', 'grands', 'messages', 'passés', 'par', 'Bruno', 'Le', 'Maire', ',', 'ce', 'matin', 'sur', 'RTL.']
Entities: [('Bruno Le Maire', 'PER', 42, 56), ('RTL.', 'ORG', 71, 75)]
  after removing the cwd from sys.path.

Analyzing the same text with the default French model in spaCy, I get almost the same tokens; take a look at the final full stop.

doc = spacynlp(text)

for token in doc:
    print(token.text, token.idx)
    
for ent in doc.ents:
    print(ent.text, ent.label_)
C' 0
est 2
l' 6
un 8
des 11
grands 15
messages 22
passés 31
par 38
Bruno 42
Le 48
Maire 51
, 56
ce 58
matin 61
sur 67
RTL 71
. 74
Bruno Le Maire PER
RTL ORG

Is anyone having the same issues?

Spacy-stanza and Spacy conflict when calling pipelines on the GPU

If either spacy.prefer_gpu() or spacy.require_gpu() is called any time a Stanza pipeline is or will be loaded on the GPU, the subsequent pipeline runs will fail.

Is there any way to circumvent this, or should one of the pipelines be on the CPU if the two need to be loaded at the same time?

How to reproduce the behaviour

import spacy_stanza
import spacy

snlp = spacy_stanza.load_pipeline('fi', processors='tokenize, mwt, lemma, pos, depparse')
spacy.prefer_gpu()
doc = snlp("Tämä on esimerkkilause. Toinen.")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\x\anaconda3\envs\lang_ai\lib\site-packages\spacy\language.py", line 977, in __call__
    doc = self.make_doc(text)
  File "C:\Users\x\anaconda3\envs\lang_ai\lib\site-packages\spacy\language.py", line 1059, in make_doc
    return self.tokenizer(text)
  File "C:\Users\x\anaconda3\envs\lang_ai\lib\site-packages\spacy_stanza\tokenizer.py", line 83, in __call__
    snlp_doc = self.snlp(text)
  File "C:\Users\x\anaconda3\envs\lang_ai\lib\site-packages\stanza\pipeline\core.py", line 210, in __call__
    doc = self.process(doc)
  File "C:\Users\x\anaconda3\envs\lang_ai\lib\site-packages\stanza\pipeline\core.py", line 204, in process
    doc = process(doc)
  File "C:\Users\x\anaconda3\envs\lang_ai\lib\site-packages\stanza\pipeline\pos_processor.py", line 33, in process
    preds += self.trainer.predict(b)
  File "C:\Users\x\anaconda3\envs\lang_ai\lib\site-packages\stanza\models\pos\trainer.py", line 73, in predict
    _, preds = self.model(word, word_mask, wordchars, wordchars_mask, upos, xpos, ufeats, pretrained, word_orig_idx, sentlens, wordlens)
  File "C:\Users\x\anaconda3\envs\lang_ai\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\x\anaconda3\envs\lang_ai\lib\site-packages\stanza\models\pos\model.py", line 100, in forward
    word_emb = pack(word_emb)
  File "C:\Users\x\anaconda3\envs\lang_ai\lib\site-packages\stanza\models\pos\model.py", line 95, in pack
    return pack_padded_sequence(x, sentlens, batch_first=True)
  File "C:\Users\x\anaconda3\envs\lang_ai\lib\site-packages\torch\nn\utils\rnn.py", line 244, in pack_padded_sequence
    _VF._pack_padded_sequence(input, lengths, batch_first)
RuntimeError: 'lengths' argument should be a 1D CPU int64 tensor, but got 1D cuda:0 Long tensor

Only the Stanza pipeline seems to be affected, since loading and running spaCy pipelines appears to work normally, e.g.:

...
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is an example sentence. Another one.")
# no error

I used the Finnish Stanza pipeline since I have that downloaded, but the same issue has been reported previously for other languages as well in Stanza's issues.

Info about spaCy

  • spaCy version: 3.0.3
  • spaCy-stanza version 1.0.0
  • spaCy-transformers version 1.0.1
  • Platform: Windows-10-10.0.19041-SP0
  • Python version: 3.8.5
  • Pipelines: en_core_web_sm (3.0.0)

install requirements

I see there "stanfordnlp>=0.1.0,<0.2.1" line in requirement.txt but still getting following error.

ERROR: spacy-stanfordnlp 0.1.2 has requirement stanfordnlp<0.2.0,>=0.1.0, but you'll have stanfordnlp 0.2.0 which is incompatible.

I guess there is a problem in last release.

Multi-process doesn't work

spaCy version: 2.2.4
spacy-stanza version: 0.2.1
stanza version: 1.0.1

It is not possible to use multiple processes in the pipeline while using the Russian model.

import stanza
import spacy
from spacy_stanza import StanzaLanguage

stanza.download("ru")

snlp = stanza.Pipeline(lang="ru")
ru_nlp = StanzaLanguage(snlp)

text = ["это какой-то русский текст"] * 100

for doc in ru_nlp.pipe(text, batch_size=50, n_process=2):
    print(doc.is_parsed)

Running the example with n_process=1 works; however, with n_process greater than 1 nothing gets printed, there are no errors, and the script doesn't terminate.

Matcher result problem

I have successfully run the parser and see the same POS/dependency results as with stanfordnlp.
But when I run the Matcher over the tokens I see unexpected matches.
For example, I should get only PROPN+ tokens, but I see verb matches as well.
And empty matches...

I have gone through the following code but could not see anything related to the Matcher.
https://github.com/explosion/spacy-stanfordnlp/blob/master/spacy_stanfordnlp/language.py

BTW, the default installation comes with spacy-nightly 6a, and I have tried 9a as well.

import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage
from spacy.matcher import Matcher
import csv

config = {
    'processors': 'tokenize,pos,lemma', #mwt, depparse
	'lang': 'en', # Language code for the language to build the Pipeline in
}

snlp = stanfordnlp.Pipeline(**config)
nlp = StanfordNLPLanguage(snlp)

matcher = Matcher(nlp.vocab)
matcher.add("ProperNounRule", None, *[
	# [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}],
	# [{'POS': 'PROPN'}, {'POS': 'PROPN'}]
	[{'POS': 'PROPN', 'OP': '+'}]
	# {'POS': 'NOUN', 'OP': '+'}
])

text = "US was among the first countries to recognise opposition leader Juan Guaido as legitimate leader, arguing President Nicholas Maduro's May 2018 re-election was a sham. Maduro accuses Guaido of being a coup-mongering puppet for US President Trump."

text = text.replace("“", "\"").replace("”", "\"").replace("’", "'")
doc = nlp(text)

matches = matcher(doc)
print('\n>>> Match result')
for (match_id, start, end) in matches:
	label = doc.vocab.strings[match_id]	
	span = doc[start:end]	
	print(label, ":", str(span), ">", start, ":", end)	

Bold lines are the expected results.

**ProperNounRule : US > 0 : 1**
**ProperNounRule : Juan > 10 : 11**
**ProperNounRule : Juan Guaido > 10 : 12**
ProperNounRule : as > 11 : 12
**ProperNounRule : Nicholas > 17 : 18**
**ProperNounRule : Nicholas Maduro > 17 : 19**
ProperNounRule : 's > 18 : 19
**ProperNounRule : Nicholas Maduro's May > 17 : 20**
ProperNounRule : 2018 re-election > 18 : 20
ProperNounRule : was > 19 : 20
ProperNounRule : sham > 21 : 22
ProperNounRule : a > 28 : 29
ProperNounRule : - > 30 : 31
ProperNounRule :  > 39 : 40
ProperNounRule :  > 39 : 41

Assertion error in make_doc if the spaCy tokenizer is used in Stanza and the text contains a newline

Hi Ines, sorry that I ran into a small bug. I could at least track down the symptoms.
The problem occurs if I use stanza with the spacy tokenizer and my text contains a newline.
The obvious workaround is this one: text = re.sub(r'\s+', ' ', text)

# spacy.__version__ # 2.3.0
# stanza.__version__ # 1.0.1
# spacy_stanza.__version__ # 0.2.3

text = "The FHLBB was insolvent and its\nassets were transferred. "
# works if \n in text is replaced by a space

import spacy, stanza, spacy_stanza
from spacy_stanza import StanzaLanguage

# stanza nlp works fine
stanza_nlp = stanza.Pipeline('en', processors={'tokenize': 'spacy'})
doc = stanza_nlp(text)

# spacy stanza throws assertion 
spacy_stanza_nlp = StanzaLanguage(stanza_nlp)
doc = spacy_stanza_nlp.make_doc(text)

Here the trace:

---------------------------------------------------------------------------
AssertionError                  Traceback (most recent call last)
---> 22 doc = spacy_stanza_nlp.make_doc(text)

.../spacy-stanza/spacy_stanza/language.py in make_doc(self, text)
     65         these will be mapped to token vectors.
     66         """
---> 67         doc = self.tokenizer(text)
     68         if self.svecs is not None:
     69             doc.user_token_hooks["vector"] = self.token_vector

.../spacy-stanza/spacy_stanza/language.py in __call__(self, text)
    193             else:
    194                 token = snlp_tokens[i + offset]
--> 195                 assert word == token.text
    196 
    197                 pos.append(self.vocab.strings.add(token.upos or ""))

AssertionError: 

Issue with Whitespaces for German

Hello,

it seems there is an issue with the trailing whitespace of tokens in the case of, e.g., German.

import sys
import traceback
import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage
    
def stanford_tokenizer(text, language):
    snlp = stanfordnlp.Pipeline(lang=language)
    nlp = StanfordNLPLanguage(snlp)
    try:
        doc = nlp(text)
        tokenized_doc = ("".join([token.text_with_ws for token in doc]))
    except:
        traceback.print_exc()
        sys.exit()
    return tokenized_doc

text = """ Juliana kommt aus Paris. Das ist die Hauptstadt von Frankreich. In diesem Sommer macht sie einen Sprachkurs in Freiburg. 
Das ist eine Universitätsstadt im Süden von Deutschland. Es gefällt ihr hier sehr gut. Morgens um neun beginnt der Unterricht, um vierzehn Uhr ist er zu Ende.
In ihrer Klasse sind außer Juliana noch 14 weitere Schüler, acht Mädchen und sechs Jungen. Sie kommen alle aus Frankreich, aber nicht aus Paris.
"""

tokenized_text = stanford_tokenizer(text, "de")
print(tokenized_text)

Output:
Juliana kommt aus Paris . Das ist die Hauptstadt von Frankreich . In diesem Sommer macht sie einen Sprachkurs in Freiburg . Das ist eine Universitätsstadt in dem Süden von Deutschland . Es gefällt ihr hier sehr gut . Morgens um neun beginnt der Unterricht , um vierzehn Uhr ist er zu Ende . In ihrer Klasse sind außer Juliana noch 14 weitere Schüler , acht Mädchen und sechs Jungen . Sie kommen alle aus Frankreich , aber nicht aus Paris .

As one can see, the periods at the end of the sentences are attached to the last token of a sentence with one additional whitespace. The same holds for other punctuation symbols, while spaCy would detect whether a trailing whitespace actually exists.

Source of sample text: https://lingua.com/german/reading/

Sentence splitting is not working with multiple spaces after punctuation

I am trying to split sentences into segments based on obvious punctuation marks like '.', '?', '!' and have been able to do so easily using the spaCy Sentencizer in the pipeline. Now when I try to use spacy-stanza to split them, it works fine until there are multiple spaces after the punctuation mark.

snlp = stanza.Pipeline(lang='en')
nlp = StanzaLanguage(snlp)
doc = nlp('This is a test message. Second. Third? Fourth! Fifth')

I am getting this warning:

UserWarning: Can't set named entities because the character offsets don't map to valid tokens produced by the Stanza tokenizer:

And this is the output:

['This is a test message.', 'Second. Third?', 'Fourth! Fifth']

How can I get the desired output? When I add the Sentencizer to the nlp pipeline, it gives an error, probably because the input it receives after snlp processing is not in the desired (parsed) format. And when I add it to the processors of snlp, it makes no difference.
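A crude workaround, assuming it is acceptable to collapse the extra whitespace for your use case, is to normalize the text before passing it to the pipeline:

import re

# nlp is the StanzaLanguage pipeline from above; the example string is illustrative
text = "This is a test message.  Second.  Third?  Fourth!  Fifth"
doc = nlp(re.sub(r"\s+", " ", text).strip())
print([sent.text for sent in doc.sents])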

Extra spaces cause token misalignment

If there are multiple whitespace characters between tokens, the Tokenizer will raise a warning and the entity will not be extracted. It looks like Stanza does not treat the extra whitespace as a token, while spaCy would.

import stanza
from spacy_stanza import StanzaLanguage
snlp = stanza.Pipeline(lang='en')
nlp = StanzaLanguage(snlp)
text = "There  are  two  spaces  between  these  words"
doc = nlp(text)
>>> UserWarning: Can't set named entities because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['There', 'are', 'two', 'spaces', 'between', 'these', 'words']
Entities: [('two', 'CARDINAL', 12, 15)]
print(len(doc.ents)) >>> 0

Infinite loop if token texts don't match input text

It works with other text but running this code will cause an infinite loop. It does not like "im Anhang".

snlp = stanfordnlp.Pipeline(lang="de")
nlp = StanfordNLPLanguage(snlp)

doc = nlp("im Anhang")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)

[E048] Can't import language stanza_et from spacy.lang: No module named 'spacy.lang.stanza_et'

Hi, I have been trying to use your module for Estonian Stanza (stanza.download("et")), but I can't get it to work. When I follow the tutorial in README.md, I end up with the error in the title.

The stanza-spacy-model created by nlp.to_disk("./stanza-spacy-model") is a folder that has only a vocab/ folder and meta.json. Is this correct? Where can I specify that the language "stanza_et" is actually okay?
