Comments (4)
We're adding them in, not filtering them out. `stanza` itself doesn't return any annotation for whitespace, or if you feed it whitespace-only tokens I think you get nonsense back, like `NOUN`, because the models aren't trained to handle whitespace tokens.
The way a spaCy `Doc` is stored underneath, the text is just the sum of the individual token texts, so you need some way to represent every bit of the original text including whitespace. Anything beyond a single trailing space is turned into a separate whitespace token. We could also potentially use `X`, but spaCy has been using `_SP` for `token.tag` and `SPACE` for `token.pos` since the earliest versions of the library, so it makes sense for this wrapper to behave the same way. I agree that it goes against the idea of UPOS, but it looks like it was added because there were cases where whitespace vs. not-whitespace was a useful distinction.
It should be easy to write a custom pipeline component to convert `SPACE` to `X` if you'd like, and once we've released a v3-compatible version (coming very soon!), you can use the `attribute_ruler` to do this.
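A minimal sketch of the custom-component idea mentioned above, not the wrapper's own code. A spaCy pipeline component is a callable that takes a `Doc` and returns it; the name `space_to_x` is hypothetical, and the `FakeToken` stand-in exists only so the sketch runs without spaCy or stanza models installed.

```python
from dataclasses import dataclass


def space_to_x(doc):
    """Relabel whitespace tokens from SPACE to X, leaving other tags untouched.

    In spaCy v3 this function would be registered with
    @Language.component("space_to_x") and added with nlp.add_pipe("space_to_x").
    """
    for token in doc:
        if token.pos_ == "SPACE":
            token.pos_ = "X"
    return doc


# Hypothetical stand-in for spaCy tokens, used only to exercise the function.
@dataclass
class FakeToken:
    text: str
    pos_: str


fake_doc = [
    FakeToken("Hello", "INTJ"),
    FakeToken("\n\n", "SPACE"),
    FakeToken("world", "NOUN"),
]
converted = space_to_x(fake_doc)
```

Running this as a pipeline component would avoid a separate pass over the corpus, since it is applied while each `Doc` is being processed.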
from spacy-stanza.
Thanks for the quick answer and explanation! I already assumed it depends more on spaCy than on the mental model behind UPOS.
However, converting them into `X` taints the complete annotation, which makes annotation projection not really feasible, since not all tokenizers produce space tokens[1]. (That's the reason why stanza makes crap out of blank lines.) Also, blanks/spaces are not really parts of speech at all (they have no syntactic meaning).
Writing an additional component is actually not my problem (I have already done it). It's more that iterating over a corpus once more slows down the whole annotation step.
At least it would be really nice if you could make consumers of this package aware of this behavior. It might seem obvious to people using spaCy on a daily basis, but not to people like me, who assume it sticks to standards. It should also be noted for spaCy itself, since the link for Token#pos_ is misleading and causes bugs like in my case.
[1]: Before you ask: during a projection I deal with at least 2 corpora, and I cannot assume that all components use the exact same tokenizer, so I need to stick to standards like UPOS rather than concrete implementations like spaCy.
from spacy-stanza.
Yes, the spaCy docs could be improved here.
If you don't want to convert `SPACE` to a valid UPOS tag, I'm not sure what kind of answer you're looking for? A spaCy `Doc` is going to include these whitespace tokens if there's whitespace beyond a single trailing space in the input, so if you don't want any space tokens in a `Doc`, the only option is to preprocess the texts to collapse contiguous whitespace to a single space.
stanza's `Document` from a plain stanza pipeline might be more suitable for you than spaCy's `Doc`?
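The whitespace-collapsing preprocessing mentioned above could look roughly like this. It is a sketch using only the standard library; the helper name `collapse_whitespace` is hypothetical, and stripping leading and trailing whitespace is an extra assumption on top of what was described.

```python
import re

# Any run of whitespace characters (spaces, tabs, newlines).
_WS_RUN = re.compile(r"\s+")


def collapse_whitespace(text: str) -> str:
    """Replace each run of whitespace with one space and trim both ends."""
    return _WS_RUN.sub(" ", text).strip()


cleaned = collapse_whitespace("Hello  world.\n\nNew   paragraph. ")
```

Note that this changes character offsets, so any annotations projected back onto the original text would need an offset mapping.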
from spacy-stanza.
Thanks again for the quick answer.
Plain stanza is currently not an option for this iteration, since it would require a lot of changes to my project, which I cannot afford to make at the moment due to time pressure.
The answer I was looking for is more like: "Hey, we already know about that and we are planning to work around it with ...". But it's also okay if the answer is "Oh, we do that by design", which is what it looks like.
However, I would be very grateful if this were addressed in the docs. It cost me several hours to figure this out, and I will probably not be the last person who stumbles over it.
Anyways, thanks for the help.
from spacy-stanza.