Bot Misparim is a script developed for Hebrew Wikipedia for catching common grammar mistakes related to numbers.
In Hebrew, one very common mistake is using a numeral of the wrong gender (שני ילדות; שתי ילדים).
It isn't an easy task to parse a sentence and work out which word the number in it refers to. This bot does it with simple regex heuristics*. The bot uses the hspell project to analyze words and classify their gender.
* You are welcome to challenge this approach with more advanced NLP/ML approaches (CFG/PCFGs, LSTM/RLU, etc.) that catch more grammar errors.
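The idea of the heuristic can be sketched in a few lines of Python. This is only an illustration, not the bot's actual rules: the names `NOUN_GENDER`, `NUMERAL_GENDER` and `find_gender_mismatches` are hypothetical, and the small hand-written `NOUN_GENDER` table stands in for the gender lookup that the real bot gets from hspell.

```python
import re

# ASSUMPTION: a tiny hand-written table standing in for the hspell lookup.
NOUN_GENDER = {
    "ילדים": "m",    # boys
    "ילדות": "f",    # girls
    "ספרים": "m",    # books
    "מכוניות": "f",  # cars
}

# Numerals that directly precede the noun, with the gender they require.
NUMERAL_GENDER = {
    "שני": "m", "שתי": "f",     # two
    "שלושה": "m", "שלוש": "f",  # three
}

# Longest numerals first, so the alternation prefers the full word.
_numerals = sorted(NUMERAL_GENDER, key=len, reverse=True)
PATTERN = re.compile(r"\b(" + "|".join(_numerals) + r")\s+(\w+)")


def find_gender_mismatches(text):
    """Return (numeral, noun) pairs whose genders disagree."""
    mismatches = []
    for match in PATTERN.finditer(text):
        numeral, noun = match.group(1), match.group(2)
        noun_gender = NOUN_GENDER.get(noun)
        if noun_gender is None:
            continue  # unknown word: stay silent rather than guess
        if noun_gender != NUMERAL_GENDER[numeral]:
            mismatches.append((numeral, noun))
    return mismatches
```

Calling `find_gender_mismatches` on the two wrong examples above should report both pairs, while a correct phrase such as שני ילדים yields an empty list. The sketch assumes the word right after the numeral is the noun it counts, which is exactly the kind of shortcut that makes a heuristic approach error-prone.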
- pywikibot - Python framework for accessing Wikipedia
- HspellPy - a Python wrapper for hspell
python misparim.py -xml:XML_DUMP
You can get a dump from http://dumps.wikimedia.org/
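To get a feel for what such a dump contains, the sketch below walks over its pages using only the Python standard library. It is not how misparim.py reads the dump (pywikibot ships its own `xmlreader` module for that); the function name `iter_pages` is hypothetical, and the sketch assumes the usual MediaWiki export layout of `page`, `title` and `text` elements.

```python
import bz2
import xml.etree.ElementTree as ET


def iter_pages(dump_path):
    """Yield (title, wikitext) for each page in a MediaWiki XML dump."""
    # Dumps are usually distributed bz2-compressed; plain XML works too.
    opener = bz2.open if dump_path.endswith(".bz2") else open
    with opener(dump_path, "rb") as dump_file:
        title, text = None, ""
        for _event, elem in ET.iterparse(dump_file, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace
            if tag == "title":
                title = elem.text
            elif tag == "text":
                text = elem.text or ""
            elif tag == "page":
                yield title, text
                title, text = None, ""
                elem.clear()  # release the finished page's subtree
```

A loop such as `for title, text in iter_pages("XML_DUMP"): ...` then gives one page at a time without loading the whole dump into memory.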
Advanced usage:
python misparim.py -xml:XML_DUMP -fix
Semi-automatic bot for fixing the suspected errors (use with CAUTION).
The bot relies heavily on heuristics, so don't always trust it!
Here are some possible mistakes:
- שבע may be the adjective "sated" rather than the number seven:
  - זקן ושבע ימים
  - שבע מרורים
- בני תשע + (males)
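One possible way to keep such cases from being reported is an exception list that is checked before a match is flagged. The sketch below is an assumption about how that could look, not a description of what the bot does; `KNOWN_EXCEPTIONS` and `is_known_exception` are hypothetical names. In both phrases, שבע is followed by a masculine plural noun (ימים, מרורים), so a naive gender check would flag them.

```python
import re

# ASSUMPTION: a hand-maintained list of phrases where the "numeral"
# is not actually a number.
KNOWN_EXCEPTIONS = ("זקן ושבע ימים", "שבע מרורים")


def is_known_exception(text, start, end):
    """Return True if text[start:end] lies inside a listed exception phrase."""
    for phrase in KNOWN_EXCEPTIONS:
        for match in re.finditer(re.escape(phrase), text):
            if match.start() <= start and end <= match.end():
                return True
    return False
```

A checker would call `is_known_exception` with the span of a suspected numeral and noun pair and skip the report when it returns `True`. A fixed list only covers idioms someone has already noticed, so it reduces false positives rather than removing them.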