filyp / autocorrect
Spelling corrector in Python
License: GNU Lesser General Public License v3.0
Hi!
I tried to follow the explanation on adding new languages.
count_words('hiwiki-latest-pages-articles.xml', 'hi')
does not work for me. It says
~\Miniconda3\lib\site-packages\autocorrect\word_count.py in count_words(src_filename, lang, encd, out_filename)
17 def count_words(src_filename, lang, encd=None, out_filename='word_count.json'):
18 words = get_words(src_filename, lang, encd)
---> 19 counts = Counter(words)
20 # make output file human readable
21 counts_list = list(counts.items())
~\Miniconda3\lib\collections\__init__.py in __init__(*args, **kwds)
566 raise TypeError('expected at most 1 arguments, got %d' % len(args))
567 super(Counter, self).__init__()
--> 568 self.update(*args, **kwds)
569
570 def __missing__(self, key):
~\Miniconda3\lib\collections\__init__.py in update(*args, **kwds)
653 super(Counter, self).update(iterable) # fast path when counter is empty
654 else:
--> 655 _count_elements(self, iterable)
656 if kwds:
657 self.update(kwds)
~\Miniconda3\lib\site-packages\autocorrect\word_count.py in get_words(filename, lang, encd)
7
8 def get_words(filename, lang, encd):
----> 9 word_regex = word_regexes[lang]
10 capitalized_regex = r'(\.|^|<|"|\'|\(|\[|\{)\s*' + word_regexes[lang]
11 with open(filename, encoding=encd) as file:
KeyError: 'hi'
Best regards
Robert
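For what it's worth, the KeyError just means 'hi' was never registered in word_regexes before count_words looked it up. A minimal stand-in with plain dicts (the real dict lives inside the library; the Devanagari pattern below is an illustrative guess, not the library's official one) shows the mechanism and the fix:

```python
import re

# stand-in for autocorrect's word_regexes dict (the real one is in the library)
word_regexes = {'en': r'[A-Za-z]+'}

def get_words(text, lang):
    # raises KeyError for any language that was never registered, as above
    return re.findall(word_regexes[lang], text)

# registering the language first avoids the KeyError
word_regexes['hi'] = r'[\u0900-\u097F]+'  # Devanagari range, illustrative only
print(get_words('नमस्ते world', 'hi'))  # → ['नमस्ते']
```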
Thanks for this wonderful lib!
Can you add some functionality to detect accidentally merged words, for example, when the whitespace separating two words was omitted?
from autocorrect import Speller
spellEn = Speller('en')
[spellEn.get_candidates(lemma) for lemma in ['test','project','testproject']]
>>>[[(495684, 'test')], [(1628175, 'project')], [(0, 'testproject')]]
It would be cool if 'testproject' could produce correct candidates: 'test' and 'project'
How hard is it to add such a feature?
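Not the maintainer, but here is a rough sketch of how such a split could work on top of the existing word counts; the counts dict below stands in for the speller's internal frequency data:

```python
# toy frequency data; the real speller keeps a much larger dict like this
counts = {'test': 495684, 'project': 1628175}

def split_candidates(word):
    """Return (left, right) pairs where both halves are known words."""
    return [
        (word[:i], word[i:])
        for i in range(1, len(word))
        if word[:i] in counts and word[i:] in counts
    ]

print(split_candidates('testproject'))  # → [('test', 'project')]
```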
Hi, I have some problems with downloading the Polish package. It looks like Dropbox blocks the link.
couldn't download https://dl.dropboxusercontent.com/s/40orabi1l3dfqpp/pl.tar.gz?dl=0, trying next url...
Traceback (most recent call last):
  File "C:\Users\lisek\Desktop\test.py", line 15, in <module>
    spell = Speller('pl')
  File "D:\Python\lib\site-packages\autocorrect\__init__.py", line 78, in __init__
    self.nlp_data = load_from_tar(lang) if nlp_data is None else nlp_data
  File "D:\Python\lib\site-packages\autocorrect\__init__.py", line 61, in load_from_tar
    raise ConnectionError(
ConnectionError: HTTP Error 429: Too Many Requests
Fix your network connection, or manually download
Is there any chance to get language pack another way?
There is currently an issue which causes .exe files created using pyinstaller to crash.
If the source Python file used the autocorrect library, the program would crash if the user tried to launch it from the .exe file. "dictionary for this language not found, downloading..." will appear in the terminal, then "couldn't download https://drive.google.com/uc?export=download&id=19xqFyk9d8aFR7LR43oy6cExk8Pk9wVwV, trying next url...", followed by a ConnectionError: [Errno 2]. When I visit the link in my browser, it seems to work. This is for the English language.
Is it possible to fix this, or is the library not meant to be used from a stand-alone executable?
Let's add instructions to guide contributors,
e.g. specifying the tools and coding style to be used when making contributions.
Before everything, it is a great tool. Thank you for your work. Although Turkish NLP is a difficult task, your model handles it successfully. However, there are some mistakes in some specific sentences which I tried for testing. It doesn't fix some specific words. At this point, I want to improve this model. What can I do for it? If I collect a huge corpus for training from scratch, would that be useful?
Hi, since the code, or at least the algorithm, is based on Peter Norvig's spelling corrector https://norvig.com/spell-correct.html, you should mention it in the README. This way the advanced reader will quickly get an idea of the implementation.
Right now it's only mentioned here https://github.com/filyp/autocorrect/blob/0fa2a7cab20a44b9d8393246d9553629bb35e077/autocorrect/typos.py
The word "saree" exists, but it gets autocorrected to "spree". How can I avoid this?
def double_typos(self):
"""letter combinations two typos away from word"""
return chain.from_iterable(
Word(e1, only_replacements=self.only_replacements).typos()
for e1 in self.typos()
)
When fixing the second typo, a Word object is created without specifying a language, so the second typo correction always uses the English alphabet. As a result, double-typoing a word like здрйхствуйте returns corrections like здраmствуйте instead of здравствуйте: one of the letters is replaced properly, and another with an English letter.
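A minimal sketch of the fix, with tiny stand-in alphabets (the real Word class has more parameters than shown here): threading self.lang into the inner Word keeps the second pass in the right alphabet.

```python
from itertools import chain

alphabets = {'en': 'abc', 'ru': 'абв'}  # tiny stand-in alphabets

class Word:
    def __init__(self, word, lang='en'):
        self.word = word
        self.lang = lang

    def typos(self):
        # replace each position with each letter of this word's own alphabet
        a = alphabets[self.lang]
        return [self.word[:i] + c + self.word[i + 1:]
                for i in range(len(self.word)) for c in a]

    def double_typos(self):
        # the fix: pass self.lang down so the second pass keeps the alphabet
        return chain.from_iterable(
            Word(e1, self.lang).typos() for e1 in self.typos()
        )

# every double-typo candidate of a Russian word stays Cyrillic
candidates = list(Word('аб', 'ru').double_typos())
print(all(set(w) <= set('абв') for w in candidates))  # → True
```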
Hi.
From the commit messages it appears that the project changed its license twice in the past month (GPL and LGPL).
And the original project was MIT.
Can you tell us why the changes were made?
While following the tutorial for the Hindi language (but using the Italian wiki dump instead), this error appeared:
'charmap' codec can't decode byte 0x9d in position 5050: character maps to <undefined>
This is the code used:
from autocorrect.word_count import count_words
count_words('itwiki-latest-pages-articles.xml', 'it')
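The 'charmap' codec is Windows' default cp1252 encoding. Since the count_words signature quoted elsewhere in these issues has an encd parameter, passing count_words('itwiki-latest-pages-articles.xml', 'it', encd='utf-8') should help. A stdlib-only reproduction of the same failure and fix:

```python
import os
import tempfile

# write UTF-8 text containing a right double quote (U+201D); its UTF-8 bytes
# include 0x9d, the byte from the error above, which cp1252 cannot decode
path = os.path.join(tempfile.gettempdir(), 'demo_utf8.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('città ”test”')

try:
    open(path, encoding='cp1252').read()
except UnicodeDecodeError as e:
    print('cp1252 fails:', e.reason)  # character maps to <undefined>

print(open(path, encoding='utf-8').read())  # decodes cleanly
```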
When trying to download the Polish language pack with spell = Speller(lang='pl'),
I get: couldn't download https://siasky.net/AAC52kUAAmfZF_DUZVMQWWO-tJ3g3_3FDEhR2BUXr4oq8Q, trying next url...
Hi,
First of all - this looks great. Thanks a lot.
I compared 3 different packages (yours, pyspellchecker, textblob) and yours does the best.
How can I improve performance? Is there a way to finetune this to a specific data set?
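Not sure about official fine-tuning support, but since these issues show the frequency data being regenerated from a corpus with count_words, one low-tech way to adapt to a specific data set is to rebuild the word counts from your own text. A stdlib sketch of the idea (the exact JSON layout autocorrect expects is assumed, so check word_count.py before relying on it):

```python
import re
from collections import Counter

# build domain-specific word counts from your own corpus
corpus = 'domain terms appear often in a domain corpus so the domain wins'
counts = Counter(re.findall(r'[a-z]+', corpus.lower()))
top = sorted(counts.items(), key=lambda kv: -kv[1])
print(top[0])  # → ('domain', 3)
```

Words that dominate your own data then outrank generic dictionary words when candidates are scored by frequency.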
I did everything according to the instructions in the description, but I could not find an XML file for my language. Could anybody please tell me how to find one and add a new language?
Thanks a lot in advance!
Hi @filyp
Thanks for providing the option to change the dict to one's own text file. Yours is the only package I found which replaces words within sentences; every other one works on one word at a time.
However, I am facing an error while doing tar to the output. Attaching screenshot.
Can you please help?
Also, I see that it works fine up to changes = 2; how can I increase this?
Eg: spell('NissSSan') returns 'Nissan' and
spell('NissSSSan') returns 'NissSSSan'
Help will be really appreciated.
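Not the author, but since the search depth looks hard-wired at two edits, one workaround is applying the speller repeatedly until the word stops changing. A sketch, with a toy corrector standing in for a real Speller instance:

```python
def fixed_point(correct, word, max_rounds=5):
    """Apply a one-shot corrector until its output stabilises."""
    for _ in range(max_rounds):
        new = correct(word)
        if new == word:
            return new
        word = new
    return word

# toy corrector standing in for Speller: strips one trailing '!' per call
toy = lambda w: w[:-1] if w.endswith('!') else w
print(fixed_point(toy, 'hello!!!'))  # → hello
```

Note that each pass still only searches two edits, so this is not equivalent to a true three-edit search; results may differ.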
from autocorrect import Speller
This line is throwing the error as mentioned in the title. Has it been deprecated or is there any workaround?
Hello.
I can't find a function to ignore a word.
In some cases we need it.
An example in the English dictionary:
srai -> sry
but I have to ignore it.
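Until the library grows such an option, a wrapper that skips an ignore list is a workable stopgap (the toy dict corrector below stands in for a real Speller instance):

```python
IGNORE = {'srai'}

def correct_with_ignore(correct, text):
    # leave ignored words alone, correct everything else word by word
    return ' '.join(
        w if w.lower() in IGNORE else correct(w) for w in text.split()
    )

toy = lambda w: {'helo': 'hello'}.get(w, w)  # stand-in for Speller('en')
print(correct_with_ignore(toy, 'helo srai'))  # → hello srai
```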
How can I add a language to this package when I'm using it in Google Colab?
Can we focus on Python3 and drop support for Python2 since it has reached end of life?
It would allow the project to use the newer features.
And maybe we should also choose which Python3 version we support? How about >=3.6?
May this fork rejuvante the package. :-)
Edit: I noticed the typo in the last line afterwards. I will leave it there. :D
Hiya,
First of all, big thanks for all the work on this project, it's awesome and super-lightweight, making it perfect for my project, which is going to be used to support hundreds of children with their education.
My project crashed today though, and in debugging I found it was because of autocorrect. Here's a minimally reproducible example of the error:
from autocorrect import Speller
autocorrector = Speller()
part_given = ""
part_given = autocorrector.autocorrect_word(part_given)
I was expecting this to keep part_given as "", but instead it raises an IndexError.
With best wishes,
JimmyCarlos.
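Until the empty-string case is handled inside the library itself, a small guard avoids the crash (autocorrect_word is the method named in the report above; the guard itself is plain Python):

```python
def safe_correct(correct_word, word):
    # empty input trips an IndexError inside the library, so short-circuit it
    return word if not word else correct_word(word)

# str.upper stands in for autocorrector.autocorrect_word here
print(safe_correct(str.upper, ''))    # → '' (empty stays empty)
print(safe_correct(str.upper, 'ok'))  # → OK
```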
I added the Portuguese language. For the pt.tar.gz file, should I push it, or create a Dropbox link for it and add it to the constants?
I tried pip install autocorrect
It didn't work, so I downloaded the code from GitHub and ran
pip install autocorrect-master
which gave me the same log and problems.
I tried all the versions; none worked.
I have no idea what I'm missing.
Hi, I was wondering if autocorrect can utilize multiple languages simultaneously in a multi-lingual setup.
For example,
from autocorrect import Speller
spellcheck = Speller(lang='en', 'es', 'fr')
corrected_text = spellcheck(text)
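As far as I can tell, Speller takes a single lang (and the call sketched above is not valid Python syntax anyway). One workaround is chaining one speller per language and keeping the first correction that actually changes the word; the toy dict correctors below stand in for real Speller instances:

```python
def multi_spell(spellers, word):
    # try each language's speller in turn; keep the first actual correction
    for s in spellers:
        fixed = s(word)
        if fixed != word:
            return fixed
    return word

en = lambda w: {'helo': 'hello'}.get(w, w)  # toy stand-in for Speller('en')
es = lambda w: {'ola': 'hola'}.get(w, w)    # toy stand-in for Speller('es')
print(multi_spell([en, es], 'ola'))  # → hola
```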
Is there a way to measure or evaluate how much the system has done the spelling correction correctly?
Hi.
(from Wikipedia) The idea is to only use one hand (preferably the left one) and type the right-hand letters by holding a key that acts as a modifier key. The layout is mirrored, so the use of the muscle memory of the other hand is possible, which greatly reduces the amount of time needed to learn the layout if the person previously used both hands to type. This was first proposed by Randall Munroe on the xkcd-blog.
I would like it so that when I type a word such as "cwwdee", it takes the mirrored QWERTY layout into consideration and autocorrects it to "cookie".
Hi, I have a proper name 'SO' that is getting corrected to 'S' when using the Speller class. Is there any way around this?
Q1: How can I check if a word exists in the dictionary?
Q2: What does the method 'existing' do?
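A partial answer from reading the tracebacks in these issues: the speller is built on a word-frequency dict, so an existence check is a plain lookup. The attribute name nlp_data is taken from the __init__ traceback quoted above, and the reading of 'existing' as "filter down to known words" is a guess, so verify against the source:

```python
# stand-in for the speller's internal frequency dict (nlp_data is assumed)
nlp_data = {'test': 495684, 'project': 1628175}

def exists(word):
    # Q1: membership test against the frequency dict
    return word in nlp_data

def existing(words):
    # Q2 (plausible reading): keep only the words the dictionary knows
    return {w for w in words if w in nlp_data}

print(exists('test'), existing(['test', 'tset', 'project']))
```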
Hi, I'm trying to load and run the spell checker with the Italian language (supported) using the following script:
from autocorrect import Speller
spell = Speller(lang='it')
result = spell('Ciaa da uma perzona itaviana')
print(result)
...but I receive the following error.
Can you help us? We're building a big project upon this library.
With best regards,
Sebastian
Hello, I am trying to add the Azerbaijani language, but I can't manage to do it.
I added "az": r"[AaBbCcÇçDdEeƏəFfGgĞğHhXxIıİiJjKkQqLlMmNnOoÖöPpRrSsŞşTtUuÜüVvYyZz]+", to word_regexes
and "az": "abcdefghijklmnopqrstuvxyzəüöğşçı", to alphabets.
Then downloaded azwiki-latest-pages-articles.xml
When I run this code, it gives me an error:
from autocorrect.word_count import count_words
count_words('azwiki-latest-pages-articles.xml', 'salam')
PS: "salam" means "hello" in my language
The following error appears:
Traceback (most recent call last):
  File "C:/Users/User/AppData/Local/Programs/Python/Python39/spell.py", line 5, in <module>
    count_words('azwiki-latest-pages-articles.xml', 'salam')
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\autocorrect\word_count.py", line 19, in count_words
    counts = Counter(words)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\collections\__init__.py", line 593, in __init__
    self.update(iterable, **kwds)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\collections\__init__.py", line 679, in update
    _count_elements(self, iterable)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\autocorrect\word_count.py", line 9, in get_words
    word_regex = word_regexes[lang]
KeyError: 'salam'
Is there any way to fix this?
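It looks like the immediate problem is the call itself: the second argument of count_words is the language code, so it must be the same key added to word_regexes, i.e. 'az', not an arbitrary word like 'salam'. A stdlib stand-in for the lookup that fails in the traceback:

```python
import re

# stand-in for the dict this issue added 'az' to (pattern shortened)
word_regexes = {'az': r'[a-zəüöğşçı]+'}

def get_words(text, lang):
    return re.findall(word_regexes[lang], text)

# count_words('azwiki-latest-pages-articles.xml', 'az') would use this key;
# passing 'salam' raises KeyError because it is not a registered language
print(get_words('salam dünya', 'az'))  # → ['salam', 'dünya']
```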
I want to add new words for autocorrection. For example, I want strings like 'tive' and 'fiue' to autocorrect to the string 'five'. How can I achieve that?
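One approach that works without touching the library's data files: apply a hand-made replacement table before falling back to the speller (the default speller below is a stand-in for a real Speller instance):

```python
custom = {'tive': 'five', 'fiue': 'five'}  # your hand-made corrections

def correct(word, speller=lambda w: w):
    # check the custom table first, fall back to the normal speller
    return custom.get(word, speller(word))

print(correct('tive'), correct('fiue'), correct('six'))  # → five five six
```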
I followed your instructions, but I got an error when trying to count words from the language corpus.
I attach a screenshot here, but the error log is the following
~/miniconda3/envs/cs/lib/python3.8/site-packages/autocorrect/word_count.py in get_words(filename, lang, encd)
7
8 def get_words(filename, lang, encd):
----> 9 word_regex = word_regexes[lang]
10 capitalized_regex = r'(\.|^|<|"|\'|\(|\[|\{)\s*' + word_regexes[lang]
11 with open(filename, encoding=encd) as file:
KeyError: 'fr'
My guess is that this is related to the language keys present in the word_regexes dict.
I have followed your instructions and tried to create a Hindi spell checker, but it does not seem to work. Please share your email ID with me.