Comments (4)

mammothb commented on July 26, 2024

but I was using https://github.com/wolfgarbe/SymSpell/blob/master/SymSpell/frequency_dictionary_en_82_765.txt

The dictionary file from the original SymSpell repository is saved with UTF-8-BOM encoding, while load_dictionary() opens the file as UTF-8 by default. The byte order mark is then probably read as part of the first line, which would explain the extra characters there.

The dictionary file provided by this repository is in UTF-8 encoding (no BOM) and should load properly.

Also, could you try using

sym_spell.load_dictionary("frequency_dictionary_en_82_765.txt", 0, 1, encoding="utf_8_sig")

with the dictionary file from the original SymSpell repo (without any modifications) and see if it loads properly?
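To illustrate the suspected cause without symspellpy itself, here is a minimal sketch using only the standard library. The file name and the two-line stand-in dictionary are made up for the example; only the BOM behaviour of the two codecs is the point.

```python
import codecs
import tempfile
from pathlib import Path

# Write a tiny stand-in dictionary with a UTF-8 byte order mark in front,
# mimicking a file saved as UTF-8-BOM. The file name is hypothetical.
path = Path(tempfile.mkdtemp()) / "bom_dictionary.txt"
path.write_bytes(codecs.BOM_UTF8 + b"the 23135851162\nof 13151942776\n")

# Plain UTF-8 keeps the BOM as the character U+FEFF at the start of the text.
with open(path, encoding="utf-8") as infile:
    first_plain = infile.readline().split()[0]

# utf_8_sig drops a leading BOM if one is present.
with open(path, encoding="utf_8_sig") as infile:
    first_sig = infile.readline().split()[0]

print(repr(first_plain))  # '\ufeffthe'
print(repr(first_sig))    # 'the'
```

The utf_8_sig codec also decodes files that have no BOM, so passing it should be harmless for the dictionary shipped with this repository as well.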

mammothb commented on July 26, 2024

Do you have a sample code snippet that reproduces the error? Also, how did you obtain the "frequency_dictionary_en_82_765.txt" file: did you download it directly, or copy and paste its contents into a new file?

I am having trouble reproducing the error you described. This code snippet downloads the file from GitHub:

from pathlib import Path

import requests

r = requests.get(
    "https://raw.githubusercontent.com/mammothb/symspellpy/master/symspellpy/"
    "frequency_dictionary_en_82_765.txt"
)

path = Path.cwd() / "frequency_dictionary_en_82_765.txt"
with open(path, "wb") as outfile:
    outfile.write(r.content)

with open(path, "r") as infile:
    print(infile.readlines()[0])

Outputs:

the 23135851162

I also tried the sample code snippet from the documentation, and it does not show the error either:

[('the', 23135851162), ('of', 13151942776), ('and', 12997637966), ('to', 12136980858), ('a', 9081174698)]

crazoter commented on July 26, 2024

I have run into the same problem with symspellpy==6.7.6, but I was using https://github.com/wolfgarbe/SymSpell/blob/master/SymSpell/frequency_dictionary_en_82_765.txt, which I downloaded manually and then extended by hand with a few entries at the end of the file (there may be duplicate entries, but I don't think that is the cause).

The files appear identical, so I didn't really know what the issue was, but I thought you might be interested.

Code snippet:

from symspellpy import SymSpell, Verbosity
sym_spell = SymSpell(max_dictionary_edit_distance=6)    
# https://symspellpy.readthedocs.io/en/latest/api/symspellpy.html#symspellpy.symspellpy.SymSpell.load_dictionary
sym_spell.load_dictionary("frequency_dictionary_en_82_765.txt", 0, 1)
suggestions = sym_spell.lookup("the", Verbosity.CLOSEST, max_edit_distance=6)

frequency_dictionary_en_82_765.txt

I resolved this quite easily by adding a new line as the first line.
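A possible explanation for why the extra first line helps, sketched with the standard library only: if the editor keeps the BOM at the very start of the file, the BOM ends up alone on the new first line and the "the" entry is read cleanly. That the loader then ignores a one-token line is an assumption here, not something confirmed in this thread.

```python
import codecs

# Stand-in for the edited file: BOM, then an empty first line, then an entry.
raw = codecs.BOM_UTF8 + b"\nthe 23135851162\n"

# Decode as plain UTF-8, as a default open() with encoding="utf-8" would.
lines = raw.decode("utf-8").splitlines()

bom_line_tokens = lines[0].split()    # only the stray U+FEFF character
first_entry_tokens = lines[1].split() # clean term and count

print(bom_line_tokens)     # ['\ufeff']
print(first_entry_tokens)  # ['the', '23135851162']
```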

crazoter commented on July 26, 2024

Interesting, I wouldn't have thought the problem was related to the encoding until you mentioned it. Adding the encoding parameter does indeed fix the issue for the dictionary from the original SymSpell repository.
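For anyone who would rather fix the file once than pass the encoding on every load, a rough sketch of stripping the BOM and saving a plain UTF-8 copy follows. The helper name strip_bom and the file names are hypothetical and not part of symspellpy.

```python
import codecs
import tempfile
from pathlib import Path


def strip_bom(src: Path, dst: Path) -> bool:
    """Copy src to dst without a leading UTF-8 BOM. Return True if one was removed."""
    data = src.read_bytes()
    had_bom = data.startswith(codecs.BOM_UTF8)
    if had_bom:
        data = data[len(codecs.BOM_UTF8):]
    dst.write_bytes(data)
    return had_bom


# Usage with a small stand-in file (hypothetical names and contents).
workdir = Path(tempfile.mkdtemp())
original = workdir / "with_bom.txt"
cleaned = workdir / "without_bom.txt"
original.write_bytes(codecs.BOM_UTF8 + b"the 23135851162\n")

removed = strip_bom(original, cleaned)
print(removed)               # True
print(cleaned.read_bytes())  # b'the 23135851162\n'
```

Working on raw bytes leaves line endings and everything after the BOM untouched.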
