While working with words_alpha.txt in Python 3.6, I have faced slow loading and slow s


Slow loading of words_alpha.txt in Python about english-words HOT 13 CLOSED

dwyl commented on July 16, 2024

Slow loading of words_alpha.txt in Python

from english-words.

Comments (13)

SwiftsNamesake commented on July 16, 2024

@arsho You'd still have to load the data, wouldn't you, whether it's plain text or JSON? As for searching performance, how about constructing a set/prefix-tree/dict/[whatever data structure you need] upon loading the file?


arsho commented on July 16, 2024

You are right. I still have to load the data and use a faster data structure. I solved the issue by creating a JSON file named words_dictionary.json from words_alpha.txt, which can be loaded directly as a dictionary. See pull request #22.
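For readers who want to try a conversion like this, here is a minimal sketch. It is not the script from the pull request: the function names `words_to_json` and `convert` are hypothetical, and the placeholder value of 1 is an assumption made for illustration.

```python
import json


def words_to_json(lines, placeholder=1):
    """Return a JSON object string mapping each word to a placeholder value.

    `lines` is any iterable of strings, such as an open file handle.
    """
    # strip() drops the trailing newline; blank lines are skipped
    words = {line.strip(): placeholder for line in lines if line.strip()}
    return json.dumps(words)


def convert(txt_path="words_alpha.txt", json_path="words_dictionary.json"):
    """One-off conversion of the text list into a JSON file (hypothetical helper)."""
    with open(txt_path, "r", encoding="utf-8") as src:
        payload = words_to_json(src)
    with open(json_path, "w", encoding="utf-8") as dst:
        dst.write(payload)
```

After a one-off call to `convert()`, `json.load()` on the resulting file returns a dict directly. Whether that loads faster than reading the text file is something to measure rather than assume.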


SwiftsNamesake commented on July 16, 2024

@arsho I guess we'll have to wait for the maintainer to chime in, but why add a Python file to this library?

Beyond that, I suspect it might be faster to simply load the txt file that already exists and then create a set from it (rather than a dict, since you don't need the values).

with open('path/to/words.txt', 'r') as f:
    words = set(f.readlines())


arsho commented on July 16, 2024

@SwiftsNamesake, thanks for demonstrating the use of a set. Yes, a set is good. But for frequently searching and matching words against the word list, a dictionary shows better performance. Moreover, for frequent searches it is easier to use a dictionary, whose value field can store information about each word, such as its search frequency or search history.

Still, for anyone who wants to use a set, here is an example that fixes the newline issue in the program above:

with open("words_alpha.txt","r") as f:
    english_words = set([word.strip() for word in f.readlines()])
    search_word = ["unumpired","sansculotte","nonstriking","notindict"]
    for word in search_word:
        print(word, word in english_words)

I think there is no problem with adding a language-specific example (here, the read_english_dictionary.py Python file) to this library. It will help newcomers understand how to use this huge list of words smoothly. As far as I have seen, most popular libraries and data sets include language-specific examples of how to use them.


FluxIX commented on July 16, 2024

@arsho For inserting and searching for words, a trie should probably be used, although a dictionary may fare better than a set depending on the set's implementation (dictionaries are generally sorted but a set may not be). The issue with presenting an example using a trie is that a trie is not as well known as a list, dictionary, or set, and the audience's familiarity with a data structure's interface should be considered when writing examples for a general audience.
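For readers unfamiliar with tries, a minimal sketch built on nested dicts might look like the following. The class name `Trie`, its method names, and the `"$"` end-of-word marker are all illustrative choices, and the sketch assumes that words never contain `"$"` (true for a letters-only list).

```python
class Trie:
    """Minimal prefix tree over nested dicts (illustrative only)."""

    _END = "$"  # end-of-word marker; assumes no word contains "$"

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            # descend, creating child nodes as needed
            node = node.setdefault(ch, {})
        node[self._END] = True

    def _walk(self, text):
        """Return the node reached by following `text`, or None."""
        node = self.root
        for ch in text:
            node = node.get(ch)
            if node is None:
                return None
        return node

    def __contains__(self, word):
        node = self._walk(word)
        return node is not None and self._END in node

    def has_prefix(self, prefix):
        return self._walk(prefix) is not None
```

Unlike a set or dict, this structure can also answer prefix queries such as `trie.has_prefix("sansc")`, which is the main reason to accept its extra memory and complexity.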


SwiftsNamesake commented on July 16, 2024

@arsho Thank you for pointing out the bug in my code snippet. I would still go with the generator expression, rather than the list comprehension, though.

set(word.strip() for word in f.readlines())

The reason I suggested a set is that you don't seem to be using the values at all. If users need a dictionary with additional information about each word, they can create one themselves. The placeholder values won't help in that regard.

Sets are made for inclusion checks, so I'm surprised to hear that they're less efficient than dicts. Aren't they based on the same data-structure? Why would sets be slower than dicts?

I didn't read the message carefully, so I didn't realise that the .py file was merely a demo. Sorry about that. However, there is already a description of the data layout in the README (new-line-delimited text). How about a few brief usage examples in the README itself, for a few common languages?

Newline-separated text is a fairly simple format to process. Any person who wishes to use this library can simply load the .txt file and transform it to whatever data-structure they like. I very much doubt that parsing JSON is more efficient than f.readlines(), and turning f.readlines() into a dict is a one-liner anyway. The JSON format just adds additional noise, imho.
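As a rough illustration of the one-liners mentioned above, the sketch below turns an iterable of lines into either a set or a dict. The helper names `load_word_set` and `load_word_dict` are hypothetical and not part of this repository.

```python
def load_word_set(lines):
    """Build a set of words from an iterable of lines (e.g. an open file)."""
    return {line.strip() for line in lines if line.strip()}


def load_word_dict(lines, default=None):
    """Build a dict with every word mapped to the same default value."""
    # dict.fromkeys shares one `default` object across all keys,
    # so prefer an immutable default such as None or 0
    return dict.fromkeys((line.strip() for line in lines if line.strip()), default)
```

With a real file this would be used as `with open("words_alpha.txt") as f: words = load_word_set(f)`, since iterating over a file handle yields its lines.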

TL;DR

What are the advantages of a separate JSON file, with placeholder values?
How about a few brief usage examples in the README?
Why are dicts faster than sets?


FluxIX commented on July 16, 2024

@SwiftsNamesake Since a set requires the items to be hashable, I can't think of a reason the dictionary would have a faster look-up time than a set (since the dictionary also uses the hash of the key to determine storage location).


SwiftsNamesake commented on July 16, 2024

@FluxIX Yes, that is the point I was making to @arsho. Unless he has proof to the contrary, of course (in which case I'm all ears).


FluxIX commented on July 16, 2024

@SwiftsNamesake It also wouldn't be difficult to set up a test harness to determine how long the accesses take.
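A test harness along those lines could be sketched with the standard `timeit` module. The synthetic word list, the probe words, and the helper name `time_lookups` are assumptions for illustration. No timings are quoted here because they depend on the machine and the Python version.

```python
import timeit


def time_lookups(container, probes, repeat=5, number=1000):
    """Return the best time in seconds for `number` passes over `probes`."""
    def run():
        for probe in probes:
            probe in container  # membership check being measured

    return min(timeit.repeat(run, repeat=repeat, number=number))


# Synthetic stand-in for the real word list
words = [f"word{i}" for i in range(100_000)]
as_set = set(words)
as_dict = dict.fromkeys(words, 1)
probes = ["word5", "word99999", "notindict"]

set_time = time_lookups(as_set, probes)
dict_time = time_lookups(as_dict, probes)
print(f"set: {set_time:.6f}s  dict: {dict_time:.6f}s")
```

Taking the minimum over several repeats reduces noise from other processes. To compare on the real data, replace `words` with the contents of words_alpha.txt.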


SwiftsNamesake commented on July 16, 2024

@FluxIX I'm just waiting for @arsho to respond atm.

For what it's worth, I did some cursory research, and couldn't find any data to support his claim.


arsho commented on July 16, 2024

@SwiftsNamesake and @FluxIX, thanks a lot. I have studied the time and space complexity of set and dict and found that both have O(1) average-case time complexity for checking whether an element exists. [1]

@SwiftsNamesake,

I needed to associate values with the words in my projects. That is why I created the JSON file with placeholder values, which can be loaded and updated easily. Plus, in Python 3.6 (CPython) a dictionary preserves insertion order, which saves the time needed to sort a set, since a set is unordered.
Of course, examples can be added to the README for popular languages. But in my view, ready-made language-specific files are more appropriate than README code snippets.
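To make the "associate values with the words" use case concrete, here is a small sketch that tracks how often each word is searched. The function names `build_search_counts` and `search` are hypothetical, and the starting value of 0 is just one possible placeholder.

```python
def build_search_counts(lines):
    """Map each word to a search counter starting at 0."""
    return {line.strip(): 0 for line in lines if line.strip()}


def search(counts, word):
    """Return True if `word` is known; bump its counter on a hit."""
    if word in counts:
        counts[word] += 1
        return True
    return False


counts = build_search_counts(["unumpired\n", "sansculotte\n", "nonstriking\n"])
search(counts, "sansculotte")
search(counts, "sansculotte")
search(counts, "notindict")   # unknown word, nothing is recorded
print(counts["sansculotte"])  # 2
```

This kind of per-word state is something a set cannot hold, although the dict can equally be built from the plain text file at load time.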

Ref 1. https://wiki.python.org/moin/TimeComplexity


tomprogers commented on July 16, 2024

FWIW, I don't think it's the job of a "word list" project to provide custom formats to compensate for the performance characteristics of every downstream computing environment.

Recall the single responsibility principle. The job of this repo is to provide a machine-readable list of English words. It is everyone else's job to find the best way for their software to consume that list. Consider: I built a mobile app that uses a dictionary of 267k words. To work around mobile limitations, I broke the list into 11 files -- that wasn't the job of my wordlist source, it was my job. Similarly here: if your Python script runs too slow, write a faster script, or create a pre-processor that transforms the raw list into a format that suits your needs (which is how I did it).

A UTF-8 encoded, new-line-delimited text file is machine readable on every computing platform. This project should restrict itself to maintaining the integrity of the data, not exporting it in everybody's favorite flavor.
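As one example of such a downstream pre-processor, the sketch below splits a word list into a fixed number of contiguous chunks that could then be written to separate files. How the 11 files mentioned above were actually produced is not stated, so the splitting rule and the name `split_into_chunks` are assumptions.

```python
def split_into_chunks(words, n_chunks):
    """Split a list into n_chunks contiguous, nearly equal parts."""
    size, extra = divmod(len(words), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # the first `extra` chunks take one additional item each
        end = start + size + (1 if i < extra else 0)
        chunks.append(words[start:end])
        start = end
    return chunks
```

Because the source list is alphabetical, contiguous chunks keep each file's words in order, so a lookup only needs to open the one file whose range covers the word.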


tomprogers commented on July 16, 2024

This ticket should probably be closed, since a patch addressing it has merged to master.
