Comments (13)
@arsho You'd still have to load the data, wouldn't you, whether it's plain text or JSON? As for searching performance, how about constructing a set/prefix-tree/dict/[whatever data structure you need] upon loading the file?
from english-words.
You are right. I still have to load the data and use a faster data structure. I solved the issue by creating a JSON file named words_dictionary.json from words_alpha.txt, which can be loaded directly as a dictionary. See pull request #22.
@arsho I guess we'll have to wait for the maintainer to chime in, but why add a Python file to this library?
Beyond that, I suspect it might be faster to simply load the txt file that already exists and then create a set from it (rather than a dict, since you don't need the values):
    with open('path/to/words.txt', 'r') as f:
        words = set(f.readlines())
@SwiftsNamesake, thanks for demonstrating the use of a set. Yes, a set is good. But for frequently searching and matching words from the word list, a dictionary shows better performance. Moreover, for frequent searches it is easier to use a dictionary, whose value can store information about each word, such as search frequency or last search history.
Still, for anyone who wants to use a set, here is an example that resolves the newline issue in the program above:
    with open("words_alpha.txt", "r") as f:
        english_words = set([word.strip() for word in f.readlines()])
    search_word = ["unumpired", "sansculotte", "nonstriking", "notindict"]
    for word in search_word:
        print(word, word in english_words)
I see no problem with adding a language-specific example (here the Python file read_english_dictionary.py) to this library. It will help newcomers understand how to use this huge list of words smoothly. As far as I have seen, most popular libraries and data sets include language-specific examples of how to use them.
@arsho For inserting words and searching for words already present, a trie should probably be used, although a dictionary may fare better than a set depending on the set's implementation (dictionaries are generally sorted but a set may not be). The issue with presenting an example using a trie is that a trie is not as well known as a list, dictionary, or set, and the audience's familiarity with a data structure's interface should be considered when writing examples for a general audience.
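For readers who have not met a trie before, here is a minimal sketch of the idea built from nested dicts. The class name `Trie`, the `has_prefix` method, and the `"$"` end marker are all hypothetical choices made for illustration; nothing like this ships with the word list, and it assumes the words never contain `"$"`.

```python
class Trie:
    """Minimal prefix tree built from nested dicts (illustrative only)."""

    _END = "$"  # marker key meaning "a word ends here"; assumes no word contains "$"

    def __init__(self, words=()):
        self._root = {}
        for word in words:
            self.insert(word)

    def insert(self, word):
        """Walk down one level per character, creating nodes as needed."""
        node = self._root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def _find(self, prefix):
        """Return the node reached by following `prefix`, or None."""
        node = self._root
        for ch in prefix:
            node = node.get(ch)
            if node is None:
                return None
        return node

    def __contains__(self, word):
        node = self._find(word)
        return node is not None and self._END in node

    def has_prefix(self, prefix):
        return self._find(prefix) is not None


sample_trie = Trie(["sans", "sansculotte", "nonstriking"])
print("sansculotte" in sample_trie, sample_trie.has_prefix("sansc"))
```

What a trie adds over a set or dict is the prefix query; for plain membership checks a set is usually the simpler choice.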
@arsho Thank you for pointing out the bug in my code snippet. I would still go with the generator expression, rather than the list comprehension, though.
set(word.strip() for word in f.readlines())
The reason I suggested a set is that you don't seem to be using the values at all. If users need a dictionary with additional information about each word, they can create one themselves. The placeholder values won't help in that regard.
Sets are made for inclusion checks, so I'm surprised to hear that they're less efficient than dicts. Aren't they based on the same data-structure? Why would sets be slower than dicts?
I didn't read the message carefully, so I didn't realise that the .py file was merely a demo. Sorry about that. However, there is already a description of the data layout in the README (new-line-delimited text). How about a few brief usage examples in the README itself, for a few common languages?
Newline-separated text is a fairly simple format to process. Any person who wishes to use this library can simply load the .txt file and transform it to whatever data-structure they like. I very much doubt that parsing JSON is more efficient than f.readlines(), and turning f.readlines() into a dict is a one-liner anyway. The JSON format just adds additional noise, imho.
TL;DR
- What are the advantages of a separate JSON file, with placeholder values?
- How about a few brief usage examples in the README?
- Why are dicts faster than sets?
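To make the "one-liner" point concrete, here is a small sketch. It uses `io.StringIO` as a stand-in for the opened word file so that it runs without the data set, and the placeholder value `1` is an assumption chosen for illustration.

```python
import io

# Stand-in for open("words_alpha.txt"); same newline-delimited layout.
fake_file = io.StringIO("unumpired\nsansculotte\nnonstriking\n")
lines = fake_file.readlines()

# Set of words, with trailing newlines stripped
word_set = {line.strip() for line in lines}

# One-liner dict with a placeholder value (assumed to be 1) for every word
word_dict = dict.fromkeys((line.strip() for line in lines), 1)

print("sansculotte" in word_set, "notindict" in word_dict)
```

Either structure can be built from the plain text file in a single expression, so no separate file format is needed for this.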
@SwiftsNamesake Since a set requires its items to be hashable, I can't think of a reason the dictionary would have a faster look-up time than a set (since the dictionary also uses the hash of the key to determine the storage location).
@FluxIX Yes, that is the point I was making to @arsho. Unless he has proof to the contrary, of course (in which case I'm all ears).
@SwiftsNamesake It also wouldn't be difficult to set up a test harness to determine how long the accesses take.
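A rough sketch of such a harness, using the standard `timeit` module. The generated `sample_words` are stand-in data rather than the real word list, and `time_lookups` is a hypothetical helper name; the timings it prints will vary by machine and Python version, so no particular result is claimed here.

```python
import timeit

# Stand-in data; in practice the words would be read from words_alpha.txt.
sample_words = [f"word{i}" for i in range(100_000)]
as_set = set(sample_words)
as_dict = dict.fromkeys(sample_words, 1)
probes = ["word0", "word99999", "notindict"]


def time_lookups(container, repeat=5, number=20_000):
    """Best-of-`repeat` time in seconds for `number` rounds of membership tests."""
    def run():
        for p in probes:
            p in container  # the membership test being timed
    return min(timeit.repeat(run, repeat=repeat, number=number))


print("set :", time_lookups(as_set))
print("dict:", time_lookups(as_dict))
```

Taking the minimum over several repeats reduces noise from other processes, which matters when the two numbers are expected to be close.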
@FluxIX I'm just waiting for @arsho to respond atm.
For what it's worth, I did some cursory research, and couldn't find any data to support his claim.
@SwiftsNamesake and @FluxIX, thanks a lot. I have studied the time and space complexity of set and dict and found that both have O(1) average time complexity for checking whether an element exists. [1]
- I needed to associate values with the words in my projects. That is why I created a JSON file with placeholder values, which can be loaded and updated easily. Plus, in Python 3.6 a dictionary preserves insertion order, which saves the time needed to sort a set, which is unordered.
- Of course, examples can be added to the README for the popular languages. But in my view, ready-made language-specific files are more appropriate than README code snippets.
Ref 1. https://wiki.python.org/moin/TimeComplexity
FWIW, I don't think it's the job of a "word list" project to provide custom formats to compensate for the performance characteristics of every downstream computing environment.
Recall the single responsibility principle. The job of this repo is to provide a machine-readable list of English words. It is everyone else's job to find the best way for their software to consume that list. Consider: I built a mobile app that uses a dictionary of 267k words. To work around mobile limitations, I broke the list into 11 files -- that wasn't the job of my wordlist source, it was my job. Similarly here: if your Python script runs too slow, write a faster script, or create a pre-processor that transforms the raw list into a format that suits your needs (which is how I did it).
A UTF-8 encoded, new-line-delimited text file is machine readable on every computing platform. This project should restrict itself to maintaining the integrity of the data, not exporting it in everybody's favorite flavor.
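As an illustration of the downstream pre-processor idea, here is a minimal sketch that turns a newline-delimited word list into a JSON object. The function name `preprocess`, the file names, and the placeholder value `1` are hypothetical; the demo runs on a throwaway temporary file rather than the real list.

```python
import json
import tempfile
from pathlib import Path


def preprocess(txt_path, json_path, placeholder=1):
    """Read a newline-delimited word list and write it out as a JSON object."""
    with open(txt_path, encoding="utf-8") as src:
        words = [line.strip() for line in src if line.strip()]  # skip blank lines
    with open(json_path, "w", encoding="utf-8") as dst:
        json.dump(dict.fromkeys(words, placeholder), dst)
    return len(words)


# Demo on a throwaway file so the sketch runs without the real data set.
with tempfile.TemporaryDirectory() as tmp:
    txt = Path(tmp) / "sample_words.txt"
    out = Path(tmp) / "sample_words.json"
    txt.write_text("alpha\nbeta\n\ngamma\n", encoding="utf-8")
    count = preprocess(txt, out)
    loaded = json.loads(out.read_text(encoding="utf-8"))

print(count, loaded)
```

A script like this lives in the consuming project, so the word list itself can stay a single plain text file.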
This ticket should probably be closed, since a patch addressing it has been merged into master.