Comments (5)
Now I used:

```python
import nltk
import spacy
from nltk.corpus import words, wordnet

# Download NLTK resources
nltk.download('words')
nltk.download('wordnet')

# Load NLTK words and WordNet lemmas
nltk_words = set(words.words())
wordnet_lemmas = set(
    lemma.name() for synset in wordnet.all_synsets() for lemma in synset.lemmas()
)

# Load spaCy model and extract its vocabulary
# (iterating nlp.vocab yields Lexeme objects, so use .text rather than str())
nlp = spacy.load("en_core_web_sm")
spacy_words = set(lex.text for lex in nlp.vocab)

# Combine all sets
word_list = nltk_words.union(wordnet_lemmas).union(spacy_words)
```
But still:
thoughts (False) 👎
Finally:
programming (True)
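The checks above boil down to a plain set-membership test. A minimal sketch, with a tiny hand-built `word_list` standing in for the real combined set:

```python
# Tiny stand-in for the combined set built above (hypothetical contents).
word_list = {"programming", "dog", "run"}

def check(word):
    # Print in the same style as the results quoted above.
    print(f"{word} ({word in word_list})")

check("programming")  # programming (True)
check("thoughts")     # thoughts (False)
```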
Now I used:

```python
import os
import pickle
import xml.etree.ElementTree as ET

import spacy
from nltk.corpus import words, wordnet
from nltk.tokenize import word_tokenize


def load_and_combine_word_sets():
    """
    Load words from NLTK, WordNet, spaCy, and an XML file, then combine them into a single set.
    If a previously combined word set exists, it will be loaded from a file.
    """
    file_path = "_helps/word_list.pkl"

    # Check if a previously combined word set exists
    if os.path.exists(file_path):
        with open(file_path, "rb") as f:
            return pickle.load(f)

    # Load NLTK words and WordNet lemmas
    nltk_words = set(words.words())
    wordnet_lemmas = set(
        lemma.name() for synset in wordnet.all_synsets() for lemma in synset.lemmas()
    )

    # Load spaCy model and extract its vocabulary (Lexeme objects expose .text)
    nlp = spacy.load("en_core_web_sm")
    spacy_words = set(lex.text for lex in nlp.vocab)

    # Parse the XML file and extract words from synset examples
    tree = ET.parse("_helps/english-wordnet-2022.xml")
    root = tree.getroot()
    new_words = set()
    for synset in root.findall(".//Synset"):
        example = synset.find("Example")
        if example is not None and example.text:
            new_words.update(word_tokenize(example.text))

    # Combine all sets and add additional words (additional_words is defined elsewhere)
    combined_word_list = (
        nltk_words.union(wordnet_lemmas)
        .union(spacy_words)
        .union(new_words)
        .union(additional_words)
    )

    # Save the combined word set to a file
    with open(file_path, "wb") as f:
        pickle.dump(combined_word_list, f)
    print("Loaded word_list in load_and_combine_word_sets.")
    return combined_word_list
```
But I still find words missing from time to time.
Is there a more extensive list of English words? Right now I have about 348,000.
@BaGRoS, both `words` and `wordnet` only list uninflected word forms (i.e. lemmas). So, to look up inflected forms such as plurals, you need a lemmatizer rather than a more extensive list.
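A minimal sketch of that idea, using a hand-built lemma map as a stand-in for a real lemmatizer (the names and contents here are illustrative only; a real implementation would use `nltk.stem.WordNetLemmatizer`):

```python
# Hypothetical stand-in: maps inflected forms to their lemma.
lemma_map = {"thoughts": "thought", "dogs": "dog"}

# Lemmas only, as with the `words` and `wordnet` corpora.
word_list = {"thought", "dog", "programming"}

def is_known(word):
    # Try the word itself first, then fall back to its lemma if one is known.
    return word in word_list or lemma_map.get(word) in word_list

print(is_known("thoughts"))  # True: "thoughts" reduces to the lemma "thought"
```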
close this issue
@BaGRoS, to recognize inflected word forms, you need to pre-process your input words with a lemmatizer. For example:

```python
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> print(wnl.lemmatize('dogs'))
dog
```