Comments (5)
Now I used:

```python
import nltk
import spacy
from nltk.corpus import words, wordnet

# Download NLTK resources
nltk.download('words')
nltk.download('wordnet')

# Load NLTK words and WordNet lemmas
nltk_words = set(words.words())
wordnet_lemmas = set(
    lemma.name() for synset in wordnet.all_synsets() for lemma in synset.lemmas()
)

# Load spaCy model and extract its vocabulary
# (iterating nlp.vocab yields Lexeme objects, so use .text rather than str())
nlp = spacy.load("en_core_web_sm")
spacy_words = set(lex.text for lex in nlp.vocab)

# Combine all sets
word_list = nltk_words.union(wordnet_lemmas).union(spacy_words)
```
But still:
thoughts (False) 👎
Finally:
programming (True)
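The checks above boil down to a plain set-membership test. A minimal sketch, with a tiny hand-built `word_list` standing in for the real combined set:

```python
# Tiny stand-in for the combined set built above (hypothetical contents).
word_list = {"programming", "dog", "run"}

def check(word):
    # Print in the same style as the results quoted above.
    print(f"{word} ({word in word_list})")

check("programming")  # programming (True)
check("thoughts")     # thoughts (False)
```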
Now I used:

```python
import os
import pickle
import xml.etree.ElementTree as ET

import spacy
from nltk.corpus import words, wordnet
from nltk.tokenize import word_tokenize


def load_and_combine_word_sets():
    """
    Load words from NLTK, WordNet, spaCy, and an XML file, then combine them into a single set.
    If a previously combined word set exists, it will be loaded from a file.
    """
    file_path = "_helps/word_list.pkl"

    # Check if a previously combined word set exists
    if os.path.exists(file_path):
        with open(file_path, "rb") as f:
            return pickle.load(f)

    # Load NLTK words and WordNet lemmas
    nltk_words = set(words.words())
    wordnet_lemmas = set(
        lemma.name() for synset in wordnet.all_synsets() for lemma in synset.lemmas()
    )

    # Load spaCy model and extract its vocabulary (Lexeme objects expose .text)
    nlp = spacy.load("en_core_web_sm")
    spacy_words = set(lex.text for lex in nlp.vocab)

    # Parse the XML file and extract words from synset examples
    tree = ET.parse("_helps/english-wordnet-2022.xml")
    root = tree.getroot()
    new_words = set()
    for synset in root.findall(".//Synset"):
        example = synset.find("Example")
        if example is not None and example.text:
            new_words.update(word_tokenize(example.text))

    # Combine all sets and add additional words (additional_words is defined elsewhere)
    combined_word_list = (
        nltk_words.union(wordnet_lemmas)
        .union(spacy_words)
        .union(new_words)
        .union(additional_words)
    )

    # Save the combined word set to a file
    with open(file_path, "wb") as f:
        pickle.dump(combined_word_list, f)
    print("Loaded word_list in load_and_combine_word_sets.")
    return combined_word_list
```
But I still find words missing from time to time.
Is there a more extensive list of English words? Right now I have about 348,000.
@BaGRoS, both `words` and `wordnet` only list uninflected word forms (i.e. lemmas). So, to look up inflected forms such as plurals, you need a lemmatizer rather than a more extensive list.
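A minimal sketch of that idea, using a hand-built lemma map as a stand-in for a real lemmatizer (the names and contents here are illustrative only; a real implementation would use `nltk.stem.WordNetLemmatizer`):

```python
# Hypothetical stand-in: maps inflected forms to their lemma.
lemma_map = {"thoughts": "thought", "dogs": "dog"}

# Lemmas only, as with the `words` and `wordnet` corpora.
word_list = {"thought", "dog", "programming"}

def is_known(word):
    # Try the word itself first, then fall back to its lemma if one is known.
    return word in word_list or lemma_map.get(word) in word_list

print(is_known("thoughts"))  # True: "thoughts" reduces to the lemma "thought"
```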
close this issue
@BaGRoS, to recognize inflected word forms, you need to pre-process your input words with a lemmatizer. For example:

```python
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> print(wnl.lemmatize('dogs'))
dog
```