
Comments (5)

BaGRoS commented on September 20, 2024

Now I used:

import nltk
import spacy
from nltk.corpus import words, wordnet

# Download NLTK resources
nltk.download('words')
nltk.download('wordnet')

# Load NLTK words and WordNet lemmas
nltk_words = set(words.words())
wordnet_lemmas = set(lemma.name() for synset in wordnet.all_synsets() for lemma in synset.lemmas())

# Load Spacy model
nlp = spacy.load("en_core_web_sm")

# Extract Spacy vocabulary (iterating nlp.vocab yields Lexeme objects; .text is the string form)
spacy_words = set(lex.text for lex in nlp.vocab)

# Combine all sets
word_list = nltk_words.union(wordnet_lemmas).union(spacy_words)

But still:
thoughts (False) 👎

Finally:
programming (True)

from nltk.

BaGRoS commented on September 20, 2024

Now I used:

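import os
import pickle
import xml.etree.ElementTree as ET

import spacy
from nltk.corpus import words, wordnet
from nltk.tokenize import word_tokenize


def load_and_combine_word_sets():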
    """
    Load words from NLTK, WordNet, SpaCy, and an XML file, then combine them into a single set.
    If a previously combined word set exists, it will be loaded from a file.
    """
    file_path = "_helps/word_list.pkl"

    # Check if a previously combined word set exists
    if os.path.exists(file_path):
        with open(file_path, "rb") as f:
            return pickle.load(f)

    # Load NLTK words and WordNet lemmas
    nltk_words = set(words.words())
    wordnet_lemmas = set(
        lemma.name() for synset in wordnet.all_synsets() for lemma in synset.lemmas()
    )

    # Load SpaCy model and extract vocabulary
    nlp = spacy.load("en_core_web_sm")
    # Iterating nlp.vocab yields Lexeme objects; .text is the string form
    spacy_words = set(lex.text for lex in nlp.vocab)

    # Parse the XML file and extract words
    tree = ET.parse("_helps/english-wordnet-2022.xml")
    root = tree.getroot()
    new_words = set()
    for synset in root.findall(".//Synset"):
        example = synset.find("Example")
        # Skip synsets without an example or with an empty <Example> element
        if example is not None and example.text:
            new_words.update(word_tokenize(example.text))

    # Combine all sets and add additional words
    # (additional_words is not defined in this snippet; it must be a set defined elsewhere)
    combined_word_list = (
        nltk_words.union(wordnet_lemmas)
        .union(spacy_words)
        .union(new_words)
        .union(additional_words)
    )

    # Save the combined word set to a file
    with open(file_path, "wb") as f:
        pickle.dump(combined_word_list, f)

    print("Loaded word_list in load_and_combine_word_sets.")
    return combined_word_list

But from time to time I still find words missing.
Is there a more extensive list of English words? Right now I have about 348,000 entries.


ekaf commented on September 20, 2024

@BaGRoS, both words and wordnet list only uninflected word forms (i.e. lemmas). So, to look up inflected forms such as plurals, you need a lemmatizer rather than a more extensive list.


Higgs32584 commented on September 20, 2024

close this issue


ekaf commented on September 20, 2024

@BaGRoS, to recognize inflected word forms, you need to pre-process your input words with a lemmatizer. For example:

 >>> from nltk.stem import WordNetLemmatizer
 >>> wnl = WordNetLemmatizer()
 >>> print(wnl.lemmatize('dogs'))
 dog
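
To connect this to the word-list check discussed earlier in the thread, here is a minimal sketch of a "lemmatize, then look up" helper. The names `is_known_word`, `make_nltk_checker`, and `POS_TAGS` are hypothetical and not part of NLTK. Only `WordNetLemmatizer.lemmatize(word, pos)` and `words.words()` are real NLTK calls. Trying several part-of-speech tags is an assumption made here because `lemmatize` defaults to nouns.

```python
from typing import Callable, Iterable, Set

# WordNet part-of-speech tags: noun, verb, adjective, adverb
POS_TAGS = ("n", "v", "a", "r")


def is_known_word(
    word: str,
    vocabulary: Set[str],
    lemmatize: Callable[[str, str], str],
    pos_tags: Iterable[str] = POS_TAGS,
) -> bool:
    """Return True if the word, or one of its candidate lemmas, is in the vocabulary."""
    candidate = word.lower()
    if candidate in vocabulary:
        return True
    # Try each part of speech, since the lemma depends on it
    return any(lemmatize(candidate, pos) in vocabulary for pos in pos_tags)


def make_nltk_checker() -> Callable[[str], bool]:
    """Hypothetical wiring to NLTK; needs the 'words' and 'wordnet' resources."""
    # Imported lazily so is_known_word stays usable without NLTK installed
    from nltk.corpus import words
    from nltk.stem import WordNetLemmatizer

    vocabulary = {w.lower() for w in words.words()}
    wnl = WordNetLemmatizer()
    return lambda word: is_known_word(word, vocabulary, wnl.lemmatize)
```

With the NLTK resources downloaded, `check = make_nltk_checker()` followed by `check('thoughts')` should look up the lemma `thought` instead of the plural form, so no larger word list is needed for regular inflections.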

