
word_forms's Introduction


## Accurately generate all possible forms of an English word

Word forms can accurately generate all possible forms of an English word. It can conjugate verbs. It can connect different parts of speech, e.g. noun to adjective, adjective to adverb, noun to verb etc. It can pluralize singular nouns. It does all of this in one function. Enjoy!

Examples

Some very timely examples :-P

>>> from word_forms.word_forms import get_word_forms
>>> get_word_forms("president")
{'n': {'president', 'Presidents', 'President', 'presidentship', 'presidencies', 'presidency', 'presidentships', 'presidents'},
 'r': {'presidentially'},
 'a': {'presidential'},
 'v': {'presiding', 'presides', 'preside', 'presided'}}
>>> get_word_forms("elect")
{'n': {'elector', 'elects', 'electors', 'elective', 'electorates', 'elect', 'electives', 'elections', 'electorate', 'eligibility', 'election', 'eligibilities'},
 'r': set(),
 'a': {'elect', 'electoral', 'elective', 'eligible'},
 'v': {'elect', 'elects', 'electing', 'elected'}}
>>> get_word_forms("politician")
{'r': {'politically'},
 'a': {'political'},
 'n': {'politicss', 'politician', 'politicians', 'politics'},
 'v': set()}
>>> get_word_forms("trump")
{'n': {'trump', 'trumps', 'trumping', 'trumpings'},
 'r': set(),
 'a': set(),
 'v': {'trumped', 'trump', 'trumps', 'trumping'}}

As you can see, the output is a dictionary with four keys. "r" stands for adverb, "a" for adjective, "n" for noun and "v" for verb. Don't ask me why "r" stands for adverb. This is what WordNet uses, so this is why I use it too :-)
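If you only care whether a word is among the forms, regardless of part of speech, the dictionary is easy to flatten. Below is a minimal sketch that works on a copy of the "president" output shown above, so it runs without the package installed. The helper names `flatten_forms` and `describe` and the `POS_NAMES` table are made up for this example and are not part of word_forms.

```python
# Copy of the get_word_forms("president") output from the examples above.
sample = {
    "n": {"president", "Presidents", "President", "presidentship",
          "presidencies", "presidency", "presidentships", "presidents"},
    "r": {"presidentially"},
    "a": {"presidential"},
    "v": {"presiding", "presides", "preside", "presided"},
}

# Readable names for the WordNet part-of-speech tags (hypothetical helper table).
POS_NAMES = {"n": "noun", "v": "verb", "a": "adjective", "r": "adverb"}


def flatten_forms(forms_by_pos):
    """Merge the per-part-of-speech sets into one set of lowercase forms."""
    merged = set()
    for forms in forms_by_pos.values():
        merged.update(form.lower() for form in forms)
    return merged


def describe(forms_by_pos):
    """Map each readable part-of-speech name to a sorted list of forms."""
    return {POS_NAMES[tag]: sorted(forms) for tag, forms in forms_by_pos.items()}


all_forms = flatten_forms(sample)
print("presidential" in all_forms)
print(describe(sample)["adverb"])
```

Lowercasing while merging folds 'President' and 'president' into one entry, which is usually what you want for matching.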

Help can be obtained at any time by typing the following:

>>> help(get_word_forms)

Why?

In Natural Language Processing and Search, one often needs to treat words like "run" and "ran", "love" and "lovable" or "politician" and "politics" as the same word. This is usually done by algorithmically reducing each word into a base word and then comparing the base words. The process is called Stemming. For example, the Porter Stemmer reduces both "love" and "lovely" into the base word "love".

Stemmers have several shortcomings. Firstly, the base word produced by the Stemmer is not always a valid English word. For example, the Porter Stemmer reduces the word "operation" to "oper". Secondly, Stemmers have a high false negative rate, i.e. related words often end up with different base words. For example, "run" is reduced to "run" and "ran" is reduced to "ran", so the two never match. This happens because Stemmers use a set of rational rules for finding the base words, and as we all know, the English language does not always behave very rationally.
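To make the two shortcomings concrete, here is a toy rule-based suffix stripper. It is deliberately naive and is NOT the Porter algorithm. The suffix list and the function name `toy_stem` are assumptions chosen only for illustration.

```python
# Toy suffix rules, longest first. These are illustrative, not Porter's rules.
SUFFIXES = ("ations", "ation", "ing", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Regular inflections are matched by the rules ...
print(toy_stem("runs") == toy_stem("run"))
# ... but an irregular form like "ran" is missed (a false negative).
print(toy_stem("ran") == toy_stem("run"))
# The base word is not necessarily a dictionary word.
print(toy_stem("operation"))
```

Any fixed rule set runs into the same wall: irregular forms simply do not follow the rules.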

Lemmatizers are more accurate than Stemmers because they produce a base form that is present in the dictionary (also called the Lemma). So the reduced word is always a valid English word. However, Lemmatizers also have false negatives because they are not very good at connecting words across different parts of speech. The WordNet Lemmatizer included with NLTK fails at almost all such examples. "operations" is reduced to "operation" and "operate" is reduced to "operate", so the noun and the verb are never connected.

Word Forms tries to solve this problem by finding all possible forms of a given English word. It can perform verb conjugations, connect noun forms to verb forms, adjective forms and adverb forms, pluralize singular forms etc.

Compatibility

Works on both Python 2 and Python 3.

Installation

1. Clone the repository.

git clone https://github.com/gutfeeling/word_forms.git

2. Install it using pip or setup.py install

pip install -e word_forms

OR

cd word_forms
python setup.py install

Alternatively, add this to your pip requirements file:

git+https://github.com/gutfeeling/word_forms.git#egg=word_forms

Acknowledgement

  1. The XTAG project for information on verb conjugations.
  2. WordNet

Maintainer

Hi, I am Dibya and I maintain this repository. I would love to hear from you. Feel free to get in touch with me at [email protected].

Contributions

Word Forms is not perfect. In particular, a couple of aspects can be improved.

  1. It sometimes generates non-dictionary words like "politicss" because the pluralization/singularization algorithm is not perfect. At the moment, I am using inflect for it.

  2. A function has_same_base_form for comparing two words can be added. At the moment, the information that "run" and "ran" are connected can only be figured out by querying get_word_forms("run") and not get_word_forms("ran"). This could be solved by creating a database of equivalence classes using this package (if word forms is an equivalence relation).
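As a starting point for the second item, here is a rough sketch of what has_same_base_form could look like. It is not part of the package. It simply queries both words and checks each result for the other word, which sidesteps the "run" vs "ran" asymmetry without building the equivalence class database. The `lookup` parameter is an assumption: any callable that returns a dictionary shaped like the output of get_word_forms. The `fake_lookup` function and its data are made up so that the sketch runs on its own.

```python
def has_same_base_form(word_a, word_b, lookup):
    """Return True if either word appears among the other's generated forms.

    `lookup` is any callable returning a dict that maps part-of-speech tags
    to sets of words, i.e. the same shape as the get_word_forms output.
    """
    def forms_of(word):
        merged = set()
        for forms in lookup(word).values():
            merged.update(forms)
        return merged

    # Check both directions, because only one query may know about the link.
    return word_b in forms_of(word_a) or word_a in forms_of(word_b)


def fake_lookup(word):
    """Stand-in for get_word_forms with made-up data (illustration only)."""
    empty = {"n": set(), "v": set(), "a": set(), "r": set()}
    table = {
        "run": {"n": {"run", "runs", "runner"},
                "v": {"run", "runs", "ran", "running"},
                "a": set(), "r": set()},
    }
    return table.get(word, empty)


print(has_same_base_form("ran", "run", fake_lookup))
print(has_same_base_form("ran", "elect", fake_lookup))
```

With the real package you would pass get_word_forms as `lookup`. Note that this costs two queries per comparison, and it only finds direct links, so a precomputed database of equivalence classes would still be the better long-term solution.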

If you like this package, feel free to contribute. Your pull requests are most welcome.
