Git Product home page Git Product logo

lexpy's Introduction

Lexpy

lexpy Downloads PyPI version

Python 3.7 Python 3.8 Python 3.9 Python 3.10 Python 3.11 Python 3.12

PyPy3.7 PyPy3.8 PyPy3.9

  • A lexicon is a data-structure which stores a set of words. The difference between a dictionary and a lexicon is that in a lexicon there are no values associated with the words.

  • A lexicon is similar to a list or a set of words, but the internal representation is different and optimized for faster searches of words, prefixes and wildcard patterns.

  • Given a word, precisely, the search time is O(W) where W is the length of the word.

  • 2 important lexicon data-structures are Trie and Directed Acyclic Word Graph (DAWG).

Install

lexpy can be installed via Python Package Index (PyPI) using pip. The only installation requirement is that you need Python 3.7 or higher.

pip install lexpy

Interface

Interface Description Trie DAWG
Add a single word add('apple', count=2) add('apple', count=2)
Add multiple words add_all(['advantage', 'courage']) add_all(['advantage', 'courage'])
Check if exists? in operator in operator
Search using wildcard expression search('a?b*', with_count=True) search('a?b*, with_count=True)
Search for prefix matches search_with_prefix('bar', with_count=True) search_with_prefix('bar')
Search for similar words within given edit distance. Here, the notion of edit distance is same as Levenshtein distance search_within_distance('apble', dist=1, with_count=True) search_within_distance('apble', dist=1, with_count=True)
Get the number of nodes in the automaton len(trie) len(dawg)

Examples

Trie

Build from an input list, set, or tuple of words.

from lexpy import Trie

trie = Trie()

input_words = ['ampyx', 'abuzz', 'athie', 'athie', 'athie', 'amato', 'amato', 'aneto', 'aneto', 'aruba', 
               'arrow', 'agony', 'altai', 'alisa', 'acorn', 'abhor', 'aurum', 'albay', 'arbil', 'albin', 
               'almug', 'artha', 'algin', 'auric', 'sore', 'quilt', 'psychotic', 'eyes', 'cap', 'suit', 
               'tank', 'common', 'lonely', 'likeable' 'language', 'shock', 'look', 'pet', 'dime', 'small' 
               'dusty', 'accept', 'nasty', 'thrill', 'foot', 'steel', 'steel', 'steel', 'steel', 'abuzz']

trie.add_all(input_words) # You can pass any sequence types or a file-like object here

print(trie.get_word_count())

>>> 48

Build from a file or file path.

In the file, words should be newline separated.

from lexpy import Trie

# Either
trie = Trie()
trie.add_all('/path/to/file.txt')

# Or
with open('/path/to/file.txt', 'r') as infile:
     trie.add_all(infile)

Check if exists using the in operator

print('ampyx' in trie)

>>> True

Prefix search

print(trie.search_with_prefix('ab'))

>>> ['abhor', 'abuzz']
print(trie.search_with_prefix('ab', with_count=True))

>>> [('abuzz', 2), ('abhor', 1)]

Wildcard search using ? and *

  • ? = 0 or 1 occurrence of any character

  • * = 0 or more occurrence of any character

print(trie.search('a*o*'))

>>> ['amato', 'abhor', 'aneto', 'arrow', 'agony', 'acorn']

print(trie.search('a*o*', with_count=True))

>>> [('amato', 2), ('abhor', 1), ('aneto', 2), ('arrow', 1), ('agony', 1), ('acorn', 1)]

print(trie.search('su?t'))

>>> ['suit']

print(trie.search('su?t', with_count=True))

>>> [('suit', 1)]

Search for similar words using the notion of Levenshtein distance

print(trie.search_within_distance('arie', dist=2))

>>> ['athie', 'arbil', 'auric']

print(trie.search_within_distance('arie', dist=2, with_count=True))

>>> [('athie', 3), ('arbil', 1), ('auric', 1)]

Increment word count

  • You can either add a new word or increment the counter for an existing word.
trie.add('athie', count=1000)

print(trie.search_within_distance('arie', dist=2, with_count=True))

>>> [('athie', 1003), ('arbil', 1), ('auric', 1)]

Directed Acyclic Word Graph (DAWG)

  • DAWG supports the same set of operations as a Trie. The difference is the number of nodes in a DAWG is always less than or equal to the number of nodes in Trie.

  • They both are Deterministic Finite State Automata. However, DAWG is a minimized version of the Trie DFA.

  • In a Trie, prefix redundancy is removed. In a DAWG, both prefix and suffix redundancies are removed.

  • In the current implementation of DAWG, the insertion order of the words should be alphabetical.

  • The implementation idea of DAWG is borrowed from http://stevehanov.ca/blog/?id=115

from lexpy import Trie, DAWG

trie = Trie()
trie.add_all(['advantageous', 'courageous'])

dawg = DAWG()
dawg.add_all(['advantageous', 'courageous'])

len(trie) # Number of Nodes in Trie
23

dawg.reduce() # Perform DFA minimization. Call this every time a chunk of words are uploaded in DAWG.

len(dawg) # Number of nodes in DAWG
21

DAWG

The APIs are exactly same as the Trie APIs

Build a DAWG

from lexpy import DAWG
dawg = DAWG()

input_words = ['ampyx', 'abuzz', 'athie', 'athie', 'athie', 'amato', 'amato', 'aneto', 'aneto', 'aruba', 
               'arrow', 'agony', 'altai', 'alisa', 'acorn', 'abhor', 'aurum', 'albay', 'arbil', 'albin', 
               'almug', 'artha', 'algin', 'auric', 'sore', 'quilt', 'psychotic', 'eyes', 'cap', 'suit', 
               'tank', 'common', 'lonely', 'likeable' 'language', 'shock', 'look', 'pet', 'dime', 'small' 
               'dusty', 'accept', 'nasty', 'thrill', 'foot', 'steel', 'steel', 'steel', 'steel', 'abuzz']


dawg.add_all(input_words)
dawg.reduce()

dawg.get_word_count()

>>> 48

Check if exists using the in operator

print('ampyx' in dawg)

>>> True

Prefix search

print(dawg.search_with_prefix('ab'))

>>> ['abhor', 'abuzz']
print(dawg.search_with_prefix('ab', with_count=True))

>>> [('abuzz', 2), ('abhor', 1)]

Wildcard search using ? and *

? = 0 or 1 occurance of any character

* = 0 or more occurance of any character

print(dawg.search('a*o*'))

>>> ['amato', 'abhor', 'aneto', 'arrow', 'agony', 'acorn']

print(dawg.search('a*o*', with_count=True))

>>> [('amato', 2), ('abhor', 1), ('aneto', 2), ('arrow', 1), ('agony', 1), ('acorn', 1)]

print(dawg.search('su?t'))

>>> ['suit']

print(dawg.search('su?t', with_count=True))

>>> [('suit', 1)]

Search for similar words using the notion of Levenshtein distance

print(dawg.search_within_distance('arie', dist=2))

>>> ['athie', 'arbil', 'auric']

print(dawg.search_within_distance('arie', dist=2, with_count=True))

>>> [('athie', 3), ('arbil', 1), ('auric', 1)]

Alphabetical order insertion

If you insert a word which is lexicographically out-of-order, ValueError will be raised.

dawg.add('athie', count=1000)

ValueError

ValueError: Words should be inserted in Alphabetical order. <Previous word - thrill>, <Current word - athie>

Increment the word count

  • You can either add an alphabetically greater word with a specific count or increment the count of the previous added word.
dawg.add_all(['thrill']*20000) # or dawg.add('thrill', count=20000)

print(dawg.search('thrill', with_count=True))

>> [('thrill', 20001)]

Special Characters

Special characters, except ? and *, are matched literally.

from lexpy import Trie
t = Trie()
t.add('a©')
t.search('a©')
>> ['a©']
t.search('a?')
>> ['a©']
t.search('?©')
>> ['a©']

Trie vs DAWG

Number of nodes comparison

Build time comparison

Future Work

These are some ideas which I would love to work on next in that order. Pull requests or discussions are invited.

  • Merge trie and DAWG features in one data structure
    • Support all functionalities and still be as compressed as possible.
  • Serialization / Deserialization
    • Pickle is definitely an option.
  • Server (TCP or HTTP) to serve queries over the network.

Fun Facts

  1. The 45-letter word pneumonoultramicroscopicsilicovolcanoconiosis is the longest English word that appears in a major dictionary. So for all english words, the search time is bounded by O(45).
  2. The longest technical word(not in dictionary) is the name of a protein called as titin. It has 189,819 letters and it is disputed whether it is a word.

lexpy's People

Contributors

aosingh avatar tomsonboylett avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

lexpy's Issues

Incorrect order of answers when using the wildcard '*' in DAWG

Hi,

I wonder if there is a small issue in the file automata.py, function __words_with_wildcard, between lines 128 and 147, when the case letter=='*' is processed.

If the dictionary is made of, for example, "CHIAC" and "CHIC", and the query is "CHI*C", the result will be return in an incorrect alphabetical order : "CHIC" then "CHIAC".

This is because the case words_at_current_level is processed before checking the children.

So, for "CHI*C",

  • words_at_current_level will first find "CHIC"
  • then the loop for child in node.children will find "CHIAC", resulting in an incorrect order of answers.

Any idea? Or maybe did I misunderstood the code?

Best,
Lionel

Dawg nodeid issue

HI,

There is an issue with your Dawg data-structure. The maximum nodeId that it can reach is 2. I tried to print the nodeid and val while inserting and here is the output i got

id and val at current is 1
id and val at current is 2 a
id and val at current is 2 n
id and val at current is 2 p
id and val at current is 2 e
id and val at current is 2 p
id and val at current is 2 l
id and val at current is 2 e
id and val at current is 2 b
id and val at current is 2 a
id and val at current is 2 n
id and val at current is 2 a
id and val at current is 2 n
id and val at current is 2 a
id and val at current is 2 t
id and val at current is 2 c
id and val at current is 2 a
id and val at current is 2 n
id and val at current is 2 a
id and val at current is 2 n
id and val at current is 2 a

it is not creating a child id after first child from root

The wildcard pattern `?*` should be treated the same way as `*?`

In Lexpy, ? means "zero or one character" and * means "zero or more characters". Based on this, why is the pattern ?* considered illegal while *? is allowed? Don't they both have the same semantics here:

*?: zero or more || zero or one -> zero || zero, zero || one, more || zero, more || one -> zero, one, more -> zero or more
?*: zero or one || zero or more -> zero || zero, zero || more, one || zero, one || more -> zero, more, one -> zero or more

The code at _utils.py#L15 already translates *? to *, why isn't this also done for ?*?

result = re.sub('(\*\?)+', '*', result) # Replace consecutive '*?' with a single group '*'

adding word information

As an update is there any way that we can add some information about a word for exmaple wordcount in a text or POS tag or something and result both the searched word and its value

example input: arc:1200
art:1450
bar:2300

example output: dawg.search_with_prefix("a")
[arc:1200,art:1450]

search_with_suffix function

Do you think there's a way to implement a search_with_suffix function that looks for words in the DAWG that contain some suffix? Also is there a way to search the DAWG for words that contain a substring? For instance, if I wanted words that contained the substring "ST," the function would return "first," "star," and "sophisticated" Thanks!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.