bloomfilter.py

Description

This Github implements a Bloom filter based spell checker in python. This work started from a useful idea about improving coding capability from CodeKata.

It includes:

a python implementation of a Bloom Filter
input list for dictionary creation in iso-8859-1, Latin-1
logging in the Bloom Filter and in pytest
argparse to provide help with input parameters
unit tests using pytest

Bloom filters are interesting because they are a very space efficient way to find if something is a member of a set when the set can be quite large. False positives are possible but false negatives are not. Bloom Filters are explained in detail..

Summarized from CodeKata:

 Bloom filters work by initializing a big array of bits to zero.  Then, generate an independent hash value for each word in the list you want to look up, in this case, a list of dictionary words.  The hash is used to set the corresponding bit in the array of bits. It is not an issue if there are clashes in the bit array.

To check if a word is in the list, use the same hash that was used to set up the bitmap.  If the bits in the array corresponding to the hash are set,  the the word may be in the list. If not set, then that word is not in the list.

Sometimes a Bloom filter will report a false positive. This will not happen frequently as long as the bitmap is not heavily loaded.

The Bloom filter is full when at capacity:

 m * ((ln 2 ^ 2) / abs(ln p))

 where:
 m - number of bits in the bloom filter bit array
 p - the false positive probability

Python provides a Bloomfilter but the goal of this project was to learn about implementing a Bloom filter so one was implemented.

Usage

usage: python bloomfilter.py [-h] [--prob PROB] [--items ITEMS] [--showstats]

Try out a bloomfilter

optional arguments:
  -h, --help       show this help message and exit
  --prob PROB      Probability of false positive, integer percentages, 1 is 1
                   percent or 10 is 10 percent
  --items ITEMS    Number of items in dictionary to search
  --showstats      Use without value to show bloomfilter stats

This bloomfilter is not pip installable.
The dictionary file is in data/wordlist.txt
The words to search for are in data/searchwords.txt
During development, testing different dictionaries was accomplished by copying to these files.

Highlights

Number of bits and false positive rate

Given:

    n: how many items you expect to have in your filter (e.g. 216,553)
    p: your acceptable false positive rate {0..1} (e.g. 0.01 → 1%) we want to calculate:
    m: the number of bits needed in the bloom filter
    k: the number of hash functions we should apply

The formulas:

m = -n*ln(p) / (ln(2)^2) the number of bits

k = m/n * ln(2) the number of hash functions

Assumptions:

n = 400,000 
    Note: The dictionary list has 338,882 items. This number was rounded up for convenience.
p = 1%

Calculation info

Interesting Items

Character encoding

Joelonsoftware

Quick summary of that
hashing for Bloom filters is different than filters for cryptographic functions
hashing for Bloom filters should be very, very fast with uniform spread
Python Hash functions
Here's a paper that shows that you can similuate n hash functions with just 2 hash functions.
Python stores text in Unicode code points. A bit of knowledge is necessary to process these properly. I used a provided wordlist that was Latin-1. In particular, the process of converting from a sequence of bytes (i.e. binary data) to a sequence of code points (i.e. text data) is decoding, while the reverse process is encoding. Latin-1 maps byte values directly to the first 256 Unicode points. Note that Windows has it's own "latin-1" variant called cp1252 but does not map all 256 possible byte values. Converting between and learning the nuances was an interesting learning experience.
A docker container was used to run this code and the pytest. I learned to use vscode on my local system to connect to the docker container to leverage it's environment.

Unit Tests

Run unit tests:

    python -m pytest -v -s tests/unit/testbloom.py

Unit tests demonstrate testing of base functionality

Each and every word in the dictionary is not tested to confirm it returns true
false positives are hard to test

Interesting References

Intro to Bloom Filters

DevOps

No DevOps at this time.

bloomfilter note

it's difficult to write test for false positive However, how do we write unit tests for false positives? It’s impossible to write a conclusive unit test for the “includes” method since returning True does not necessarily mean the word exists. We can run the test repeatedly to determine the probability of getting a false positive and determine if that matches our theoretical calculations, but, as Prashanth brought up, this kind of testing isn’t unit testing because running the same test doesn’t always return True.

bethz / bloomfilter Goto Github PK

bloomfilter's Introduction

bloomfilter.py

Description

Usage

Highlights

Number of bits and false positive rate

Interesting Items

Unit Tests

Interesting References

DevOps

bloomfilter note

bloomfilter's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent