
Similarity-Search-Clustering

Efficient top-K vector similarity search queries on the MNIST image dataset using custom Locality Sensitive Hashing 🔢

Suppose we have two sets of entities represented by vectors (images in our case), namely a query set and a train set. We define our problem as the task of finding, for each image of the query set, the $k$ most similar images in the train set. Such an endeavour becomes prohibitively slow as the number of query or train entities grows: if the total number of images across both sets is $N$ and the two sets are equal in size, a brute-force approach has to perform $O(N^2)$ distance computations.
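For reference, that brute-force baseline can be sketched as follows. This is an illustrative C++ snippet (the helper name and signature are ours, not taken from the repository): for a single query image it scans the entire train set and keeps the $k$ closest vectors in a max-heap.

```cpp
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Illustrative brute-force top-k search for one query image: scan every
// train vector and keep the k smallest distances in a max-heap, so the
// root is always the worst of the current best k. 'distance' stands for
// whichever metric is in use (e.g. Manhattan distance on raw pixels).
std::vector<std::size_t> bruteForceTopK(
        const std::vector<std::vector<double>>& train,
        const std::vector<double>& query,
        std::size_t k,
        double (*distance)(const std::vector<double>&, const std::vector<double>&)) {
    std::priority_queue<std::pair<double, std::size_t>> best;
    for (std::size_t i = 0; i < train.size(); ++i) {
        double d = distance(train[i], query);
        if (best.size() < k) {
            best.emplace(d, i);
        } else if (d < best.top().first) {
            best.pop();
            best.emplace(d, i);
        }
    }
    std::vector<std::size_t> result;  // indices of the k nearest train images
    while (!best.empty()) {
        result.push_back(best.top().second);
        best.pop();
    }
    return result;
}
```

Running this for every query image is exactly what produces the quadratic cost mentioned above; the methods below trade a small probability of missing a true neighbour for a drastically smaller candidate set per query.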

Algorithms:

In order to produce results in a time-efficient manner, we employ probabilistic methods that group similar items together (clustering) and restrict comparisons, with high probability, to smaller groups of similar items. We implement the following methods:

  • Locality-Sensitive Hashing - random vectors following a normal distribution are chosen from the hyperspace defined by the images. These act as borders that partition the initial space into subspaces, which are then populated by the train images. For each query image, its target subspace is inferred and the image is compared only to its neighbours within that area, so the similarity search runs over a small part of the initial train set (a sketch of one such hash function is given at the end of this section).

  • Hypercube Random Projection - a method that uses an LSH family of functions to randomly project the vectors' points onto a Hamming cube of $\log(n)$ dimensionality, where $n$ is the number of pixels of the images. It examines points that are assigned to the same or nearby vertices of the Hamming cube (see the sketch at the end of this section).

  • Clustering - in the simple version the centroids are chosen in the images' hyperspace using a probabilistic algorithm that maximises the chance of their even distribution. Images are assigned to the corresponding clusters using Lloyd's algorithm, and the centroids are then updated using K-Medians (a sketch of this loop is given at the end of this section). We also develop two clustering methods based on the structures above:

    1. LSH Range Searching - emulates an iterative increase of the range within which we search for neighbours of the centroids; in each iteration, points are assigned to their nearest centroid.

    2. Hypercube Range Searching - the same approach as above, but using Hamming hypercubes.

The clusters keep being updated until the centroids' deviation from their previous positions falls below a specified threshold. The quality of the produced clusters is evaluated through Silhouette evaluation.
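To make the LSH bullet above concrete, here is a minimal C++ sketch of one common way to realise such a hash family, namely p-stable projections with Gaussian random vectors and a random shift. The class name and parameters are illustrative; the repository's actual implementation may differ in its details.

```cpp
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Sketch of a single p-stable LSH function h(p) = floor((p·v + t) / w):
// v has i.i.d. N(0, 1) coordinates (the "random normal vector" acting as a
// border), t is a random shift in [0, w), and w is the bucket width.
// Images hashed to the same value land in the same subspace/bucket.
class LshHash {
public:
    LshHash(std::size_t dim, double w, std::mt19937& rng) : w_(w) {
        std::normal_distribution<double> gauss(0.0, 1.0);
        std::uniform_real_distribution<double> shift(0.0, w);
        v_.resize(dim);
        for (double& coord : v_) coord = gauss(rng);
        t_ = shift(rng);
    }

    // Project the image onto v, shift, and quantise into a bucket id.
    std::int64_t operator()(const std::vector<double>& p) const {
        double dot = 0.0;
        for (std::size_t i = 0; i < p.size(); ++i) dot += p[i] * v_[i];
        return static_cast<std::int64_t>(std::floor((dot + t_) / w_));
    }

private:
    std::vector<double> v_;  // random projection vector
    double t_;               // random shift
    double w_;               // bucket width
};
```

In practice several such functions are concatenated into a composite key per hash table, and a query is compared only against the train images stored in the bucket it falls into.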
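The Hamming-cube projection can be sketched in a similar spirit. Below, each cube dimension owns a Gaussian projection whose quantised value is folded to a single bit; identical projection values always map to the same bit, so nearby images tend to land on the same or adjacent vertices. Names and parameters are again illustrative assumptions rather than the repository's exact code.

```cpp
#include <cmath>
#include <cstdint>
#include <random>
#include <unordered_map>
#include <vector>

// Sketch of a random projection onto a Hamming cube of cubeDim dimensions
// (cubeDim <= 64 so a vertex fits into a 64-bit integer). Each dimension i
// has its own Gaussian projection h_i, quantised into buckets of width w,
// and a map f_i that assigns a fair random bit to every distinct bucket id.
class HypercubeProjection {
public:
    HypercubeProjection(std::size_t dim, std::size_t cubeDim, double w, std::mt19937& rng)
        : w_(w), rng_(&rng), vectors_(cubeDim), shifts_(cubeDim), bitMaps_(cubeDim) {
        std::normal_distribution<double> gauss(0.0, 1.0);
        std::uniform_real_distribution<double> shift(0.0, w);
        for (std::size_t i = 0; i < cubeDim; ++i) {
            vectors_[i].resize(dim);
            for (double& coord : vectors_[i]) coord = gauss(rng);
            shifts_[i] = shift(rng);
        }
    }

    // Map an image to a vertex of the {0,1}^cubeDim Hamming cube.
    std::uint64_t vertex(const std::vector<double>& p) {
        std::uint64_t v = 0;
        for (std::size_t i = 0; i < vectors_.size(); ++i) {
            double dot = 0.0;
            for (std::size_t j = 0; j < p.size(); ++j) dot += p[j] * vectors_[i][j];
            std::int64_t bucket =
                static_cast<std::int64_t>(std::floor((dot + shifts_[i]) / w_));
            v |= static_cast<std::uint64_t>(bit(i, bucket)) << i;
        }
        return v;
    }

private:
    // f_i: lazily assign a fair random bit to each distinct bucket id,
    // so the same bucket always contributes the same cube coordinate.
    int bit(std::size_t i, std::int64_t bucket) {
        auto it = bitMaps_[i].find(bucket);
        if (it != bitMaps_[i].end()) return it->second;
        int b = std::uniform_int_distribution<int>(0, 1)(*rng_);
        bitMaps_[i].emplace(bucket, b);
        return b;
    }

    double w_;
    std::mt19937* rng_;
    std::vector<std::vector<double>> vectors_;
    std::vector<double> shifts_;
    std::vector<std::unordered_map<std::int64_t, int>> bitMaps_;
};
```

A query is then compared against the train images stored at its own vertex and, if more candidates are needed, at vertices within a small Hamming distance (the probes).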
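For the clustering bullet, the classic (non-range-search) refinement loop described above can be sketched as follows: assign every image to its nearest centroid, recompute each centroid as the coordinate-wise median of its cluster, and stop once no centroid has moved by more than a tolerance. This is a hedged illustration with names of our own choosing, not the repository's code; centroids are assumed to be already initialised.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Manhattan (L1) distance between two image vectors.
static double manhattan(const std::vector<double>& a, const std::vector<double>& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) d += std::abs(a[i] - b[i]);
    return d;
}

// Sketch of the Lloyd's / K-Medians loop: assignment to the nearest centroid,
// coordinate-wise median update, repeated until the largest centroid shift
// drops below 'tolerance'.
void lloydKMedians(const std::vector<std::vector<double>>& points,
                   std::vector<std::vector<double>>& centroids,
                   double tolerance) {
    const std::size_t dim = points.front().size();
    double maxShift = std::numeric_limits<double>::max();
    while (maxShift >= tolerance) {
        // Assignment step: collect the members of each cluster.
        std::vector<std::vector<std::size_t>> clusters(centroids.size());
        for (std::size_t p = 0; p < points.size(); ++p) {
            std::size_t best = 0;
            double bestDist = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < centroids.size(); ++c) {
                double d = manhattan(points[p], centroids[c]);
                if (d < bestDist) { bestDist = d; best = c; }
            }
            clusters[best].push_back(p);
        }
        // Update step: coordinate-wise median of each non-empty cluster.
        maxShift = 0.0;
        for (std::size_t c = 0; c < centroids.size(); ++c) {
            if (clusters[c].empty()) continue;
            std::vector<double> updated(dim);
            for (std::size_t j = 0; j < dim; ++j) {
                std::vector<double> column;
                column.reserve(clusters[c].size());
                for (std::size_t p : clusters[c]) column.push_back(points[p][j]);
                std::nth_element(column.begin(),
                                 column.begin() + column.size() / 2, column.end());
                updated[j] = column[column.size() / 2];
            }
            maxShift = std::max(maxShift, manhattan(centroids[c], updated));
            centroids[c] = std::move(updated);
        }
    }
}
```

The LSH and Hypercube range-search variants replace the assignment step: instead of scanning all points for every centroid, balls of increasing radius around the centroids are queried through the corresponding index, and each retrieved point is assigned to its closest centroid.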

Dataset:

We use the famous MNIST dataset. It consists of 60,000 images, each made up of 784 pixels whose values lie in the range [0, 255].

  • Training dataset: train-images-idx3-ubyte - contains all the images of the MNIST dataset
  • Query dataset: t10k-images-idx3-ubyte - a subset of the initial dataset with 10,000 images in total

Similarity:

Once we have identified the candidates in the train set, we have to determine how similar they are to the query image. We do that by comparing the two vectors using a specific distance metric: Manhattan distance in the case of LSH, and Hamming distance for the Hypercube. When we execute a range-search approach, the same two metrics are used to decide which points fall within the given radius.
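As a small illustration, the Hamming distance between two cube vertices is simply the number of differing bits; a minimal C++ version is shown below (the Manhattan distance appears in the clustering sketch above).

```cpp
#include <bitset>
#include <cstdint>

// Hamming distance between two Hamming-cube vertices encoded as bit strings:
// XOR leaves a 1 exactly where the vertices differ, then count the ones.
inline int hammingDistance(std::uint64_t a, std::uint64_t b) {
    return static_cast<int>(std::bitset<64>(a ^ b).count());
}
```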

Compilation & Execution:

In order to compile the code, simply open the command line and enter: make. The compilation produces three executables with distinct names, one for each of the methods described above. You can run the implemented algorithms as follows:

  • LSH:

./lsh -d <input file> -q <query file> -k <number of hash tables> -L <number of projection functions> -ο <output file> -Ν <number of nearest neighbours> -R <radius>

  • Hypercube:

./cube -d <input file> -q <query file> -k <hypercube dimensions> -M <maximum points to check> -probes <number of probes> -ο <output file> -Ν <number of nearest neighbours> -R <radius>

  • Clustering:

./cluster -i <input file> -c <configuration file> -o <output file> -complete <optional> -m <clustering method: Classic / LSH / HyperCube>
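For example, a run against the MNIST files listed above could look like the following; the numeric values here are illustrative placeholders, not values prescribed by the project:

./lsh -d train-images-idx3-ubyte -q t10k-images-idx3-ubyte -k 4 -L 5 -ο lsh_output.txt -Ν 10 -R 10000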

Further information:

We include a README PDF file that contains an extensive description of the project setup, the inner workings of the respective algorithms, and our design choices, together with a description of the contents of each source file. This PDF is written in Greek!

Big thanks to my fellow collaborator: Anastasios Melidonis

Built as part of the course: Software Development for Algorithmic Problems, Winter of 2020. University of Athens, DiT.
