
Dirichlet process mixture models

This repository implements a Dirichlet process mixture model (dpmm). It assumes the data are a mixture of multinoulli vectors. In a related project, I implemented the dpmm assuming a mixture of Gaussians on the data. That project explains much of the workings of the Dirichlet process. So in this readme, we will only discuss the particulars of the multinoulli model and some insights on Gibbs sampling.

What is a Multinoulli and why should we care?

We can think of a multinoulli distribution as a series of Bernoulli distributions. For example, suppose we do quality control of car parts. Then we have questions like "Is the rear light broken, yes/no?", "Does the engine make a strange sound, yes/no?", "Are the tires at incorrect pressure, yes/no?" "Is the oil at an incorrect level, yes/no?" and "Does the hood have a scratch, yes/no?". Such a measurement is just a series of 0's and 1's, like $x = (0, 1, 0, 0, 1)$. Each question has a probability of being answered yes or no. In other words, each variable is Bernoulli distributed. Then we say $x$ is multinoulli distributed.

An example of a probability vector for a multinoulli is $\pi = (0.3, 0.9, 0.3, 0.8, 0.1)$. You may read this as

  • With probability 0.3, the rear light is broken
  • With probability 0.9, the engine makes a strange sound
  • With probability 0.3, the tires are at incorrect pressure
  • With probability 0.8, the oil level is incorrect
  • With probability 0.1, the hood has a scratch

(Note that the probabilities do not sum to 1! Each question has its own probability of being answered yes or no. This contrasts with the multinomial distribution, where the probabilities must indeed sum to 1. That would mean exactly one question is answered yes, which does not fit our model.)
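
For intuition, here is a minimal sketch in Python (numpy as the only dependency) of drawing one multinoulli vector with the example $\pi$ above: each component is simply an independent Bernoulli draw.

```python
import numpy as np

# Example probability vector from above: one independent Bernoulli per question
pi = np.array([0.3, 0.9, 0.3, 0.8, 0.1])

rng = np.random.default_rng(0)
# A multinoulli draw is a vector of independent Bernoulli draws
x = rng.binomial(n=1, p=pi)
print(x)  # e.g. [0 1 0 1 0]
```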

How to imagine a mixture model of Multinoulli's?

In a mixture model, we imagine that the problems of the cars in our garage fall into some number of clusters. Each cluster associates with a multinoulli distribution. In our example, we may observe that cars with broken rear lights usually also have a scratch. This corresponds to the cluster $\pi = (0.9, 0.1, 0.2, 0.1, 0.8)$. Or the cars with engine trouble usually also have incorrect oil levels, corresponding to a cluster $\pi = (0.3, 0.9, 0.3, 0.8, 0.1)$.
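
To make this concrete, here is a minimal sketch of generating data from such a mixture, using the two example clusters above. The equal mixing weights are an assumption for illustration, not part of the model description.

```python
import numpy as np

rng = np.random.default_rng(1)

# The two example cluster parameters from above
pis = np.array([[0.9, 0.1, 0.2, 0.1, 0.8],   # broken rear light + scratch
                [0.3, 0.9, 0.3, 0.8, 0.1]])  # engine trouble + bad oil level
weights = np.array([0.5, 0.5])               # assumed mixing weights

n = 10
z = rng.choice(len(weights), size=n, p=weights)  # cluster assignment per car
X = rng.binomial(n=1, p=pis[z])                  # one multinoulli draw per car
print(z)
print(X)
```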

Collapsed Gibbs sampling

So how do we go about working with this model?

The important variables in our model are the cluster assignments. The variable $z_i$ is an integer that tells us which cluster the point $x_i$ belongs to. For clarity, this figure summarizes a graphical representation of our model (from Murphy, Section 25.2, page 887):

[figure: graphical model]

As we only care about the cluster assignments, we can integrate out the other parameters. That way, the Gibbs sampler only needs to sample the $z_i$. More formally, we can get away with only sampling

$$p(z_i = k|z_{-i}, x, \alpha, \lambda)$$
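The repository implements this update; as an illustration, here is a minimal sketch of one collapsed Gibbs step, assuming a Beta(a, b) prior on each Bernoulli probability (playing the role of $\lambda$) so the cluster parameters can be integrated out in closed form. The function name and signature are hypothetical, not the repository's API.

```python
import numpy as np

def resample_z_i(i, x, z, alpha=1.0, a=1.0, b=1.0, rng=None):
    """One collapsed Gibbs step: resample z[i] given all other assignments.

    x: (N, D) binary data; z: (N,) integer cluster labels.
    Cluster parameters are integrated out under a Beta(a, b) prior,
    so only per-cluster counts are needed.  (Hypothetical sketch.)
    """
    rng = rng or np.random.default_rng()
    z = z.copy()
    z[i] = -1  # temporarily remove point i from its cluster
    labels = np.unique(z[z >= 0])

    log_probs = []
    for k in labels:
        members = x[z == k]
        n_k, s_k = len(members), members.sum(axis=0)
        p1 = (a + s_k) / (a + b + n_k)  # Beta-Bernoulli predictive per feature
        log_lik = np.sum(x[i] * np.log(p1) + (1 - x[i]) * np.log(1 - p1))
        log_probs.append(np.log(n_k) + log_lik)  # CRP weight times likelihood

    # Opening a new cluster: predictive under the prior, weighted by alpha
    p1_new = a / (a + b)
    log_lik_new = np.sum(x[i] * np.log(p1_new) + (1 - x[i]) * np.log(1 - p1_new))
    log_probs.append(np.log(alpha) + log_lik_new)

    log_probs = np.array(log_probs)
    probs = np.exp(log_probs - log_probs.max())  # normalize in log space
    probs /= probs.sum()
    choice = rng.choice(len(probs), p=probs)
    z[i] = labels[choice] if choice < len(labels) else labels.max(initial=-1) + 1
    return z
```

Sweeping this update over all $i$, over and over, produces the Gibbs chain.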

Gibbs sampling results in samples from a posterior. Now what?

Gibbs sampling results in samples from the posterior distribution over $z$. These samples allow us to calculate properties of this posterior distribution. A main result in the literature on Monte Carlo sampling is that once we have samples from a distribution, we can use them to estimate expected values of functions under this distribution, like so

$$ E[f(X)] \approx \frac{1}{N}\sum_n f(x_n) $$

For Markov chain Monte Carlo, we need to take care that the $x_n$ behave like unbiased, i.i.d. samples from $p(X)$. That has two consequences:

  • Burn in: During the burn-in period, we omit the samples from the Markov chain. The initial samples depend too much on the choice of initial conditions of the Markov chain, so they are biased. After the burn-in period, the samples from the Markov chain are valid samples from $p(X)$.
  • Sample interval: Only samples at some interval will be used. For example, we save every k-th sample and omit the intervening samples. In a Markov chain, subsequent samples are correlated, and we can only approximate $E[f(X)]$ when we have i.i.d. samples. Therefore, we use only the samples at a fixed interval (see the sketch below).
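
In code, both steps amount to slicing the chain. A minimal sketch, with assumed values for the burn-in length and the interval (tune these per problem):

```python
samples = list(range(2000))        # stand-in for a chain of Gibbs samples
burn_in, interval = 500, 10        # assumed values, not from the repository
kept = samples[burn_in::interval]  # drop burn-in, then thin the chain
print(len(kept))                   # 150 retained samples
```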

Estimate the number of clusters

We will illustrate this sampling procedure by estimating the number of active clusters. In lines 61 to 75 of main_dpm.py, we track, for each Monte Carlo sample, the number of active clusters. Following the Monte Carlo idea above, the expected number of clusters is simply the sample mean of this list.
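
As an illustration (with toy stand-in samples, not the repository's variables), the estimate amounts to:

```python
import numpy as np

# kept: one z-vector per retained Gibbs sample (toy stand-in data)
kept = [np.array([0, 0, 1, 2, 1]),
        np.array([0, 1, 1, 1, 2]),
        np.array([0, 0, 0, 1, 1])]

# Number of active clusters in each sample
num_clusters = [len(np.unique(z)) for z in kept]
# Monte Carlo estimate: the sample mean
print(np.mean(num_clusters))  # sample mean ~ 2.67
```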

How the code works

The code generates random data to illustrate the workings of our model; this is the generate_dataset() function. You'll notice that the model usually finds more clusters than we actually generated. But note also the number of points assigned to each cluster: usually, we recognise the generated data in the larger clusters.

As always, I am curious about any comments and questions. Reach me at [email protected]
