
Dirichlet process mixture models

This repository implements a Dirichlet process mixture model (dpmm). It assumes the data are a mixture of multinoulli vectors. In a related project, I implemented the dpmm assuming a mixture of Gaussians on the data. That project explains much of the workings of the Dirichlet process. So in this readme, we will only discuss the particulars of the multinoulli model and some insights on Gibbs sampling.

What is a Multinoulli and why should we care?

We can think of a multinoulli distribution as a series of Bernoulli distributions. For example, suppose we do quality control of car parts. Then we have questions like "Is the rear light broken, yes/no?", "Does the engine make a strange sound, yes/no?", "Are the tires at incorrect pressure, yes/no?" "Is the oil at an incorrect level, yes/no?" and "Does the hood have a scratch, yes/no?". Such a measurement is just a series of 0's and 1's, like $x = (0, 1, 0, 0, 1)$. Each question has a probability of being answered yes or no. In other words, each variable is Bernoulli distributed. Then we say $x$ is multinoulli distributed.

An example of a probability vector for a multinoulli is $\pi = (0.3, 0.9, 0.3, 0.8, 0.1)$. You may read this as

  • With probability 0.3, the rear light is broken
  • With probability 0.9, the engine makes a strange sound
  • With probability 0.3, the tires are at incorrect pressure
  • With probability 0.8, the oil level is incorrect
  • With probability 0.1, the hood has a scratch

(Note that the probabilities do not sum to 1! Each question has its own probability of being answered yes or no. This contrasts with the multinomial distribution, where the probabilities must indeed sum to 1. That would mean exactly one question is answered yes, which does not fit our model.)
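
For intuition, here is a minimal sketch in Python (numpy as the only dependency) of drawing one multinoulli vector with the example $\pi$ above: each component is simply an independent Bernoulli draw.

```python
import numpy as np

# Example probability vector from above: one independent Bernoulli per question
pi = np.array([0.3, 0.9, 0.3, 0.8, 0.1])

rng = np.random.default_rng(0)
# A multinoulli draw is a vector of independent Bernoulli draws
x = rng.binomial(n=1, p=pi)
print(x)  # e.g. [0 1 0 1 0]
```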

How to imagine a mixture model of Multinoulli's?

In a mixture model, we imagine that the problems of the cars in our garage fall into some number of clusters. Each cluster associates with a multinoulli distribution. In our example, we may observe that cars with broken rear lights usually also have a scratch. This corresponds to the cluster $\pi = (0.9, 0.1, 0.2, 0.1, 0.8)$. Or the cars with engine trouble usually also have incorrect oil levels, corresponding to a cluster $\pi = (0.3, 0.9, 0.3, 0.8, 0.1)$.
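
To make this concrete, here is a minimal sketch of generating data from such a mixture, using the two example clusters above. The equal mixing weights are an assumption for illustration, not part of the model description.

```python
import numpy as np

rng = np.random.default_rng(1)

# The two example cluster parameters from above
pis = np.array([[0.9, 0.1, 0.2, 0.1, 0.8],   # broken rear light + scratch
                [0.3, 0.9, 0.3, 0.8, 0.1]])  # engine trouble + bad oil level
weights = np.array([0.5, 0.5])               # assumed mixing weights

n = 10
z = rng.choice(len(weights), size=n, p=weights)  # cluster assignment per car
X = rng.binomial(n=1, p=pis[z])                  # one multinoulli draw per car
print(z)
print(X)
```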

Collapsed Gibbs sampling

So how do we go about working with this model?

The important variables in our model are the cluster assignments. The variable $z_i$ is an integer that tells us which cluster the point $x_i$ belongs to. For clarity, this figure summarizes a graphical representation of our model (from Murphy, Section 25.2, page 887):

[figure: graphical model]

As we only care about the cluster assignments, we can integrate out the other parameters. That way, the Gibbs sampler only needs to sample the $z_i$. More formally, we can get away with only sampling

$$p(z_i = k|z_{-i}, x, \alpha, \lambda)$$
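The repository implements this update; as an illustration, here is a minimal sketch of one collapsed Gibbs step, assuming a Beta(a, b) prior on each Bernoulli probability (playing the role of $\lambda$) so the cluster parameters can be integrated out in closed form. The function name and signature are hypothetical, not the repository's API.

```python
import numpy as np

def resample_z_i(i, x, z, alpha=1.0, a=1.0, b=1.0, rng=None):
    """One collapsed Gibbs step: resample z[i] given all other assignments.

    x: (N, D) binary data; z: (N,) integer cluster labels.
    Cluster parameters are integrated out under a Beta(a, b) prior,
    so only per-cluster counts are needed.  (Hypothetical sketch.)
    """
    rng = rng or np.random.default_rng()
    z = z.copy()
    z[i] = -1  # temporarily remove point i from its cluster
    labels = np.unique(z[z >= 0])

    log_probs = []
    for k in labels:
        members = x[z == k]
        n_k, s_k = len(members), members.sum(axis=0)
        p1 = (a + s_k) / (a + b + n_k)  # Beta-Bernoulli predictive per feature
        log_lik = np.sum(x[i] * np.log(p1) + (1 - x[i]) * np.log(1 - p1))
        log_probs.append(np.log(n_k) + log_lik)  # CRP weight times likelihood

    # Opening a new cluster: predictive under the prior, weighted by alpha
    p1_new = a / (a + b)
    log_lik_new = np.sum(x[i] * np.log(p1_new) + (1 - x[i]) * np.log(1 - p1_new))
    log_probs.append(np.log(alpha) + log_lik_new)

    log_probs = np.array(log_probs)
    probs = np.exp(log_probs - log_probs.max())  # normalize in log space
    probs /= probs.sum()
    choice = rng.choice(len(probs), p=probs)
    z[i] = labels[choice] if choice < len(labels) else labels.max(initial=-1) + 1
    return z
```

Sweeping this update over all $i$, over and over, produces the Gibbs chain.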

Gibbs sampling results in samples from a posterior. Now what?

Gibbs sampling results in samples from the posterior distribution over $z$. These samples allow us to calculate properties of this posterior distribution. A main result in the literature on Monte Carlo sampling is that once we have samples from a distribution, we can use them to estimate expected values of functions under this distribution, like so

$$ E[f(X)] \approx \frac{1}{N}\sum_n f(x_n) $$

For Markov chain Monte Carlo, we need to take care that the $x_n$ behave like unbiased, i.i.d. samples from $p(X)$. That has two consequences:

  • Burn in: During the burn-in period, we omit the samples from the Markov chain. The initial samples depend too much on the choice of initial conditions of the Markov chain, so they are biased. After the burn-in period, the samples from the Markov chain are valid samples from $p(X)$.
  • Sample interval: Only samples at some interval will be used. For example, we save every k-th sample and omit the intervening samples. In a Markov chain, subsequent samples are correlated, and we can only approximate $E[f(X)]$ when we have i.i.d. samples. Therefore, we use only the samples at a fixed interval (see the sketch below).
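
In code, both steps amount to slicing the chain. A minimal sketch, with assumed values for the burn-in length and the interval (tune these per problem):

```python
samples = list(range(2000))        # stand-in for a chain of Gibbs samples
burn_in, interval = 500, 10        # assumed values, not from the repository
kept = samples[burn_in::interval]  # drop burn-in, then thin the chain
print(len(kept))                   # 150 retained samples
```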

Estimate the number of clusters

We will illustrate this sampling procedure by estimating the number of active clusters. In lines 61 to 75 of main_dpm.py, we track, for each Monte Carlo sample, the number of active clusters. Following the Monte Carlo idea above, the expected number of clusters is simply the sample mean of this list.
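
As an illustration (with toy stand-in samples, not the repository's variables), the estimate amounts to:

```python
import numpy as np

# kept: one z-vector per retained Gibbs sample (toy stand-in data)
kept = [np.array([0, 0, 1, 2, 1]),
        np.array([0, 1, 1, 1, 2]),
        np.array([0, 0, 0, 1, 1])]

# Number of active clusters in each sample
num_clusters = [len(np.unique(z)) for z in kept]
# Monte Carlo estimate: the sample mean
print(np.mean(num_clusters))  # sample mean ~ 2.67
```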

How the code works

The code generates random data to illustrate the workings of our model; this is the generate_dataset() function. You'll notice that the model usually finds more clusters than we actually generated. But note also the number of points assigned to each cluster: usually, we recognise the generated data in the larger clusters.

As always, I am curious about any comments and questions. Reach me at [email protected]
