Wednesday, February 5, 2020

Gaussian Mixture Model | EM Algorithm | Python Implementation

Machine Learning can be divided into two main areas: Supervised and Unsupervised Learning. The main difference between the two lies in the nature of the data as well as the approaches used to deal with it. Clustering is an unsupervised learning problem in which we intend to find clusters of points in our dataset that share some common characteristics; in other words, our job is to find sets of points that appear close together.


Gaussian Distribution

The most commonly used distribution over real numbers is the normal distribution, also known as the Gaussian distribution.

The two parameters μ ∈ ℝ and σ ∈ (0, ∞) control the normal distribution. The parameter μ gives the coordinate of the central peak and is also the mean of the distribution: E[x] = μ. The standard deviation of the distribution is given by σ and the variance by σ².
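The probability density function of the normal distribution is:

$\mathcal{N}(x; \mu, \sigma^2) = \sqrt{\dfrac{1}{2\pi\sigma^2}} \, \exp\left(-\dfrac{1}{2\sigma^2}(x-\mu)^2\right)$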


When we evaluate the PDF, we need to square and invert σ. When we need to evaluate the PDF frequently with different parameter values, a more efficient way of parametrizing the distribution is to use a parameter β ∈ (0, ∞) to control the precision, or inverse variance, of the distribution:
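$\mathcal{N}(x; \mu, \beta^{-1}) = \sqrt{\dfrac{\beta}{2\pi}} \, \exp\left(-\dfrac{\beta}{2}(x-\mu)^2\right)$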


Normal distributions are a sensible choice for many applications. In the absence of prior knowledge about what form a distribution over the real numbers should take, the normal distribution is a good default choice.


Gaussian Mixture Model

It is a probability distribution that consists of multiple Gaussian distributions, each with its own mean, covariance, and weight.

For d dimensions, the Gaussian distribution of a vector $\mathbf{x} = (x_1, x_2, \ldots, x_d)^T$ is defined by:
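$\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \Sigma) = \dfrac{1}{\sqrt{(2\pi)^d \, |\Sigma|}} \, \exp\left(-\dfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$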




where μ is the mean and Σ is the covariance matrix of the Gaussian.

Covariance is a measure of how changes in one variable are associated with changes in a second variable. Specifically, covariance measures the degree to which two variables are linearly associated. However, it is also often used informally as a general measure of how monotonically related two variables are.


Variance-Covariance Matrix

Variances and covariances are often displayed together in a variance-covariance matrix. The variances appear along the diagonal and the covariances appear in the off-diagonal elements, as shown below:
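$V = \dfrac{1}{N}\begin{bmatrix} \sum x_1^2 & \sum x_1 x_2 & \cdots & \sum x_1 x_c \\ \sum x_2 x_1 & \sum x_2^2 & \cdots & \sum x_2 x_c \\ \vdots & \vdots & \ddots & \vdots \\ \sum x_c x_1 & \sum x_c x_2 & \cdots & \sum x_c^2 \end{bmatrix}$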




where

  • $V$ is a c × c variance-covariance matrix
  • $N$ is the number of scores in each of the c data sets
  • $x_i$ is a deviation score from the ith data set
  • $\sum x_i^2 / N$ is the variance of the elements from the ith data set
  • $\sum x_i x_j / N$ is the covariance of the elements from the ith and jth data sets.
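As a quick sketch (not part of the original article), NumPy can compute this matrix directly with np.cov; note that np.cov divides by N − 1 by default, so bias=True is needed to match the $\sum x_i x_j / N$ formula above. The numbers below are purely illustrative:

```python
import numpy as np

# Five scores from each of c = 3 data sets (one data set per column) -- illustrative numbers
data = np.array([
    [2.0, 1.0, 4.0],
    [3.0, 2.0, 6.0],
    [4.0, 3.0, 5.0],
    [5.0, 2.5, 7.0],
    [6.0, 4.0, 8.0],
])

# rowvar=False: each column is a variable; bias=True divides by N as in the formula above
V = np.cov(data, rowvar=False, bias=True)
print(V)  # 3 x 3 variance-covariance matrix: variances on the diagonal, covariances off-diagonal
```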

The probability density given by a mixture of K Gaussians is:
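$p(\mathbf{x}) = \sum_{j=1}^{K} w_j \, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_j, \Sigma_j)$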



where $w_j$ is the prior probability (weight) of the jth Gaussian.



Difference between K-means and Gaussian Mixture

K-means: assign each point to the cluster k that minimizes (x − μₖ)².

Gaussian Mixture (EM clustering): assign each point to the cluster k that minimizes (x − μₖ)² / σₖ².

The mathematical difference is the denominator σ², which means GM takes the variance into consideration when it calculates the measurement.

K-means only calculates the conventional Euclidean distance.

In other words, K-means calculates distance, while GM calculates "weighted" distance.
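As a tiny illustration (my own toy numbers, not from the article): take two 1-D clusters with means μ₁ = 0, μ₂ = 4, standard deviations σ₁ = 1, σ₂ = 4, and a point at x = 1.5. Plain Euclidean distance assigns the point to the narrow cluster 1, while the variance-weighted distance favors the wider cluster 2:

```python
import numpy as np

x = 1.5
mu = np.array([0.0, 4.0])     # cluster means
sigma = np.array([1.0, 4.0])  # cluster standard deviations

euclidean = (x - mu) ** 2              # what K-means compares
weighted = (x - mu) ** 2 / sigma ** 2  # what the Gaussian mixture effectively compares

print(euclidean.argmin())  # 0 -> K-means picks the narrow cluster at mu = 0
print(weighted.argmin())   # 1 -> the weighted distance picks the wide cluster at mu = 4
```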


How is it Optimized?

One of the most popular approaches to maximize the likelihood is to use the Expectation-Maximization (EM) algorithm.

Basic ideas of the EM algorithm:

Introduce a hidden variable such that knowing it would simplify the maximization of the likelihood.

At each iteration:

  • E-Step: Estimate the distribution of the hidden variable given the data and the current value of the parameters.
  • M-Step: Maximize, with respect to the parameters, the expected log-likelihood of the joint distribution of the data and the hidden variable, using the expectation computed in the E-step.

Compare with Gradient Descent

You can obtain maximum likelihood estimates using different methods, and using an optimization algorithm is one of them. On the other hand, gradient descent can also be used to maximize functions other than the likelihood function.

When should you use it?

Any time you have unlabeled data and want to cluster it, provided the data is (approximately) normally distributed within each cluster.

EM (Expectation Maximization) Algorithm

Initially, randomly assign k cluster centers.

Then iteratively refine the clusters based on the two steps described above (the E-step and the M-step).


Python Implementation



Let's import the libraries and the iris dataset that we are going to use:
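A minimal sketch of this step, assuming NumPy and the Iris dataset bundled with scikit-learn (the variable name X is just illustrative):

```python
import numpy as np
from sklearn import datasets

# Load the Iris dataset: 150 samples, 4 numeric features each
iris = datasets.load_iris()
X = iris.data
```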




Now let's implement the Gaussian density function. You could use an existing library function for this (for example, scipy.stats.multivariate_normal), but it's actually interesting to see how things work internally. We need to create a function that implements it:
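A sketch of such a function, written against the multivariate density formula above (the name gaussian and its signature are my own choices, not necessarily those used in the original code):

```python
def gaussian(X, mu, cov):
    """Evaluate the multivariate Gaussian density N(x; mu, cov) for each row of X."""
    d = X.shape[1]
    diff = X - mu  # (N, d) deviations from the mean
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    # Row-wise quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    exponent = -0.5 * np.sum(diff @ np.linalg.inv(cov) * diff, axis=1)
    return norm_const * np.exp(exponent)
```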






Now we initialize the clusters, which is the initialization step of GMM:
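One common way to do this, sketched below: pick the initial means with k-means, start every covariance at the identity matrix, and give each component an equal weight. The dictionary keys pi_k, mu_k and cov_k are illustrative names (pi_k is the mixture weight, the $w_j$ from earlier):

```python
from sklearn.cluster import KMeans

def initialize_clusters(X, n_clusters):
    """Initialize the mixture components: k-means means, identity covariances, uniform weights."""
    kmeans = KMeans(n_clusters=n_clusters).fit(X)
    clusters = []
    for k in range(n_clusters):
        clusters.append({
            'pi_k': 1.0 / n_clusters,            # uniform mixture weight
            'mu_k': kmeans.cluster_centers_[k],  # mean vector from k-means
            'cov_k': np.identity(X.shape[1]),    # identity covariance matrix
        })
    return clusters
```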



Expectation Step

We should now calculate $\gamma(z_{nk})$. We can achieve this by means of the following expression:
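$\gamma(z_{nk}) = \dfrac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \Sigma_j)}$

where $\pi_k$ is the mixture weight (the $w_j$ from before).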




Here, we calculate the denominator as a sum over all the terms in the numerator and assign it to a variable named totals:
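A sketch of the E-step under the data layout used above (the keys gamma_nk and totals are illustrative names):

```python
def expectation_step(X, clusters):
    """E-step: compute the responsibility gamma(z_nk) of each component for each point."""
    totals = np.zeros(X.shape[0])
    for cluster in clusters:
        # Numerator: pi_k * N(x_n | mu_k, cov_k) for every point
        cluster['gamma_nk'] = cluster['pi_k'] * gaussian(X, cluster['mu_k'], cluster['cov_k'])
        totals += cluster['gamma_nk']  # denominator: sum of the numerators over all components
    for cluster in clusters:
        cluster['gamma_nk'] /= totals  # normalize to obtain the responsibilities
        cluster['totals'] = totals     # keep the per-point mixture density for the log-likelihood
    return clusters
```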


Maximization Step


Now let's implement the maximization step.
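A sketch of the M-step, which re-estimates each component's weight, mean, and covariance from the responsibilities computed in the E-step (the key names follow the sketch above):

```python
def maximization_step(X, clusters):
    """M-step: re-estimate pi_k, mu_k and cov_k from the responsibilities."""
    N = X.shape[0]
    for cluster in clusters:
        gamma_nk = cluster['gamma_nk'].reshape(-1, 1)  # (N, 1) responsibilities
        N_k = gamma_nk.sum()                           # effective number of points in the cluster

        cluster['pi_k'] = N_k / N                      # new mixture weight
        mu_k = (gamma_nk * X).sum(axis=0) / N_k        # new (responsibility-weighted) mean
        diff = X - mu_k
        cluster['cov_k'] = (gamma_nk * diff).T @ diff / N_k  # new covariance matrix
        cluster['mu_k'] = mu_k
    return clusters
```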








Finally, we have a log-likelihood calculation which is given by:
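$\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \Sigma) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \Sigma_k) \right)$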



The Python code is:
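A sketch that reuses the per-point totals stored during the E-step (the function name log_likelihood is my own):

```python
def log_likelihood(clusters):
    """Total log-likelihood of the data under the current mixture parameters."""
    # clusters[0]['totals'][n] = sum_k pi_k * N(x_n | mu_k, cov_k), computed in the E-step
    return np.sum(np.log(clusters[0]['totals']))
```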



Now, to learn how to train the model, check my GitHub, where you will find the complete code. The link is here.
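For completeness, here is a minimal sketch (mine, not the code from the GitHub repository) of how the pieces above might be combined into a training loop; the choice of 3 components and 50 iterations is arbitrary:

```python
def train_gmm(X, n_clusters=3, n_epochs=50):
    """Run EM: alternate E- and M-steps and track the log-likelihood."""
    clusters = initialize_clusters(X, n_clusters)
    likelihoods = []
    for epoch in range(n_epochs):
        clusters = expectation_step(X, clusters)      # responsibilities under the current parameters
        likelihoods.append(log_likelihood(clusters))  # should increase (or stay flat) every epoch
        clusters = maximization_step(X, clusters)     # updated pi_k, mu_k, cov_k
    return clusters, likelihoods

clusters, likelihoods = train_gmm(X, n_clusters=3)
print(likelihoods[-1])
```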

  • GMM allows us to model more complex data.
  • The EM algorithm is a series of steps to find good parameter estimates when there are latent variables.
  • EM Steps:

  1. Initialize the parameter estimates.
  2. Given the current parameter estimates, compute the expected log-likelihood of the data and the latent variables (E-step).
  3. Given the current responsibilities, find better parameter estimates (M-step).
  4. Repeat steps 2 and 3 until convergence.

  • GMM works well, but you have to guess the number of Gaussians. Kernel Density Estimation does not require that kind of guessing.

I have not written this code myself; in practice, when I need to use GMM, I use scikit-learn's implementation because of its lower time complexity. I used this article to learn the mathematics behind GMM, and I have also read research papers to get more clarity on this; the references are attached below. The code is written by oconteras309 (Oscar Contreras Carrasco).

Hope you find this article useful.

[1] Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer-Verlag, Berlin, Heidelberg, 2006.
[2] Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge, MA, 2012.