In the area of Machine Learning, we can distinguish it in two main areas: Supervised and Unsupervised Learning. The main difference between both lies in the nature of the data as well as the approaches to deal with it. Clustering is an unsupervised learning problem where we intend to find clusters of points in our dataset that share some common characteristics. In clustering our job is to find sets of points that appear close together.

## Gaussian Distribution

The most commonly used distribution over real numbers in the normal distribution also known as Gaussian Distribution.

The two parameter μ ϵ R and σ ϵ (0, ∞) control the normal distribution. The parameter μ gives the coordinate of the central peak. This is also the mean of the distribution: E[x] = μ. The standard deviation of the distribution is given by σ and the variance by σ

When we evaluate the PDF we need to square and invert σ. When we need to frequently evaluate the PDF with different parameter values, a more efficient way of parametrizing the distribution is to use a parameter β ϵ (0,∞) to control the precision or inverse variance, of the distribution:

Normal distribution are a sensible choice for many applications. In the absence of prior knowledge about what form a distribution over the real numbers should take, the normal distribution is a good default choice.

^{2}.When we evaluate the PDF we need to square and invert σ. When we need to frequently evaluate the PDF with different parameter values, a more efficient way of parametrizing the distribution is to use a parameter β ϵ (0,∞) to control the precision or inverse variance, of the distribution:

Normal distribution are a sensible choice for many applications. In the absence of prior knowledge about what form a distribution over the real numbers should take, the normal distribution is a good default choice.

## Gaussian Mixture Model

It's a probability distribution consists of multiple probability distributions.

For d dimensions, the gaussian distribution of a vector x = (x

where μ is the mean and Σ is the covariance matrix of the Gaussian.

Covariance is the measure of how changes in one variable and associated with changes in second variable. Specifically, covariance measures the degree to which two variables are linearly associated. However, it is also often used informally as a general measure of how monotonically related two variables are.

^{1},x^{2},....x^{d})^{T}is defined by:where μ is the mean and Σ is the covariance matrix of the Gaussian.

Covariance is the measure of how changes in one variable and associated with changes in second variable. Specifically, covariance measures the degree to which two variables are linearly associated. However, it is also often used informally as a general measure of how monotonically related two variables are.

## Variance-Covariance Matrix

Variance and covariance matrix are often displayed together in a variance-covariance matrix. The variances appear along the diagonal and covariances appear in the off-diagonal elements, as shown:

where

V is a c x c variance-covariance matrix

N is the number of scores in each of the c data sets

x

Σ x

Σ x

The probability given in a mixture of K Gaussians is:

where w

Kmeans: find k to minimize (x - μ

Gaussian Mixture (EM clustering): find k to minimize (x - μ

The difference (mathematically) is the denominator "&sigma

Kmeans only calculates conventional Euclidean distance.

In other words, Kmeans calculate distance, while GM calculates "weighted" distance.

_{i}is a deviation score from the ith data setΣ x

_{i}^{2}/ N is the variance of elements from the ith data setΣ x

_{i}x_{j}/ N is the covariance for elements from the ith and jth data sets.The probability given in a mixture of K Gaussians is:

where w

_{j}is the prior probability (weight) of the jth Gaussian.**Difference between Kmeans and Gaussian Mixture.**Kmeans: find k to minimize (x - μ

_{k})^{2}.Gaussian Mixture (EM clustering): find k to minimize (x - μ

_{k})^{2}/ σ^{2}.The difference (mathematically) is the denominator "&sigma

^{2}", which means GM takes variance into consideration when it calculates the measurement.Kmeans only calculates conventional Euclidean distance.

In other words, Kmeans calculate distance, while GM calculates "weighted" distance.

## How is it Optimized?

One of the most popular approaches to maximize the likelihood is to use the Expectation-Maximization (EM) algorithm.

**Basic ideas of the EM algorithm:**

Introduce a hidden variable such that its knowledge would simplify the maximization of likelihood.

At each iteration:

**E-Step:****Estimate**the distribution of the hidden variable given the data and the current value of the parameters.**M-Step:****Maximize**the joint distribution of the data and the hidden variable.

**Compare with Gradient Descent**

You can obtain maximum likelihood estimates using different methods and using an optimization algorithm is one of them. On another hand, gradient descent can be also used to maximize functions other than likelihood function.

## When should you use it?

Anytime you have unlabeled data and want to classify it. If data is normally distributed.

Iteratively refine the clusters based on two steps

**EM (Expectation Maximization) Algorithm****Initially randomly assign k cluster centers**

Iteratively refine the clusters based on two steps

## Python Implementation

Let's import the libraries and the iris dataset that we are going to use:

Now let's implement the Gaussian Density Function. You can use numpy function for that but it's actually interesting to see how things work internally. Now we need to create the function that implements this:

Now we initialize clusters which is an initialization step of GMM.

We should now calculate . We can achieve this by means of the following expression:

In this, we calculate the denominator as a sum over all terms in the numerator and then assign it to a variable named

Now let's implement the Gaussian Density Function. You can use numpy function for that but it's actually interesting to see how things work internally. Now we need to create the function that implements this:

Now we initialize clusters which is an initialization step of GMM.

**Expectation step**We should now calculate . We can achieve this by means of the following expression:

In this, we calculate the denominator as a sum over all terms in the numerator and then assign it to a variable named

**totals**

**Maximization Step**

Now let's implement the maximization step.

Finally, we have a log-likelihood calculation which is given by:

The Python code is :

Now to learn how to train the model check my GitHub where you find the complete code of this. The link is

**here**.- GMM allows us to model more complex data.
- EM Algorithm is a series of steps to find good parameter estimates when there are latent variables
- EM Steps:

- Initialize the parameter estimates
- Given the current parameter estimates, find the min. log-likelihood for data + latent variables.
- Given the current data, find the better parameter estimates
- Repeat steps 2&3.

- GMM works well but you have to guess the number of Gaussians. Kernel Density Estimation does not require that kind of guessing.

I have not written this code, Mostly if I need to use GMM I use sklearn to implement it because of less time complexity.

I used

**this**article to learn the mathematics behind GMM. I have also read research papers to get more clarity on this. This link to that research paper is attached in reference.
But I know the mathematics behind it. This code is written by oconteras309(Oscar Contreras Carrasco).

Hope you find this article useful.

[1] Bishop, Christopher M.

*Pattern Recognition and Machine Learning*(2006) Springer-Verlag Berlin, Heidelberg.
[2] Murphy, Kevin P.

*Machine Learning: A Probabilistic Perspective*(2012)*MIT Press, Cambridge, Mass,*