## Expectation-Maximization¶

#### Goal¶

Find Maximum Likelihood Estimation (MLE) (or Maximum a posteriori / MAP) when some data is missing.

Given $n$ iid-observations (data)

$${\bf X} = (\vec x_1, \vec x_2, \dots \vec x_n)$$

Note: $\vec x_i$ can be a sequence.

#### Model¶

$$(\vec x, \vec z) \sim p_\theta$$ for some (unknown) $\theta \in \Theta$

Typically $p_\theta$ is in the exponential family.

• $\vec z$: hidden (latent) random variables

#### Goal¶

Maximization of the likelihood (or log likelihood) of the parameters given the observations.

$$\theta_{mle} = \text{arg} \max_\theta p({\bf X} \mid \theta)$$

#### Marginal likelihood of the parameters w.r.t. the observed data:¶

$$\mathcal l(\theta) = p({\bf X} \mid \theta) = \prod_{i=1}^n \left(\sum_{\vec z_i} p(\vec x_i, \vec z_i \mid \theta )\right)$$

Note: For continuous latent variables the sum over $\vec z_i$ must be replaced by an integral.

#### Marginal log likelihood of the parameters w.r.t. the observed data¶

$$\mathcal L(\theta) = \log p({\bf X} \mid \theta) = \sum_{i=1}^n \left(\log \sum_{\vec z_i} p(\vec x_i, \vec z_i \mid \theta )\right)$$

The the log-likelihood is difficult to maximize: the sum is inside the log.

#### Complete data log likelihood:¶

$$\mathcal L_c(\theta) = \log p({\bf X, Z} \mid \theta) = \sum_{i=1}^n \log p(\vec x_i, \vec z_i \mid \theta )$$

can not be computed, since $\vec z_i$ is unknown.

#### Auxilary function:¶

So we introduce the expected complete data log likelihood as an auxilary function $\mathcal Q(\theta , \theta^{(t)})$

$$\mathcal Q(\theta , \theta^{(t)}) = \mathbb E \left[ \mathcal L_c(\theta) \mid {\bf X}; \theta^{(t)} \right] = \sum_{\bf Z} p({\bf Z} \mid {\bf X}, \theta^{(t)}) \log p({\bf X, Z} \mid \theta) = \sum_{i=1}^n \sum_{\vec z_i}p(\vec z_i \mid \vec x_i, \theta^{(t)}) \log p(\vec x_i, \vec z_i \mid \theta )$$

#### EM Algorithm¶

Initialize $\theta^{(0)} \in \Theta$
For $t = 0,1,2, \dots$ (until convergence) :

• E-Step (Expectation): $$\mathcal Q(\theta ,\theta^{(t)}) = \mathbb E \left[ \mathcal L_c(\theta) \mid {\bf X}; \theta_t \right]$$ Here typically values analog to the "weights" $p(\vec z_i \mid \vec x_i, \theta^{(t)})$) for the expectation of the log-likelihood are computed based on the current parameters $\theta^{(t)}$. $p(\vec z_i \mid \theta^{(t)})$ is typically also considered as a parameter and computed in the M-step.

• M-Step (Maximization):
$$\theta^{(t+1)} = \text{arg} \max_\theta \mathcal Q(\theta , \theta^{(t)})$$

Note:

In the E-Step $\mathcal Q$ is not typically computed directly. Instead terms inside of it are computed of which the MLE depends on.
These are the sufficient statistics.

### MAP¶

for MAP is the M-Step (Maximization):
$$\theta^{(t+1)} = \text{arg} \max_\theta \mathcal Q(\theta , \theta^{(t)}) + \log p(\theta)$$

### Variational Methods and EM¶

for a different view on the EM-Algorithm see EM - variational derivation.

### Example: Mixture Models / Gaussians Mixture Models¶

#### Mixture Model for the data probability distribution:¶

 z -> x



$$p(\vec x, z) = p(\vec x \mid z) p(z)$$

with the discrete latent state $z \in \{1,2, \dots K\}$

using $z_k$ for a special latent state

$$p(\vec x, z_k) = p(\vec x \mid z_k) p(z_k)$$

or (using $k$ as $z_k$)

$$p(\vec x \mid z) = \prod_{k=1}^K p(\vec x \mid z_k)^{\mathbb 1\left(k=z\right)}$$

• The indicator function $\mathbb 1\left(k=z\right)$ selects one of the $K$ the base distributions

and for the join $$p(\vec x, z) = \prod_{k=1}^K \pi_{k} p(\vec x \mid z_k)^{\mathbb 1\left(k=z\right)}$$ with the mixing coeffients $p(z_k)=\pi_k$:

• $\pi_k\geq0$
• $\sum_{k=1}^K \pi_k=1$

#### Gaussian Mixture Model (Mixture of Gaussian, GMM)¶

Typically we use for the $p(\vec x \mid z_k)$ base functions with parameters $\theta_k$.

Here we write a little bit sloppy (the latent $z_k$ selects implicitly the corresponding $\theta_k$):

$$p(\vec x \mid z_k) = p(\vec x \mid \theta_k)$$

For GMMs we use Gaussian base distributions

$$p(\vec x \mid z_k) = \mathcal N(\vec x \mid \vec \mu_{k}, \Sigma_{k})$$

• $\mathcal N(\vec x \mid \vec \mu_{k}, \Sigma_{k})$: Gaussian distribution with mean vector $\vec \mu_k$ and covariance matrix $\Sigma_k$

or (using $k$ as $z_k$)

$$p(\vec x \mid z) = \prod_{k=1}^K \mathcal N(\vec x \mid \vec \mu_{k}, \Sigma_{k})^{\mathbb 1\left(k=z\right)}$$

and for the join $$p(\vec x, z) = \prod_{k=1}^K \pi_{k} \mathcal N(\vec x \mid \vec \mu_{k}, \Sigma_{k})^{\mathbb 1\left(k=z\right)}$$

#### (Data) Marginal¶

$$p(\vec x) = \sum_{z_k=1}^{z_K} p(z_k) p(\vec x \mid z_k) = \sum_{k=1}^{K} \pi_k p(\vec x \mid \theta_k)$$

for GMM

$$p(\vec x) = \sum_{k=1}^K \pi_k \mathcal N(\vec x \mid \vec \mu_k, \Sigma _k)$$

#### Likelihood of one data point $\vec x_i$:¶

$$p(\vec x_i\mid \theta) = \sum_{k=1}^K p(k \mid \theta) p(\vec x_i \mid \theta, k) = \sum_{k=1}^K \pi_k p(\vec x_i \mid \theta _k)$$

with the parameters

$\theta = \{\pi_1, \pi_2, \dots, \pi_K, \theta_1, \theta_2, \dots, \theta_K \}$

i.e. we consider the mixing coefficients $\pi_k$ as parameters.

for GMM:

$$p(\vec x_i\mid \theta) = \sum_{k=1}^K \pi_k \mathcal N(\vec x_i \mid \vec \mu_k, \Sigma _k)$$

with the parameters $\theta = \{\pi_1, \pi_2, \dots, \pi_K, (\vec \mu_1, \Sigma_1), (\vec \mu_2, \Sigma_2), \dots, (\vec \mu_K, \Sigma_K) \}$

#### Expected complete log likelihood¶

\begin{align} \mathcal Q(\theta, \theta^{(t)}) & = \mathbb E \left[ \mathcal L_c(\theta) \mid {\bf X}; \theta_t \right] \\ & = \mathbb E \left[ \sum_{i=1}^n \log p(\vec x_i, z_i \mid \theta) \right] \\ & = \sum_{i=1}^n \mathbb E \left[ \log p(\vec x_i, z_i \mid \theta) \right] \\ & = \sum_{i=1}^n \mathbb E \left[ \log \left( \prod_{k=1}^K \left(\pi_k p(\vec x_i \mid \theta _k) \right)^{\mathbb 1(z_i=k)}\right) \right]\\ & = \sum_{i=1}^n \sum_{k=1}^K\mathbb E \left[ {\mathbb 1(z_i=k)} \log \left( \pi_k p(\vec x_i \mid \theta _k) \right) \right]\\ & = \sum_{i=1}^n \sum_{k=1}^K \sum_{z_i=1}^K \left( {\mathbb 1(z_i=k)} p(z_i\mid x_i, \theta^{(t)})\log \left( \pi_k p(\vec x_i \mid \theta _k) \right) \right)\\ & = \sum_{i=1}^n \sum_{k=1}^K \left( {\mathbb 1(z_i=k)} p(z_i\mid x_i, \theta^{(t)} ) \log \left( \pi_k p(\vec x_i \mid \theta _k) \right) \right)\\ & = \sum_{i=1}^n \sum_{k=1}^K \left( r_{ik} \log \left( \pi_k p(\vec x_i \mid \theta _k) \right) \right)\\ & = \sum_{i=1}^n \sum_{k=1}^K r_{ik} \log \pi_k + \sum_{i=1}^n \sum_{k=1}^K r_{ik} \log p(\vec x_i \mid \theta _k)\\ \end{align}

with the responsibility $r_{ik}$ that cluster $k$ is responsible for data point $\vec x_i$:

$$r_{ik} = {\mathbb 1(z_i=k)} p(z_i\mid \vec x_i, \theta^{(t)})$$

• $r_{ik} > 0$
• $\sum_{k=1}^K r_{ik} = 1$

#### E-Step¶

In the E-Step we compute the responsability $r_{ik} = \mathbb 1(z_i=k) p(z_i\mid \vec x_i, \theta^{(t)})$

with
$$p(z_i \mid \vec x_i, \theta^{(t)}) = \frac{p(z_i,\vec x_i \mid \theta^{(t)})}{p(\vec x_i\mid \theta^{(t)})} = \frac{p(z_i\mid \theta^{(t)})p(\vec x_i \mid z_i, \theta^{(t)})}{\sum_{z_i=1}^K p(z_i, \vec x_i \mid \theta^{(t)})} = \frac{p(z_i\mid \theta^{(t)})p(\vec x_i \mid z_i, \theta^{(t)})}{\sum_{z_i=1}^K p(z_i\mid \theta^{(t)})p(\vec x_i \mid z_i, \theta^{(t)})}$$

and

• $\mathbb 1(z_i=k) p(z_i \mid \theta^{(t)}) = \pi_k^{(t)}$
• $\mathbb 1(z_i=k) p(\vec x_i \mid z_i, \theta^{(t)}) = p(\vec x_i \mid \theta^{(t)}_k)$

the dependence of $r_{ik}$ on the current parameter estimates $\theta^{(t)} = \{\pi_1^{(t)}, \pi_2^{(t)}, \dots, \pi_K^{(t)}, \theta_1^{(t)}, \theta_2^{(t)}, \dots, \theta_K^{(t)} \}$ is

\begin{align} r_{ik} & = {\mathbb 1(z_i=k)} p(z_i\mid \vec x_i, \theta^{(t)}) \\ & = \frac{\pi_k^{(t)}p(\vec x_i \mid \theta^{(t)}_k)}{\sum_{k=1}^K \pi_k^{(t)}p(\vec x_i \mid \theta^{(t)}_k)} \end{align}

that holds for all mixture models.

remember for GMMs: $$p(\vec x_i \mid \theta^{(t)}_k) = \mathcal N(\vec x \mid \vec \mu^{(t)}_k, \Sigma^{(t)}_k)$$

so $r_{ij}$ can easyly computed.

In :
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

In :
# Data
cov_1 = np.array([[0.6,0.2],[0.2, 0.3]])
mean_1 = np.array([2.6,2.2])
cov_2 = np.array([[0.6,-0.55],[-0.55, 0.6]])
mean_2 = np.array([-.9,1.7])
cov_3 = np.array([[.1, 0.01],[0.01, 4.3]])
mean_3 = np.array([.9,5.5])
nb_1 = 40
nb_2 = 60
nb_3 = 50
X_1 = np.random.multivariate_normal(mean_1, cov_1, size=nb_1)
X_2 = np.random.multivariate_normal(mean_2, cov_2, size=nb_2)
X_3 = np.random.multivariate_normal(mean_3, cov_3, size=nb_3)
X_ = np.concatenate((X_1,X_2,X_3))
X = np.random.permutation(X_)
labels = np.zeros(len(X))
labels[nb_1:nb_2+nb_1] = 1
labels[nb_1+nb_2:] = 2

In :
plt.figure(figsize=(5,5))
plt.scatter(X_[:, 0], X_[:, 1], s=50, c=labels, cmap='viridis')
plt.xlabel("$x_1$")
plt.xlim(-3,5)
plt.ylim(-1,10)
_ = plt.ylabel("$x_2$") In :
#np.random.multivariate_normal?

In :
from scipy.stats import multivariate_normal

In :
means = np.random.uniform(low=-2,high=3, size=(3,2))

In :
# multivariate_normal.pdf(X, mean=mean_1, cov=cov_1).shape
#np.random.uniform?

In :
from scipy import random, linalg
# TODO: better random cov (negative of-diagonal element)
def get_random_cov(size=2, s=1.5):
m = random.rand(size,size)*s
return np.dot(m, m.T)
covs = np.array([get_random_cov() for i in range(3)])

In :
pi = np.array([0.33,0.33,0.34])

In :
means.shape
n = X.shape
K = means.shape
p_x_i__theta_k = np.ndarray([n, K])
for i, (mean, cov) in enumerate(zip(means, covs)):
p_x_i__theta_k[:,i] = multivariate_normal.pdf(X, mean=mean, cov=cov)

#pi_ks * p_x_i__theta_k

In :
def responsibility(means, covs, pi, X):
# returns a matrix r with indicies i and k

n = X.shape
K = means.shape

p_x_i__theta_k = np.ndarray([n, K])
for i, (mean, cov) in enumerate(zip(means, covs)):
p_x_i__theta_k[:,i] = multivariate_normal.pdf(X, mean=mean, cov=cov)

r = pi * p_x_i__theta_k
r = (r.T / r.sum(axis=1)).T
return r

In :
r = responsibility(means, covs, pi, X)


#### M-Step¶

Remember the M-Step: $$\theta^{(t+1)} = \text{arg} \max_\theta \mathcal Q(\theta , \theta^{(t)})$$

i.e. maximization of $\mathcal Q(\theta , \theta^{(t)})$ with respect to $\theta = \{\pi_1, \pi_2, \dots, \pi_K, (\vec \mu_1, \Sigma_1), (\vec \mu_2, \Sigma_2), \dots, (\vec \mu_K, \Sigma_K) \}$

$$\mathcal Q(\theta, \theta_t) = \sum_{i=1}^n \sum_{k=1}^K r_{ik} \log \pi_k + \sum_{i=1}^n \sum_{k=1}^K r_{ik} \log p(\vec x_i \mid \theta _k)$$

We can maximize the two terms separatly.

#### M-Step for $\pi_k$¶

the first term is $$\sum_{i=1}^n \sum_{k=1}^K r_{ik} \log \pi_k = \sum_{i,k} r_{ik} \log \pi_k$$

so we are searching for the maximum of this term under the constraint $\sum_k{\pi_k}=1$

with the Lagrange multiplier $\lambda$ we have to maximize $$\sum_{i,k} r_{ik} \log \pi_k + \lambda \left(1- \sum_{k=1}^K{\pi_k}\right)$$

setting the derivative equals zero: $$\frac{\partial}{\partial \pi_k} \left(\sum_{i,k} r_{ik} \log \pi_k + \lambda \left(1-\sum_{k=1}^K{\pi_k}\right) \right) = 0$$

$$\frac{\sum_i r_{ki}}{\pi_k} - \lambda = 0$$

or

$$\frac{\sum_i r_{ki}}{\lambda} = \pi_k$$

summation over all $k$ and using the constraint, we get: $$\sum_k \frac{\sum_i r_{ki}}{\lambda} = \sum_k \pi_k = 1$$ Using the definition for the $r_{ik}$ we have for $\lambda$: $$\lambda = \sum_{i,k} r_{ik} = \sum_i \sum_k {\mathbb 1(z_i=k)} p(z_i\mid \vec x_i, \theta^{(t)}) = \sum_i 1 = n$$

this gives with $r_{k} = \sum_{i=1}^n r_{ik}$

$$\pi_k = \frac{r_k}{n}$$

In :
def m_step_pi(r_k, n):
return r_k / n


#### M-Step for $\vec \mu_k$ and $\Sigma_k$¶

In all the computations above we didn't used the fact that the base functions are Gaussians. So they hold also for other mixture models.

TODO: explizite Herleitung

$$\vec \mu = \frac{\sum_i r_{ik} \vec x_i}{r_k}$$

$$\Sigma_k = \frac{\sum_i r_{ik}(\vec x_i - \vec \mu_k) (\vec x_i - \vec \mu_k)^T}{r_k} = \frac{\sum_i r_{ik}\vec x_i \vec x_i^T}{r_k} - \vec \mu_k \vec \mu_k^T$$

In :
# test of XX computation with einsum
n = 150
for i in range(150):
b = np.outer(X[i,:], X[i,:])
np.testing.assert_allclose(a[i],b)

In :
def m_step_mu_sigma(r, r_k, X):
#r_k = r.sum(axis=0)

mu = np.dot(r.T, X)
mu = (mu.T / r_k).T

r_k = r.sum(axis=0)
t1 = (t1.T / r_k).T
sigma = t1 - t2
return mu, sigma

In :
r_k = r.sum(axis=0)
mu, sigma = m_step_mu_sigma(r, r_k, X)

In :
sigma

Out:
array([[[2.06087833, 0.18880427],
[0.18880427, 6.45992082]],

[[0.62648901, 0.40307152],
[0.40307152, 0.44573864]],

[[5.93178835, 0.29021649],
[0.29021649, 0.09886901]]])
In :
# TODO:
# visualizaton from python data science handbook
# remove that!!!!

from matplotlib.patches import Ellipse

def draw_ellipse(position, covariance, ax=None, **kwargs):
"""Draw an ellipse with a given position and covariance"""
ax = ax or plt.gca()

# Convert covariance to principal axes
if covariance.shape == (2, 2):
U, s, Vt = np.linalg.svd(covariance)
angle = np.degrees(np.arctan2(U[1, 0], U[0, 0]))
width, height = 2 * np.sqrt(s)
else:
angle = 0
width, height = 2 * np.sqrt(covariance)

# Draw the Ellipse
for nsig in range(1, 4):
ax.add_patch(Ellipse(position, nsig * width, nsig * height,
angle, **kwargs))

def plot_gmm(means, covars, weights, X, label=False, ax=None):

ax = ax or plt.gca()
if label:
ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)
else:
ax.scatter(X[:, 0], X[:, 1], s=40, zorder=2)
ax.axis('equal')

w_factor = 0.2 / weights.max()
for pos, covar, w in zip(means, covars, weights):
draw_ellipse(pos, covar, alpha=w * w_factor)

In :
# TODO make numerically stable
def expectation_maximization(X, k=3):
n = X.shape
data_dim = X.shape

# random start values - TODO
pi = np.ones(k)/k
mus = np.random.uniform(low=-2, high=3, size=(k, data_dim))
sigmas = get_random_cov(size=data_dim, s=1.5)

converged = False

while not converged:
# E-Step
r = responsibility(mus, sigmas, pi, X)
r_k = r.sum(axis=0)
# M-Step
pi_, (mus_, sigmas_)  = m_step_pi(r_k, n), m_step_mu_sigma(r, r_k, X)
# stopping criterion
tol = 1e-02
if np.allclose(pi, pi_, rtol=tol) and np.allclose(mus, mus_, rtol=tol) and np.allclose(sigmas, sigmas_, rtol=tol):
converged = True
pi, mus, sigmas = pi_, mus_, sigmas_
return pi, mus, sigmas

In :
while True:
try:
pi, mus, sigmas = expectation_maximization(X)
break
except ValueError:
pass

In :
plot_gmm(mus, sigmas, pi, X, label=False, ax=None) #### Literature¶

• K. Murphy: Machine Learning - A Probabilistic Perspective, MIT Press, 2012
• [Row99] Sam Roweis, Zoubin Ghahramani: A Unified View of Linear Gaussian Models, Neural Computation, 11(2), 305-45, 1999
• [Rad93] Radford M. Neal and Geoffrey E. Hinton: A New View of the EM Algorithm that Justifies Incremental and Other Variants, Learning in Graphical Models, 355-368, Kluwer Academic Publishers, 1993