Following a variational derivation of the EM algorithm.
Find the Maximum Likelihood Estimate (MLE) (or the Maximum a Posteriori / MAP estimate) when some data is missing.
Given $n$ i.i.d. observations (data)
$$ {\bf X} = (\vec x_1, \vec x_2, \dots \vec x_n) $$Note: $\vec x_i$ can be a sequence.
for some (unknown) $\theta \in \Theta$
We maximize the likelihood (or log-likelihood) of the parameters given the observations:
$$ \theta_{mle} = \text{arg} \max_\theta p({\bf X} \mid \theta) $$Note: For discrete latent variables the integral over $\vec z_i$ below is replaced by a sum.
$$ \mathcal l(\theta) = p({\bf X} \mid \theta) = \prod_{i=1}^n \left(\int_{\mathcal Z} p(\vec x_i, \vec z_i \mid \theta ) d \vec z_i\right) $$The log-likelihood is difficult to maximize directly: the integral (or sum) over the latent variables is inside the log.
Here, for each observation $\vec x_i$ we have a hidden (latent) variable $\vec z_i$.
The complete-data likelihood $p(\vec x_i, \vec z_i \mid \theta)$ cannot be computed, since $\vec z_i$ is unknown.
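To make this concrete, here is a minimal sketch of the incomplete-data log-likelihood, assuming a 1-D Gaussian mixture with a discrete latent component indicator $z_i$ (the model, parameter values, and function name are made up for illustration). The `logsumexp` over the components is exactly the sum inside the log that makes direct maximization hard.

```python
# Sketch (assumed toy model, not from the notes): incomplete-data log-likelihood
# of a 1-D Gaussian mixture; the sum over the latent component z_i sits
# inside the log, which makes direct maximization over theta hard.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def log_likelihood(x, weights, means, stds):
    """log p(X | theta) = sum_i log sum_k pi_k N(x_i | mu_k, sigma_k)."""
    log_joint = np.log(weights) + norm.logpdf(x[:, None], means, stds)  # shape (n, K)
    return logsumexp(log_joint, axis=1).sum()   # marginalize z_i inside the log

x = np.array([-1.2, 0.3, 4.1, 5.0])
print(log_likelihood(x, weights=np.array([0.5, 0.5]),
                     means=np.array([0.0, 4.5]), stds=np.array([1.0, 1.0])))
```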
With arbitrary distributions $q({\bf z}_i)$ over the hidden variables and Jensen's inequality, we obtain a lower bound on the log-likelihood:
$$ \begin{align} L(\theta) &= \sum_{i} \log p({\bf x}_i \mid \theta ) \\ & = \sum_{i} \log \int_{\mathcal Z} p({\bf x}_i,{\bf z}_i \mid \theta ) d {\bf z}_i \\ & = \sum_{i} \log \int_{\mathcal Z} q({\bf z}_i) \frac{p({\bf x}_i, {\bf z}_i\mid \theta )}{q({\bf z}_i)} d {\bf z}_i \\ & \geq \sum_{i} \int_{\mathcal Z} q({\bf z}_i) \log \frac{p({\bf x}_i, {\bf z}_i \mid \theta )}{q({\bf z}_i)} d {\bf z}_i \\ & = \sum_{i} \mathbb E_{q({\bf z}_i)} \left[\log \frac{p({\bf x}_i, {\bf z}_i \mid \theta )}{q({\bf z}_i)}\right] \\ & = - \sum_{i} \mathcal D_{KL} \left( q({\bf z}_i) \mid \mid p({\bf x}_i, {\bf z}_i \mid \theta ) \right) \\ & = \mathcal L(q, \theta) \end{align} $$Note: In the second-to-last line $\mathcal D_{KL}$ is written with the unnormalized joint $p({\bf x}_i, {\bf z}_i \mid \theta )$ as its second argument, an abuse of notation. The gap between the log-likelihood $L(\theta)$ and the lower bound $\mathcal L(q, \theta)$ is
$$ \begin{align} \text{GAP} &= L(\theta) - \mathcal L (q, \theta)\\ &= \sum_{i} \log p({\bf x}_i \mid \theta ) - \sum_{i} \int_{\mathcal Z} q({\bf z}_i) \log \frac{p({\bf x}_i, {\bf z}_i \mid \theta )}{q({\bf z}_i)} d {\bf z}_i\\ &= \sum_{i} \log p({\bf x}_i \mid \theta )\int_{\mathcal Z} q({\bf z}_i) d{\bf z}_i - \sum_{i} \int_{\mathcal Z} q({\bf z}_i) \log \frac{p({\bf x}_i, {\bf z}_i \mid \theta )}{q({\bf z}_i)} d {\bf z}_i\\ &= \sum_{i} \left( \int_{\mathcal Z} q({\bf z}_i) \log p({\bf x}_i \mid \theta ) d{\bf z}_i - \int_{\mathcal Z} q({\bf z}_i) \log \frac{p({\bf x}_i, {\bf z}_i \mid \theta )}{q({\bf z}_i)} d {\bf z}_i \right)\\ &= \sum_{i} \int_{\mathcal Z} \left( q({\bf z}_i) \log p({\bf x}_i \mid \theta ) - q({\bf z}_i) \log \frac{p({\bf x}_i, {\bf z}_i \mid \theta )}{q({\bf z}_i)} \right) d {\bf z}_i \\ &= \sum_{i} \int_{\mathcal Z} q({\bf z}_i) \log \frac{p({\bf x}_i \mid \theta ) q({\bf z}_i) }{p({\bf x}_i, {\bf z}_i \mid \theta )} d {\bf z}_i \\ &= \sum_{i} \int_{\mathcal Z} q({\bf z}_i) \log \frac{p({\bf x}_i \mid \theta ) q({\bf z}_i) }{p({\bf z}_i \mid {\bf x}_i, \theta )p({\bf x}_i\mid \theta)} d {\bf z}_i \\ &= \sum_{i} \int_{\mathcal Z} q({\bf z}_i) \log \frac{ q({\bf z}_i) }{p({\bf z}_i \mid {\bf x}_i, \theta )} d {\bf z}_i \\ &= \sum_{i} \mathcal D_{KL}\left( q({\bf z}_i) \mid \mid p({\bf z}_i \mid {\bf x}_i, \theta )\right ) \end{align} $$with the Kullback-Leibler divergence $\mathcal D_{KL}$.
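A quick numerical sanity check (an assumed toy example, not part of the derivation): for a single observation of a two-component Gaussian mixture with a discrete latent $z$, any choice of $q(z)$ gives $\mathcal L(q, \theta) \leq \log p(x \mid \theta)$, and the gap equals $\mathcal D_{KL}(q \mid\mid p(z \mid x, \theta))$. All numbers below are made up for illustration.

```python
# Toy check (assumed example): lower bound and gap = KL for a discrete latent z.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

x = 1.3                                        # one observation
weights = np.array([0.3, 0.7])                 # p(z | theta)
means, stds = np.array([0.0, 2.0]), np.array([1.0, 1.0])

log_joint = np.log(weights) + norm.logpdf(x, means, stds)  # log p(x, z | theta)
log_px = logsumexp(log_joint)                              # log p(x | theta)
log_post = log_joint - log_px                              # log p(z | x, theta)

rng = np.random.default_rng(0)
for _ in range(3):
    q = rng.dirichlet(np.ones(2))                          # arbitrary q(z)
    elbo = np.sum(q * (log_joint - np.log(q)))             # L(q, theta)
    gap = log_px - elbo
    kl = np.sum(q * (np.log(q) - log_post))                # D_KL(q || p(z|x,theta))
    print(f"ELBO = {elbo:.4f} <= log p(x|theta) = {log_px:.4f}, "
          f"gap = {gap:.6f} = KL = {kl:.6f}")
```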
We want to minimize the GAP, which is equivalent to maximizing the lower bound.
The Kullback-Leibler divergence $\mathcal D_{KL}$ is zero iff $q({\bf z}_i) = p({\bf z}_i \mid {\bf x}_i, \theta )$.
So, in the E-Step we set the variational distribution $q({\bf z}_i)$ to $p({\bf z}_i \mid {\bf x}_i, \theta )$ (if we can compute $p({\bf z}_i \mid {\bf x}_i, \theta )$).
Note: For each $i$ we have an (in general different) distribution $q({\bf z}_i)$, i.e. we use the usual "abuse of notation": the argument ${\bf z}_i$ of the function $q({\bf z}_i)$ identifies the distribution.
Note: If ${\bf z}_i$ can take only a finite number of values, we can compute $p({\bf z}_i \mid {\bf x}_i, \theta^{(t)} )$ by using Bayes' rule:
$$ p({\bf z}_i \mid {\bf x}_i, \theta^{(t)} )=\frac{p({\bf x}_i \mid {\bf z}_i, \theta^{(t)}) p({\bf z}_i \mid \theta^{(t)})}{\sum_{{\bf z}'_i}p({\bf x}_i \mid {\bf z}'_i, \theta^{(t)}) p({\bf z}'_i \mid \theta^{(t)})} $$Note (M-Step): We have to maximize the expected complete-data log-likelihood (expectation w.r.t. $q({\bf z}_i)$), i.e. the complete-data log-likelihood averaged under our current guess of the distribution $q^{(t+1)}({\bf z}_i) = p({\bf z}_i \mid {\bf x}_i, \theta^{(t)} )$ from the E-Step: $$ \theta^{(t+1)} = \text{arg} \max_\theta \sum_{i} \mathbb E_{q^{(t+1)}({\bf z}_i)} \left[ \log p({\bf x}_i, {\bf z}_i \mid \theta ) \right] $$
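As an illustration of the E-Step with a discrete latent variable, here is a sketch for an assumed 1-D Gaussian mixture (parameter values and the function name `e_step` are made up). Bayes' rule above becomes a normalization of the per-component joint densities:

```python
# Sketch of the E-Step for an assumed 1-D Gaussian mixture:
# q^(t+1)(z_i = k) = p(z_i = k | x_i, theta^(t)) via Bayes' rule.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def e_step(x, weights, means, stds):
    """Return q[i, k] = p(z_i = k | x_i, theta) for all observations x."""
    log_joint = np.log(weights) + norm.logpdf(x[:, None], means, stds)
    # normalize over the components (the denominator of Bayes' rule)
    return np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))

x = np.array([-1.0, 0.2, 3.9, 5.1])
q = e_step(x, weights=np.array([0.5, 0.5]),
           means=np.array([0.0, 4.5]), stds=np.array([1.0, 1.0]))
print(q)   # each row is a distribution over z_i and sums to one
```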
Lower bound:
$$ \log p(X \mid \theta^{(t+1)}) \geq \mathcal L(q^{(t+1)}, \theta^{(t+1)}) $$and because of the optimization procedure (E-Step and M-Step) the following inequalities hold:
$$ \mathcal L(q^{(t+1)}, \theta^{(t+1)})\geq \mathcal L(q^{(t+1)}, \theta^{(t)}) \geq \mathcal L(q^{(t)}, \theta^{(t)}) $$and because the E-Step sets $q^{(t+1)}({\bf z}_i) = p({\bf z}_i \mid {\bf x}_i, \theta^{(t)} )$, the lower bound is tight at $\theta^{(t)}$: $$ \mathcal L(q^{(t+1)}, \theta^{(t)}) = \log p(X \mid \theta^{(t)}) $$
Therefore $$ \log p(X \mid \theta^{(t+1)}) \geq \log p(X \mid \theta^{(t)}) $$
So it is guaranteed that in each iteration the log-likelihood does not decrease (it increases until convergence).
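To see this monotonicity in practice, here is a complete toy EM run for an assumed 1-D two-component Gaussian mixture (data, initialization, and variable names are made up for the example). The printed log-likelihood values never decrease from one iteration to the next.

```python
# Toy EM for an assumed 1-D Gaussian mixture: the log-likelihood never decreases.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(4.0, 1.0, 300)])

# initial guess theta^(0)
weights = np.array([0.5, 0.5])
means = np.array([-1.0, 1.0])
stds = np.array([1.0, 1.0])

for t in range(10):
    # E-Step: q^(t+1)(z_i) = p(z_i | x_i, theta^(t))
    log_joint = np.log(weights) + norm.logpdf(x[:, None], means, stds)
    log_px = logsumexp(log_joint, axis=1, keepdims=True)
    q = np.exp(log_joint - log_px)

    print(f"iteration {t}: log p(X | theta^({t})) = {log_px.sum():.4f}")

    # M-Step: maximize the expected complete-data log-likelihood
    # (closed form for a Gaussian mixture)
    nk = q.sum(axis=0)
    weights = nk / len(x)
    means = (q * x[:, None]).sum(axis=0) / nk
    stds = np.sqrt((q * (x[:, None] - means) ** 2).sum(axis=0) / nk)
```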