Generative Models: Variational Autoencoders
Generative models are a class of statistical models that aim to learn the underlying data distribution from a given dataset. These models provide a way to generate new samples that are statistically similar to the training data. They have gained substantial attention in various domains, such as image generation, speech synthesis, and even drug discovery.
Generative Model
Given a dataset of observed samples, one starts by selecting a distributional model parameterized by $\theta$. The objective is to estimate $\theta$ so that the model fits the observed samples as well as possible, in the hope that it also generalizes to samples outside the training set.
The optimal distribution is hence the one that maximizes the likelihood of producing the observed data, giving lower probabilities to infrequent observations and higher probabilities to the more common ones (the principle underlying this assumption is that ’the world is a boring place’, in the words of Bhiksha Raj).
The Challenge of Maximum Likelihood Estimates (MLE) for Unseen Observations
When training generative models, a natural objective is to optimize the model parameters such that the likelihood of the observed data under the model is maximized. This method is known as Maximum Likelihood Estimation (MLE). In mathematical terms, given observed data $X$, the MLE seeks parameters $\theta$ that maximize:
$$p_\theta(X)$$
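As a small illustration of MLE when no latent variables are involved, the sketch below fits a one-dimensional Gaussian, for which the maximizing parameters have a closed form. The data, the function names, and the perturbed parameters used for comparison are assumptions made for this example only.

```python
import math
import random

def gaussian_log_likelihood(data, mu, sigma):
    """Sum of log N(x | mu, sigma^2) over all samples."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - squared_error / (2 * sigma ** 2)

def gaussian_mle(data):
    """Closed-form maximum likelihood estimates for a 1-D Gaussian."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n  # MLE divides by n, not n - 1
    return mu, math.sqrt(var)

random.seed(0)
data = [random.gauss(2.0, 1.5) for _ in range(1000)]  # toy data, assumed for illustration

mu_hat, sigma_hat = gaussian_mle(data)
best = gaussian_log_likelihood(data, mu_hat, sigma_hat)
# Any other parameter setting should score a lower log-likelihood.
worse = gaussian_log_likelihood(data, mu_hat + 0.3, sigma_hat * 1.2)
print(mu_hat, sigma_hat, best, worse)
```

Here the log-likelihood is a sum of simple terms, one per sample, which is what makes the closed-form solution possible.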
However, for many generative models, especially those that involve latent or unobserved variables, the likelihood term involves summing or integrating over all possible configurations of these latent variables. Mathematically, this turns into:
$$p_\theta(X) = \sum_{Z} p_\theta(X,Z) \quad \text{or} \quad p_\theta(X) = \int p_\theta(X,Z)\, dZ$$
Computing the log-likelihood, which is often used for numerical stability and optimization ease, leads to a log of summations (for discrete latent variables) or a log of integrals (for continuous latent variables):
$$\log p_\theta(X) = \log \sum_{Z} p_\theta(X,Z) \quad \text{or} \quad \log p_\theta(X) = \log \int p_\theta(X,Z)\, dZ$$
These expressions are typically intractable to optimize directly due to the presence of the log-sum or log-integral operations (see the note below).
Marginalization in the Context of Joint Probability
When discussing the computation of the joint probability for observed and missing data, the term “marginalizing” refers to summing or integrating over all possible outcomes of the missing data. This process provides a probability distribution based solely on the observed data. For example, let’s assume:
 $X$ is the observed data
 $Z$ is the missing data
 The joint probability for both is represented as $p(X,Z)$
If your primary interest lies in the distribution of $X$ and you wish to eliminate the dependence on $Z$, you’ll need to marginalize over $Z$. For discrete variables, marginalization is a summation over $Z$, and for continuous variables it is an integration. Taking the logarithm of the result therefore gives the log of a sum or the log of an integral, and a function of that form defies direct optimization.
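To make the log-of-a-sum concrete, here is a minimal sketch using a two-component Gaussian mixture, where $Z$ is the (missing) component label. The mixture parameters and function names are hypothetical and chosen only for illustration.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """log N(x | mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

# Hypothetical mixture: p(Z) and the parameters of p(X | Z)
weights = [0.3, 0.7]
means = [-2.0, 3.0]
sigmas = [1.0, 0.5]

def log_joint(x, z):
    """log p(X = x, Z = z) = log p(z) + log p(x | z)."""
    return math.log(weights[z]) + log_normal_pdf(x, means[z], sigmas[z])

def log_marginal(x):
    """log p(X = x) = log sum_z p(x, z): the sum sits inside the log."""
    return log_sum_exp([log_joint(x, z) for z in range(len(weights))])

x = 1.0
direct = math.log(sum(math.exp(log_joint(x, z)) for z in range(2)))
naive_split = sum(log_joint(x, z) for z in range(2))  # NOT equal to log p(x)
print(log_marginal(x), direct, naive_split)
```

Because the log does not distribute over the sum, the parameters of all components stay coupled inside a single term, unlike the fully observed case where the log-likelihood splits into independent pieces.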
Can we get an approximation to this that is more tractable (without a summation or integral within the log)?
Overcoming the Challenge with Expectation Maximization (EM)
To address the optimization challenge in MLE with latent variables, the Expectation Maximization (EM) algorithm is employed. The EM algorithm offers a systematic approach to iteratively estimate both the model parameters and the latent variables.
The algorithm involves two main steps:
 E-step (Expectation step): Compute the expected value of the complete-data log-likelihood with respect to the posterior distribution of the latent variables given the observed data.
 M-step (Maximization step): Update the model parameters to maximize this expected log-likelihood from the E-step.
By alternating between these two steps, EM ensures that the likelihood never decreases from one iteration to the next until convergence (typically to a local optimum), thus providing a practical method to fit generative models with latent variables.
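The two steps can be sketched for a one-dimensional mixture of two Gaussians, a case where both steps have closed forms. This is a toy example under stated assumptions: the synthetic data, the initial parameters, the variance floor, and all function names are illustrative choices, not part of the original text.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def log_likelihood(data, weights, means, sigmas):
    """log p(X) = sum over points of log sum_z p(x, z)."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas)))
        for x in data
    )

def em_step(data, weights, means, sigmas):
    # E-step: posterior p(z | x) of each component for each data point.
    resp = []
    for x in data:
        joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: weighted maximum likelihood updates of the parameters.
    n = len(data)
    new_weights, new_means, new_sigmas = [], [], []
    for k in range(len(weights)):
        nk = sum(r[k] for r in resp)
        mu = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var = sum(r[k] * (x - mu) ** 2 for r, x in zip(resp, data)) / nk
        new_weights.append(nk / n)
        new_means.append(mu)
        new_sigmas.append(math.sqrt(max(var, 1e-6)))  # floor guards against collapse
    return new_weights, new_means, new_sigmas

random.seed(0)
data = ([random.gauss(-2.0, 1.0) for _ in range(200)]
        + [random.gauss(3.0, 0.5) for _ in range(300)])  # synthetic data

weights, means, sigmas = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]  # arbitrary start
history = [log_likelihood(data, weights, means, sigmas)]
for _ in range(30):
    weights, means, sigmas = em_step(data, weights, means, sigmas)
    history.append(log_likelihood(data, weights, means, sigmas))
print(means, history[0], history[-1])
```

The `history` list records the log-likelihood after each iteration, which makes it easy to check that it does not go down.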
The E-step relies on the variational lower bound, commonly referred to as the Evidence Lower Bound (ELBO), a central concept in variational inference. Variational inference approximates complex distributions (typically posterior distributions) with simpler, more tractable ones. The ELBO is an auxiliary function that provides a lower bound on the log-likelihood of the observed data. By iteratively maximizing the ELBO, we approximate the Maximum Likelihood Estimation (MLE) of the model parameters.
Let’s reconsider our aim of maximizing the log-likelihood of an observation $x$, now written in terms of a distribution $q_\phi(z \mid x)$ over the latent variables.
$$\log p_\theta(x) = \log \int_z p_\theta(x,z)\,dz$$
$$= \log \int_z \frac{p_\theta(x,z)\,q_\phi(z \mid x)}{q_\phi(z \mid x)}\,dz$$
$$= \log E_{z \sim q_\phi(z \mid x)} \left[ \frac{p_\theta(x,z)}{q_\phi(z \mid x)} \right]$$
$$\geq E_z \left[ \log \frac{p_\theta(x,z)}{q_\phi(z \mid x)} \right] \quad \text{(by Jensen’s inequality)}$$
$$= E_z[\log p_\theta(x,z)] + \int_z q_\phi(z \mid x) \log \frac{1}{q_\phi(z \mid x)}\,dz$$
$$= E_z[\log p_\theta(x,z)] + H(q_\phi(z \mid x))$$
In the equation above, the term $H(\cdot)$ denotes the Shannon entropy. By definition, the term “evidence” is the value of a likelihood function evaluated with fixed parameters. With the definition of:
$$L = E_z[\log p_\theta(x,z)] + H(q_\phi(z \mid x)),$$
it turns out that $L$ is a lower bound on the log-evidence of the observations, so maximizing $L$ pushes up the log-likelihood of $x$.
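The bound can be checked numerically on a model small enough to compute everything exactly. The sketch below assumes a discrete latent variable with two states and Gaussian conditionals; the parameters, the observation, and the function names are hypothetical.

```python
import math

def log_normal_pdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

# Hypothetical model: z in {0, 1} with prior p(z), and p(x | z) Gaussian.
prior = [0.3, 0.7]
means = [-2.0, 3.0]
sigmas = [1.0, 0.5]
x = 1.0  # a single observation

def log_joint(z):
    """log p(x, z) for the fixed observation x."""
    return math.log(prior[z]) + log_normal_pdf(x, means[z], sigmas[z])

def elbo(q):
    """L = E_q[log p(x, z)] + H(q) for a discrete distribution q over z."""
    expected_log_joint = sum(qz * log_joint(z) for z, qz in enumerate(q) if qz > 0)
    entropy = -sum(qz * math.log(qz) for qz in q if qz > 0)
    return expected_log_joint + entropy

log_evidence = math.log(sum(math.exp(log_joint(z)) for z in range(2)))
posterior = [math.exp(log_joint(z) - log_evidence) for z in range(2)]  # exact p(z | x)

print(log_evidence, elbo([0.5, 0.5]), elbo(posterior))
```

For an arbitrary `q` the value of `elbo(q)` stays below `log_evidence`, and the gap closes when `q` equals the exact posterior.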
Variational Autoencoders (VAEs)
Variational Autoencoders are a specific type of generative model that brings together ideas from deep learning and Bayesian inference. VAEs are especially known for their application in generating new, similar data to the input data (like images or texts) and for their ability to learn latent representations of data.
1. Generative Models and Latent Variables
In generative modeling, our goal is to learn a model of the probability distribution from which a dataset is drawn. The model can then be used to generate new samples. A VAE makes a specific assumption that there exist some latent variables (or hidden variables) that when transformed give rise to the observed data.
Let $x$ be the observed data and $z$ be the latent variables. The generative story can be seen as:
 Draw $z$ from a prior distribution, $p(z)$.
 Draw $x$ from a conditional distribution, $p(x \mid z)$.
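This two-step generative story can be sketched with a deliberately simple model: a standard normal prior and a linear-Gaussian conditional. The parameters `w`, `b`, and `noise_std` are assumptions for illustration; in a VAE the linear map would be replaced by a neural network.

```python
import random

def sample_pair(rng, w=2.0, b=1.0, noise_std=0.1):
    """Draw (z, x) by ancestral sampling."""
    z = rng.gauss(0.0, 1.0)              # z ~ p(z) = N(0, 1)
    x = rng.gauss(w * z + b, noise_std)  # x ~ p(x | z) = N(w z + b, noise_std^2)
    return z, x

rng = random.Random(0)
samples = [sample_pair(rng) for _ in range(5000)]
xs = [x for _, x in samples]

mean_x = sum(xs) / len(xs)
var_x = sum((x - mean_x) ** 2 for x in xs) / len(xs)
# For this toy model the marginal is N(b, w^2 + noise_std^2) = N(1.0, 4.01).
print(mean_x, var_x)
```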
2. Problem of Direct Inference
As discussed previously, direct inference for the posterior distribution $p(z \mid x)$ (i.e., the probability of the latent variables given the observed data) can be computationally challenging, especially when dealing with high-dimensional data or complex models. This is because:
$$ p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)} $$
Here, $p(x)$ is the evidence (or marginal likelihood) which is calculated as:
$$ p(x) = \int p(x \mid z)\, p(z)\, dz $$
As we saw, this integral is intractable for most interesting models.
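One way to get a feel for this integral is to estimate it by sampling $z$ from the prior and averaging $p(x \mid z)$. The sketch below does this for a toy linear-Gaussian model, chosen because its evidence happens to have a closed form that the estimate can be compared against. The constants `W`, `B`, `NOISE_STD` and the function names are assumptions for this example.

```python
import math
import random

def normal_pdf(x, mean, std):
    return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

# Hypothetical toy model: p(z) = N(0, 1), p(x | z) = N(W z + B, NOISE_STD^2)
W, B, NOISE_STD = 2.0, 1.0, 1.0

def evidence_monte_carlo(x, num_samples, rng):
    """Estimate p(x) = E_{z ~ p(z)}[p(x | z)] by averaging over prior samples."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)
        total += normal_pdf(x, W * z + B, NOISE_STD)
    return total / num_samples

def evidence_exact(x):
    """Closed form, available only because this toy model is linear-Gaussian."""
    return normal_pdf(x, B, math.sqrt(W ** 2 + NOISE_STD ** 2))

rng = random.Random(0)
x = 2.0
estimate = evidence_monte_carlo(x, 20000, rng)
exact = evidence_exact(x)
print(estimate, exact)
```

With a neural-network decoder and a high-dimensional $z$ there is no closed form to fall back on, and most prior samples contribute almost nothing to the average, so this naive estimator needs far too many samples to be practical.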
3. Variational Inference and ELBO
To sidestep the intractability of the posterior, VAEs employ variational inference. Instead of computing the posterior directly, we introduce a parametric approximate posterior distribution, $q_{\phi}(z \mid x)$, with its own parameters $\phi$.
The goal now shifts to making this approximation as close as possible to the true posterior, that is, to minimizing the Kullback-Leibler (KL) divergence between the approximate and the true posterior. This divergence cannot be computed directly because it involves the intractable $p(z \mid x)$, but minimizing it with respect to $\phi$ is equivalent to maximizing the ELBO, which is tractable.
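The link between the KL divergence and the ELBO follows from a short derivation. Starting from the ELBO $L = E_z[\log p_\theta(x,z)] + H(q_\phi(z \mid x))$ defined earlier and factoring the joint as $p_\theta(x,z) = p_\theta(z \mid x)\, p_\theta(x)$:

$$L = E_{z \sim q_\phi(z \mid x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z \mid x)}\right] = E_z\left[\log \frac{p_\theta(z \mid x)\, p_\theta(x)}{q_\phi(z \mid x)}\right]$$
$$= \log p_\theta(x) - E_z\left[\log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\right] = \log p_\theta(x) - KL\left(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\right)$$

Since $\log p_\theta(x)$ does not depend on $\phi$, raising $L$ with respect to $\phi$ lowers the KL divergence by the same amount. Factoring the joint the other way, $p_\theta(x,z) = p_\theta(x \mid z)\, p(z)$, gives the form that is used as a training objective:

$$L = E_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - KL\left(q_\phi(z \mid x)\,\|\,p(z)\right)$$

The first term rewards accurate reconstruction of $x$, and the second keeps the approximate posterior close to the prior.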
4. Neural Networks and Autoencoding Structure
In VAEs, neural networks are employed to parameterize the complex functions. Specifically:
 Encoder Network: This maps the observed data, $x$, to the parameters of the approximate posterior, $q_{\phi}(z \mid x)$.
 Decoder Network: Given samples of $z$ drawn from $q_{\phi}(z \mid x)$, this maps back to the data space, outputting parameters for the data likelihood, $p_{\theta}(x \mid z)$.
The “autoencoder” terminology comes from the encoder-decoder structure where the model is trained to reconstruct its input data.
5. Training a VAE
The training process involves:
 Forward pass: Input data is passed through the encoder to obtain parameters of $q_{\phi}(z \mid x)$.
 Sampling: Latent variables $z$ are sampled from $q_{\phi}(z \mid x)$ using the reparameterization trick, which keeps the sampling step differentiable for backpropagation.
 Reconstruction: The sampled $z$ values are passed through the decoder to obtain the data likelihood parameters, $p_{\theta}(x \mid z)$.
 Loss Computation: Two terms are considered: the reconstruction loss (how well the VAE reconstructs the data) and the KL divergence between $q_{\phi}(z \mid x)$ and the prior $p(z)$.
 Backpropagation and Optimization: The model parameters $\phi$ and $\theta$ are updated to maximize the ELBO.
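Two pieces of the steps above are easy to illustrate in isolation: the reparameterization trick and the KL term. The sketch assumes a one-dimensional Gaussian approximate posterior and a standard normal prior, for which the KL divergence has a closed form; the values of `mu` and `log_var` and the function names are hypothetical.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, 1).

    The randomness lives entirely in eps, so z is a deterministic,
    differentiable function of mu and log_var.
    """
    eps = rng.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) in one dimension."""
    return 0.5 * (math.exp(log_var) + mu ** 2 - 1.0 - log_var)

def log_normal_pdf(z, mu, log_var):
    return -0.5 * (math.log(2 * math.pi) + log_var + (z - mu) ** 2 / math.exp(log_var))

rng = random.Random(0)
mu, log_var = 0.8, -0.5  # stand-ins for encoder outputs

zs = [reparameterize(mu, log_var, rng) for _ in range(20000)]
# Monte Carlo estimate of the same KL: E_q[log q(z) - log p(z)]
kl_mc = sum(log_normal_pdf(z, mu, log_var) - log_normal_pdf(z, 0.0, 0.0) for z in zs) / len(zs)
kl_exact = kl_to_standard_normal(mu, log_var)
print(kl_exact, kl_mc)
```

Parameterizing the variance through `log_var` is a common choice because it lets the encoder output any real number while the variance stays positive.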
By the end of the training, you’ll have a model that can generate new samples resembling your input data by simply sampling from the latent space and decoding the samples.
VAEs are powerful tools at the intersection of deep learning and probabilistic modeling, with many applications, especially in unsupervised learning tasks.
Variational Autoencoders with PyTorch
Let’s create a basic implementation of a Variational Autoencoder (VAE) using PyTorch. The VAE is designed to work on simple image data, such as the MNIST dataset.
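A minimal sketch of such a model is given here, under several assumptions: fully connected layers, flattened 28x28 inputs, a Bernoulli likelihood on pixels, and illustrative layer sizes (400 hidden units, 20 latent dimensions). The class name `VAE`, the function `negative_elbo`, and the random binary batch standing in for MNIST are hypothetical choices that keep the example self-contained; in practice the batches would come from the real dataset.

```python
import torch
from torch import nn
from torch.nn import functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
        super().__init__()
        self.enc_hidden = nn.Linear(input_dim, hidden_dim)
        self.enc_mu = nn.Linear(hidden_dim, latent_dim)
        self.enc_log_var = nn.Linear(hidden_dim, latent_dim)
        self.dec_hidden = nn.Linear(latent_dim, hidden_dim)
        self.dec_out = nn.Linear(hidden_dim, input_dim)

    def encode(self, x):
        """Map x to the mean and log-variance of q(z | x)."""
        h = F.relu(self.enc_hidden(x))
        return self.enc_mu(h), self.enc_log_var(h)

    def reparameterize(self, mu, log_var):
        """Sample z = mu + sigma * eps so gradients flow to mu and log_var."""
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        """Map z to Bernoulli logits over pixels, the parameters of p(x | z)."""
        h = F.relu(self.dec_hidden(z))
        return self.dec_out(h)

    def forward(self, x):
        mu, log_var = self.encode(x)
        z = self.reparameterize(mu, log_var)
        return self.decode(z), mu, log_var

def negative_elbo(logits, x, mu, log_var):
    """Reconstruction loss plus KL(q(z | x) || N(0, I)), averaged over the batch."""
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return (recon + kl) / x.size(0)

torch.manual_seed(0)
model = VAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in batch of sparse binary "images"; replace with flattened MNIST digits.
x = (torch.rand(64, 784) < 0.1).float()

losses = []
for _ in range(30):
    optimizer.zero_grad()
    logits, mu, log_var = model(x)
    loss = negative_elbo(logits, x, mu, log_var)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Generation: sample z from the prior and decode to pixel probabilities.
with torch.no_grad():
    z = torch.randn(8, 20)
    generated = torch.sigmoid(model.decode(z))
print(losses[0], losses[-1], generated.shape)
```

Minimizing `negative_elbo` is the same as maximizing the ELBO. The loop repeatedly fits a single batch only to show the mechanics; a real training run would iterate over the whole dataset for several epochs.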




Bibliography
 Doersch, Carl. 2021. “Tutorial on Variational Autoencoders.” January 3, 2021. http://arxiv.org/abs/1606.05908.
 Kingma, Diederik P., and Max Welling. 2019. “An Introduction to Variational Autoencoders.” Foundations and Trends® in Machine Learning 12 (4): 307–92. https://doi.org/10.1561/2200000056.
 Ramchandran, Siddharth, Gleb Tikhonov, Otto Lönnroth, Pekka Tiikkainen, and Harri Lähdesmäki. 2022. “Learning Conditional Variational Autoencoders with Missing Covariates.” March 2, 2022. http://arxiv.org/abs/2203.01218.
 Jiang, Yunfan. 2021. “ELBO — What & Why.” January 11, 2021. https://yunfanj.com/blog/2021/01/11/ELBO.html.