Variational autoencoders

Auto-encoder models are used to model $p(x, z)$, the joint probability of $x$, the observed data, and $z$, the latent variable. The joint probability is normally factorized as $p(x, z) = p(x \mid z)\,p(z)$. During inference, we are interested in finding good values of $z$ to produce the observed data – that is, we are interested in learning the posterior probability of $z$ given $x$. Using Bayes' rule, we can rewrite the posterior as follows:

$$p(z \mid x) = \frac{p(x \mid z)\,p(z)}{p(x)}$$
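The denominator $p(x)$ is the evidence, obtained by marginalizing the joint distribution over the latent variable; written out for a continuous latent (an assumption made here only to keep the notation concrete), it is:

$$p(x) = \int p(x \mid z)\, p(z)\, dz$$

This integral runs over every possible configuration of $z$, which is what motivates the next step.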
Close inspection of this equation reveals that computing the evidence, the marginal distribution of the data $p(x)$, is hardly possible and normally intractable. We first try to circumvent this barrier by computing an approximation of the evidence. We do it by using variational inference and, instead, estimating the parameters of a known distribution $q_\lambda(z \mid x)$ that is the least divergent from the posterior. Variational inference approximates the posterior $p(z \mid x)$ with a family of distributions $q_\lambda(z \mid x)$, where the variational parameter $\lambda$ indexes the family of distributions. This can be done by minimizing the KL divergence between $q_\lambda(z \mid x)$ and $p(z \mid x)$, as described in the following equation:

$$\mathrm{KL}\big(q_\lambda(z \mid x)\,\|\,p(z \mid x)\big) = \mathbb{E}_{q}\big[\log q_\lambda(z \mid x)\big] - \mathbb{E}_{q}\big[\log p(x, z)\big] + \log p(x)$$
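The right-hand side follows from the definition of the KL divergence together with Bayes' rule; as a short worked step (using the fact that $\log p(x)$ does not depend on $z$ and can therefore be pulled out of the expectation):

$$\begin{aligned}
\mathrm{KL}\big(q_\lambda(z \mid x)\,\|\,p(z \mid x)\big)
  &= \mathbb{E}_{q}\big[\log q_\lambda(z \mid x) - \log p(z \mid x)\big] \\
  &= \mathbb{E}_{q}\big[\log q_\lambda(z \mid x)\big] - \mathbb{E}_{q}\big[\log p(x, z) - \log p(x)\big] \\
  &= \mathbb{E}_{q}\big[\log q_\lambda(z \mid x)\big] - \mathbb{E}_{q}\big[\log p(x, z)\big] + \log p(x)
\end{aligned}$$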
Unfortunately, using $q_\lambda(z \mid x)$ does not circumvent the problem, and we are still faced with computing the evidence $p(x)$. At this point, we give up on computing the exact evidence and focus on estimating an Evidence Lower Bound (ELBO). The ELBO sets the stage for Variational Autoencoders, and is computed by removing $\log p(x)$ from the previous equation and inverting the signs, giving:

$$\mathrm{ELBO}(\lambda) = \mathbb{E}_{q}\big[\log p(x, z)\big] - \mathbb{E}_{q}\big[\log q_\lambda(z \mid x)\big]$$
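Rearranging the KL expansion above shows why this quantity is a lower bound: since the KL divergence is never negative, the ELBO can never exceed the log evidence, and maximizing the ELBO with respect to $\lambda$ is equivalent to minimizing the KL divergence to the true posterior:

$$\log p(x) = \mathrm{ELBO}(\lambda) + \mathrm{KL}\big(q_\lambda(z \mid x)\,\|\,p(z \mid x)\big) \;\geq\; \mathrm{ELBO}(\lambda)$$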
VAEs consist of an encoder $q_\theta(z \mid x)$, parametrized by $\theta$, and a decoder $p_\phi(x \mid z)$, parametrized by $\phi$. The encoder is trained to approximate the posterior probability of the latent vector $z$ given the data $x$, $q_\theta(z \mid x)$. The decoder is trained to maximize the probability of the data $x$ given the latent vector $z$, $p_\phi(x \mid z)$. Informally speaking, the encoder learns to compress the data into a latent representation, and the decoder learns to decompress the data from the latent representation. The VAE loss is defined as follows:

$$\mathcal{L}(\theta, \phi) = -\,\mathbb{E}_{z \sim q_\theta(z \mid x)}\big[\log p_\phi(x \mid z)\big] + \mathrm{KL}\big(q_\theta(z \mid x)\,\|\,p(z)\big)$$
The first term represents the reconstruction loss, or the expectation of the negative log-likelihood of the data under the decoder. The second term is a regularization term that was derived in our problem setup: the KL divergence between the approximate posterior produced by the encoder and the prior $p(z)$.
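To make the two loss terms concrete, the following is a minimal sketch of a VAE in PyTorch. The framework, the layer sizes, the Gaussian encoder with the reparameterization trick, the Bernoulli (binary cross-entropy) decoder over flattened 28x28 images, and the standard normal prior are all illustrative assumptions, not choices prescribed by the text; with a diagonal Gaussian encoder and this prior, the KL term has the closed form used below, so only the reconstruction term needs to be estimated by sampling.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal VAE: Gaussian encoder q_theta(z|x), Bernoulli decoder p_phi(x|z)."""

    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        # Encoder: maps x to the mean and log-variance of q_theta(z|x)
        self.enc = nn.Linear(x_dim, h_dim)
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        # Decoder: maps z to the Bernoulli logits of p_phi(x|z)
        self.dec = nn.Linear(z_dim, h_dim)
        self.dec_out = nn.Linear(h_dim, x_dim)

    def encode(self, x):
        h = F.relu(self.enc(x))
        return self.enc_mu(h), self.enc_logvar(h)

    def reparameterize(self, mu, logvar):
        # Sample z ~ q_theta(z|x) in a differentiable way: z = mu + sigma * eps
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + std * eps

    def decode(self, z):
        h = F.relu(self.dec(z))
        return self.dec_out(h)  # logits of p_phi(x|z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar


def vae_loss(x, logits, mu, logvar):
    # First term: reconstruction loss, the expected negative log-likelihood
    # of x under the decoder (a Bernoulli likelihood in this sketch).
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # Second term: regularizer, KL(q_theta(z|x) || p(z)) with p(z) = N(0, I),
    # which has a closed form for a diagonal Gaussian encoder.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl


# Usage sketch: one optimization step on a batch of flattened images in [0, 1].
model = VAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)  # stand-in batch; real data would come from a data loader
logits, mu, logvar = model(x)
loss = vae_loss(x, logits, mu, logvar)
optimizer.zero_grad()
loss.backward()
optimizer.step()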

Unlike autoregressive models, VAEs are normally easy to parallelize during both training and inference. On the other hand, they are normally harder to optimize than autoregressive models.

The deep feature-consistent VAE is one of the best-performing VAE-based models for image generation. The following figure shows faces generated by the model. From a qualitative perspective, image samples produced with VAEs tend to be blurry.

Source: Deep feature-consistent VAE (https://arxiv.org/abs/1610.00291)