How to use Variational Autoencoder's μ and σ with user-generated z?

Data Science Asked by unsure_automata on December 14, 2020

My understanding of a VAE is that, unlike an autoencoder, it does not directly give you a discrete encoding (an n-dimensional latent code vector); instead, it gives you both mu and sigma (an n-dimensional mean vector and an n-dimensional standard-deviation vector). You then sample epsilon from a standard normal distribution and combine it with mu and sigma to create z, which is what the VAE's decoder decodes. z is basically the main encoding.
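In code, that sampling step looks roughly like this (a minimal numpy sketch; `mu` and `sigma` are placeholders standing in for my encoder's outputs):

```python
import numpy as np

# mu and sigma stand in for the encoder's outputs for one input (n = 10 dimensions).
mu = np.zeros(10)     # placeholder mean vector
sigma = np.ones(10)   # placeholder standard-deviation vector

# Reparameterization trick: sample epsilon from a standard normal, then shift and scale it.
epsilon = np.random.randn(10)
z = mu + sigma * epsilon   # z is the latent code that the decoder decodes
```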

Say my z, mu, and sigma are n-dimensional with n = 10 (10-dim z, mu, sigma). I let the user freely pick 10 numbers in [-a, a], say a = 5, so the user can choose each of the 10 components between -5 and 5.

This becomes my z, which is decoded by my decoder to generate a new image.

[Main problem]

My VAE is trained on a dataset of apparel. Now, if I run my VAE's encoder on each item in the dataset, I'd get a mu and sigma for each one (I'm not sure if this is correct).

Using the z given by the user, how do I find the most similar item in the dataset, given that the VAE's encoding consists only of mu and sigma?

My thinking is to generate z from the mu and sigma produced by the VAE's encoder, but to generate z I still need to sample epsilon from a distribution, which makes the result non-deterministic with respect to the user-generated z. This randomness is why I am not sure how to use the encoded z to find the nearest match to the user-generated z.

One Answer

If you are training a VAE, the encoder is essentially parameterizing the variational posterior distribution of $z$ given $x$, i.e. $$ q(z | x) = \prod_{i=1}^{N_z} q(z_i | x) = \prod_{i=1}^{N_z} \frac{1}{\sqrt{2\pi}\sigma_i(x)} \exp\left[ -\frac{(z_i - \mu_i(x))^2}{2\sigma_i(x)^2} \right] $$ where $\mu_i(x)$ and $\sigma_i(x)$ are given by the encoder and $z$ lives in the $N_z$-dimensional latent space. I would think of the problem as if the $x$ were parameters of a probability distribution and $z$ were an observation you made. The "nearest" encoding from your training data $x$ would then be the encoding with the highest likelihood, i.e. for a given $z$ you evaluate the expression above for each data point and take the $x$ with the maximal value.

The log-likelihood is usually used in these scenarios because it's more convenient, but it is equivalent, as the likelihood is non-negative and the log is a monotonic function.

In the comments you mentioned using a distance metric. The log-likelihood provides a nice interpretation here, because it gives you something similar to the negative euclidean distance between $\mu(x)$ and $z$, but scaled and shifted by terms determined by the standard deviation:

$$ \log q(z|x) = \sum_{i=1}^{N_z} \log q(z_i|x) = \sum_{i=1}^{N_z} \left[ -\frac{(z_i - \mu_i(x))^2}{2\sigma_i(x)^2} - \log\left( \sqrt{2\pi}\sigma_i(x) \right) \right] $$

So intuitively, by maximizing the (log-)likelihood, you are minimizing the euclidean distance between a given $z$ and the encoding $\mu(x)$ of $x$ from the training data set, but you pay a penalty for large variances.

(Furthermore, if you do it this way, there is no sampling of $\epsilon$ required.)
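A minimal numpy sketch of that lookup, assuming you have already run the encoder over the dataset and stored its outputs in (hypothetical) arrays `mus` and `sigmas` of shape `(N, N_z)`:

```python
import numpy as np

def log_likelihood(z, mus, sigmas):
    """Log q(z|x) for every training point, per the Gaussian posterior above.

    z      : (N_z,)    user-supplied latent vector
    mus    : (N, N_z)  encoder means for the N training points
    sigmas : (N, N_z)  encoder standard deviations for the N training points
    Returns an array of N log-likelihood values.
    """
    return np.sum(
        -((z - mus) ** 2) / (2.0 * sigmas ** 2)
        - np.log(np.sqrt(2.0 * np.pi) * sigmas),
        axis=1,
    )

# Toy example with made-up encoder outputs: 1000 training items, 10-dim latent space.
rng = np.random.default_rng(0)
mus = rng.normal(size=(1000, 10))
sigmas = np.abs(rng.normal(size=(1000, 10))) + 0.1
z_user = rng.uniform(-5, 5, size=10)   # the user's hand-picked z

nearest = np.argmax(log_likelihood(z_user, mus, sigmas))
print("Index of the most similar training example:", nearest)
```

Equivalently, you could compute the per-dimension terms with `scipy.stats.norm.logpdf` and sum over the latent axis; the argmax is the same.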

Answered by matthiaw91 on December 14, 2020
