Why is the autoencoder decoder usually the reverse architecture as the encoder?

Asked on Cross Validated by duncster94 on December 3, 2021

The majority of autoencoder architectures I’ve seen are similar, namely that the decoder is just the reverse of the encoder. If the goal of the autoencoder is low-dimensional feature learning, why not make the decoder simple? An example would be a linear transformation $FW$, where $F$ is an $n$-observation by $f$-feature matrix (i.e. the bottleneck) and $W$ is a learned weight matrix that maps $F$ back to the original input feature size. In a deep autoencoder with multiple hidden layers, the decoder in this example would have low capacity compared to the encoder.
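For concreteness, here is a minimal PyTorch sketch of that asymmetric setup: a multi-layer encoder producing the bottleneck $F$, and a decoder that is nothing but a single learned weight matrix $W$. The layer sizes and the MSE training loss are illustrative assumptions, not something prescribed by the question.

```python
import torch
import torch.nn as nn

class AsymmetricAE(nn.Module):
    """High-capacity encoder, single linear-layer decoder (sizes are made up)."""
    def __init__(self, in_dim=784, hidden_dim=256, bottleneck_dim=32):
        super().__init__()
        # Deep encoder producing the bottleneck features F.
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, bottleneck_dim),
        )
        # "Simple" decoder: one learned weight matrix W mapping the
        # bottleneck back to the original input dimension.
        self.decoder = nn.Linear(bottleneck_dim, in_dim, bias=False)

    def forward(self, x):
        f = self.encoder(x)      # F: n x f bottleneck matrix
        return self.decoder(f)   # FW: linear reconstruction

model = AsymmetricAE()
x = torch.randn(64, 784)                          # dummy batch
loss = nn.functional.mse_loss(model(x), x)        # reconstruction loss
```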

My intuition is the following:

If the decoder is simple, then the autoencoder is forced to learn higher quality features in the bottleneck to compensate. Conversely, if the decoder has high representational capacity, it can map a poorly learned bottleneck to the output reconstructions effectively. The reconstruction error might be lower in this case, but that doesn’t necessarily mean the learned features are actually better.

In my own application (feature learning on graphs), I have found that a simple decoder results in better learned features than a decoder that just mirrors the encoder. In this paper, the authors design a graph autoencoder with a very simple decoder, $\hat{A} = \sigma(ZZ^T)$, where $\hat{A}$ is the reconstructed graph adjacency matrix, $Z$ is the learned feature matrix and $\sigma$ is some non-linear transformation like a ReLU.
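For reference, a rough sketch of that inner-product decoder, using a sigmoid for $\sigma$ and random placeholder embeddings (the exact non-linearity and embedding sizes depend on the paper and the application):

```python
import torch

# Placeholder node embeddings Z (n_nodes x emb_dim); in practice these
# come from the graph encoder.
n_nodes, emb_dim = 100, 16
Z = torch.randn(n_nodes, emb_dim)

# Inner-product decoder: reconstructed adjacency A_hat = sigma(Z Z^T).
A_hat = torch.sigmoid(Z @ Z.T)    # n_nodes x n_nodes
```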

I’ve been looking around for an answer to this question on-and-off for a while but I haven’t found any explanations or theoretical results as to why a higher capacity decoder is preferable to a low capacity one (or vice-versa). If anyone can provide an explanation or point me in the right direction I’d be grateful.

2 Answers

I wonder if part of the reason might be historical (apparently Hinton's 2006 paper showed it done this way), and because (I believe) it was/is common to tie the weights. That is, the decoder uses the same weights as the encoder, and the two are effectively learned together.

This question and answer https://stackoverflow.com/q/36889732/841830 discuss the advantages of using tied weights. And some more background here: https://amiralavi.net/blog/2018/08/25/tied-autoencoders

Answered by Darren Cook on December 3, 2021

Your intuition is correct, but it's not in the right context. For starters, let's define "high-quality features" as features that can be recycled for training other models, i.e. transferable features. For example, training an encoder on (unlabeled) ImageNet could help give a solid baseline for classification on ImageNet and on other image datasets.

Most classical autoencoders are trained on some form of (regularized) L2 loss. This means that after encoding a representation, the decoder must reproduce the original image and is penalized based on the error of every single pixel. Regularization can help here, but this is why you tend to get fuzzy images. The issue is that the loss is not semantic: it doesn't care that humans have ears, but does care that skin color tends to be uniform across the face. So if you were to replace the decoder with something really simple, the representation would likely focus on getting the average color right in each region of the image (with the size of each region roughly proportional to the complexity of your decoder, and inversely proportional to your hidden layer size).

On the other hand, there are numerous general self-supervised techniques that can learn higher quality semantic features. The key here is to find a better loss function. You can find a really nice set of slides by Andrew Zisserman here. A simple example is a siamese network trained to predict the relative position of pairs of random crops:

[Figure: two random crops from an image of a cat, with the network asked to predict their relative position.]

In the above, the first crop is of the cat's face, and the network needs to predict that the ear crop should occur north-east of it. Note that the crops are chosen randomly, and the trick is to balance the minimum and maximum distance between crops so that related crops occur often.

In other words, the network uses a shared encoder and a rudimentary classifier to compare embeddings of different crops. This forces the network to learn what a cat really is, as opposed to a soft set of average colors and feature shapes.
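A rough sketch of that setup, with an illustrative architecture and sizes (not taken from the slides): a shared encoder embeds both crops, and a small classifier predicts which of the eight relative positions the second crop occupies.

```python
import torch
import torch.nn as nn

class RelativePositionNet(nn.Module):
    """Shared encoder + small classifier over 8 relative positions (illustrative)."""
    def __init__(self, emb_dim=128, n_positions=8):
        super().__init__()
        # Shared encoder applied to both crops (a tiny conv stack here).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, emb_dim),
        )
        # Rudimentary classifier comparing the two embeddings.
        self.classifier = nn.Linear(2 * emb_dim, n_positions)

    def forward(self, crop_a, crop_b):
        za = self.encoder(crop_a)
        zb = self.encoder(crop_b)               # same weights for both crops
        return self.classifier(torch.cat([za, zb], dim=1))

net = RelativePositionNet()
a = torch.randn(16, 3, 64, 64)                   # random crop pairs (dummy data)
b = torch.randn(16, 3, 64, 64)
target = torch.randint(0, 8, (16,))              # ground-truth relative positions
loss = nn.functional.cross_entropy(net(a, b), target)
```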

You'll find plenty more examples in the above slides, which also show that these embeddings transfer considerably better than those of rote autoencoders when trained to predict classes.

Answered by Alex R. on December 3, 2021
