
Can residual neural networks use activation functions other than ReLU?

Artificial Intelligence Asked by jr123456jr987654321 on December 7, 2021

In many diagrams, such as the one below, residual neural networks are depicted only with ReLU activation functions. Can residual NNs also use other activation functions, such as the sigmoid or the hyperbolic tangent?

[Figure: diagram of a residual block, drawn with ReLU activations]

One Answer

The problem with certain activation functions, such as the sigmoid, is that they squash their input into a bounded interval, which is why they are sometimes classified as saturating activation functions. For example, the sigmoid function has codomain $(0, 1)$, as you can see from the illustration below.

[Figure: plot of the sigmoid function]
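To make the saturation concrete, here is the standard expression for the logistic sigmoid and its derivative (this is textbook calculus, not something specific to residual networks):

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr) \le \frac{1}{4},$$

with $\sigma'(x) \to 0$ as $|x| \to \infty$. During backpropagation, each sigmoid layer multiplies the gradient by one such factor, so, ignoring the weight matrices, the gradient reaching the early layers of a deep stack shrinks at least geometrically with depth.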

This behaviour can lead to the vanishing gradient problem, which is one of the problems that Sepp Hochreiter was trying to solve in the context of recurrent neural networks when he developed the LSTM together with his advisor, Jürgen Schmidhuber.

Empirically, people have noticed that the ReLU can avoid this vanishing gradient problem (see e.g. this blog post). The paper Deep Sparse Rectifier Neural Networks provides more details about the advantages of ReLUs (aka rectifiers), so you may want to read it. However, ReLUs can suffer from the opposite problem, namely the exploding gradient problem, although there are several ways to combat that issue (see e.g. this blog post).
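As a quick sanity check, here is a toy sketch (assuming PyTorch is available; the 50-layer chain with the weights fixed to 1 is purely illustrative, not a realistic network) that backpropagates through a deep stack of activations and compares the resulting input gradient:

```python
import torch

def deep_chain(x, act, depth=50):
    # Apply the same activation `depth` times with the weight fixed to 1,
    # to isolate the activation's contribution to the backward pass.
    for _ in range(depth):
        x = act(x)
    return x

for name, act in [("sigmoid", torch.sigmoid), ("relu", torch.relu)]:
    x = torch.tensor(0.5, requires_grad=True)
    deep_chain(x, act).backward()
    # Through 50 sigmoids the gradient is vanishingly small (on the order of 1e-33),
    # while through 50 ReLUs it is exactly 1, because a positive input stays on the
    # active side and each ReLU layer has derivative 1 there.
    print(f"d(output)/d(input) through 50 {name} layers: {x.grad.item():.3e}")
```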

That being said, I am not an expert on residual networks, but I think the authors used the ReLU to further mitigate the vanishing gradient problem. This answer (that I gave some time ago) should give you some intuition about why residual networks can avoid the vanishing gradient problem.
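As for the original question: nothing in the residual idea itself forces ReLU. Below is a minimal sketch (assuming PyTorch; the names ResidualBlock, fc1 and fc2 are my own, and real ResNet blocks use convolutions and batch normalization rather than these fully connected layers) in which the activation is simply a constructor argument:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A simplified fully connected residual block: y = act(x + F(x)).

    The activation is a constructor argument, so ReLU is just a default
    choice, not a requirement of the residual connection itself.
    """
    def __init__(self, dim, activation=nn.ReLU):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.act = activation()

    def forward(self, x):
        out = self.act(self.fc1(x))
        out = self.fc2(out)
        return self.act(out + x)   # skip connection: identity path for the gradient

# The same block with three different activations.
for act in (nn.ReLU, nn.Tanh, nn.Sigmoid):
    block = ResidualBlock(dim=16, activation=act)
    y = block(torch.randn(4, 16))
    print(act.__name__, y.shape)
```

Swapping in nn.Tanh or nn.Sigmoid works mechanically; the skip connection still gives the gradient an identity path, which is part of why very deep residual networks remain trainable, although saturating activations may still slow training compared to ReLU.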

Answered by nbro on December 7, 2021
