# CNN: Details of Zeiler Fergus Net

I want to replicate the modified AlexNet of Zeiler and Fergus from 2013 ("Visualizing and Understanding Convolutional Networks"), but the paper omits some details. I hope someone here knows more about it.

1. What is their exact learning rate schedule? They just write “We
anneal the learning rate throughout training manually when the
validation error plateaus”.

2. Do they use weight decay?

3. In which layers do they “renormalize” the filters (they do not
divide the input by the global standard deviation)?

4. I do not understand their architecture completely. In the first
layer, 224 -> 110 with filters of width/height 7 and stride 2: do
they add a padding of one pixel on only one side, since 110*2+5=225,
or am I wrong? The same question applies to the 3×3 max-pooling
26 -> 13 with stride 2.

Cross Validated Asked by vrx on December 26, 2020

1-) That is a learning-rate schedule. When the current learning rate can no longer reduce the objective function, the rate is lowered and training continues. The behaviour is similar to overshooting: after some training, the learning rate becomes too large to reduce the error any further, so it is decreased by some amount. The simplest scheme is to divide the learning rate by a constant, e.g. 5 or 10.
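The paper only says the annealing was done "manually when the validation error plateaus", so the factor and patience below are hypothetical values, not the authors'. A minimal sketch of such a plateau rule:

```python
# Sketch of "reduce LR on plateau" annealing. The paper gives no factor
# or patience, so these defaults are assumptions for illustration.
def anneal_on_plateau(lr, val_errors, patience=3, factor=0.1, tol=1e-4):
    """Multiply lr by `factor` when validation error has not improved
    over the last `patience` epochs compared to the earlier best."""
    if len(val_errors) > patience:
        best_before = min(val_errors[:-patience])
        recent_best = min(val_errors[-patience:])
        if recent_best > best_before - tol:  # plateau: no real improvement
            lr *= factor
    return lr
```

In practice one would track the validation error each epoch and call this after every evaluation; frameworks offer the same idea ready-made (e.g. PyTorch's `ReduceLROnPlateau`).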

2-) I think they do, because AlexNet used it, and most of their settings are taken from AlexNet.
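AlexNet used L2 weight decay with coefficient 5e-4; whether Zeiler and Fergus kept that exact value is an assumption. The mechanism itself is just an extra term in the gradient, as in this sketch of a plain SGD step:

```python
# Minimal SGD step with L2 weight decay. The coefficient 5e-4 is the
# AlexNet value; its use here for the ZF net is an assumption.
def sgd_step(w, grad, lr=0.01, weight_decay=5e-4):
    # weight decay adds lambda * w to the gradient before the update,
    # which shrinks the weights toward zero each step
    return [wi - lr * (gi + weight_decay * wi) for wi, gi in zip(w, grad)]
```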

3-)

4-) During pooling, padding may be used so that the pooling regions tile the input exactly. For example, 3x3 pooling with stride 2 on a 26x26 input needs one pixel of padding on a single side.
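The arithmetic for both cases in the question can be checked with the standard output-size formula, padding one extra pixel on a single side:

```python
# Output size of a conv/pool layer with possibly asymmetric padding:
# floor((n + pad_left + pad_right - k) / s) + 1
def out_size(n, k, s, pad_left=0, pad_right=0):
    return (n + pad_left + pad_right - k) // s + 1

# conv1: 7x7 filter, stride 2, one pixel of padding on one side: 224 -> 110
print(out_size(224, 7, 2, pad_left=1))  # -> 110
# 3x3 max-pool, stride 2, one pixel of padding on one side: 26 -> 13
print(out_size(26, 3, 2, pad_left=1))   # -> 13
```

So yes: with a single pixel of one-sided padding, both the 224 -> 110 convolution and the 26 -> 13 pooling work out exactly.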

Answered by yasin.yazici on December 26, 2020
