Why is the stochastic gradient of a layer almost orthogonal to its weight?


In the paper Fixup Initialization: Residual Learning Without Normalization, on page 5, when discussing the effects of multipliers, the authors mention that

Specifically, as the stochastic gradient of a layer is typically almost orthogonal to its weight, learning rate decay tends to cause the weight norm equilibrium to shrink when combined with L2 weight decay.

I have two questions about this statement:

  1. Why is the stochastic gradient of a layer almost orthogonal to its weight?
  2. I can understand that L2 weight decay causes the weight norm to shrink. But why does learning rate decay tend to cause the weight norm equilibrium to shrink when combined with L2 weight decay? (I've sketched my current reading of this below.)
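
To make the question concrete, here is a small numerical sketch of how I currently read the statement. This is my own attempt, not code from the paper; the toy model, random data, and hyperparameters are arbitrary choices for illustration. The first part measures the cosine between each weight matrix and a stochastic gradient in a small PyTorch model. The second part iterates the update ||w||^2 <- (1 - lr*wd)^2 * ||w||^2 + lr^2 * ||g||^2, which is what the SGD + L2 step gives if the gradient g were exactly orthogonal to w, to see where the norm settles for different learning rates.

```python
# My own sketch (not from the Fixup paper); model, data, and hyperparameters
# below are made up for illustration.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Part 1: is a layer's stochastic gradient almost orthogonal to its weight?
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
x, y = torch.randn(1024, 64), torch.randint(0, 10, (1024,))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(100):                      # a few SGD steps so the weights are not purely random
    idx = torch.randperm(1024)[:32]          # stochastic mini-batch
    opt.zero_grad()
    F.cross_entropy(model(x[idx]), y[idx]).backward()
    opt.step()

# One more stochastic gradient, measured against the current weights.
idx = torch.randperm(1024)[:32]
opt.zero_grad()
F.cross_entropy(model(x[idx]), y[idx]).backward()
for name, p in model.named_parameters():
    if p.dim() > 1:                          # weight matrices only
        cos = F.cosine_similarity(p.grad.flatten(), p.detach().flatten(), dim=0)
        print(f"{name}: cos(grad, weight) = {cos.item():+.4f}")  # typically close to 0

# Part 2: if g is orthogonal to w, the SGD + L2 update w <- (1 - lr*wd)*w - lr*g
# gives ||w||^2 <- (1 - lr*wd)^2 * ||w||^2 + lr^2 * ||g||^2, whose fixed point is
# approximately ||w|| ~ sqrt(lr / (2*wd)) * ||g||.
wd, g_norm = 1e-2, 1.0                       # arbitrary weight decay and gradient norm
for lr in (0.1, 0.01, 0.001):
    w_norm = 1.0
    for _ in range(300_000):                 # iterate until the norm roughly settles
        w_norm = math.sqrt(((1 - lr * wd) * w_norm) ** 2 + (lr * g_norm) ** 2)
    predicted = math.sqrt(lr / (2 * wd)) * g_norm
    print(f"lr={lr}: simulated ||w|| = {w_norm:.3f}, predicted = {predicted:.3f}")
```

If this reading is right, decaying the learning rate directly lowers the equilibrium norm sqrt(lr/(2*wd)) * ||g||, which seems to be what the authors mean by the weight norm equilibrium shrinking. What I still don't see is why the near-orthogonality holds in the first place (question 1).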

