
The introduction of Fixed Variance Initialization and One Batch Norm is different from code. about katago HOT 5 OPEN

Nightbringers commented on August 15, 2024

The description of Fixed Variance Initialization and One Batch Norm in the documentation differs from the code.

from katago.

Comments (5)

lightvector commented on August 15, 2024

The variance is proportional to the block index + 1, because every time you have a (+), you are adding two things that we idealize as uncorrelated random vectors. When you add two things that are uncorrelated, the variance of the sum is simply the sum of the variances. Since the output of each block is idealized to have variance 1 due to the normalization within that block, the total variance of the activations in the trunk is the block index + 1.
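The additivity argument above can be checked numerically. The following is a minimal sketch, not KataGo code: it assumes each block's output is an independent unit-variance Gaussian vector (the idealization described above), and the function name `trunk_variances` and its parameters are made up for illustration.

```python
import random
import statistics

def trunk_variances(num_blocks, width=20000, seed=0):
    """Empirical variance of an idealized trunk after each residual add.

    Assumption: every block output is an independent unit-variance
    Gaussian vector, so it is uncorrelated with the trunk it is added to.
    """
    rng = random.Random(seed)
    trunk = [rng.gauss(0.0, 1.0) for _ in range(width)]  # input, variance ~1
    variances = [statistics.pvariance(trunk)]
    for _ in range(num_blocks):
        block_out = [rng.gauss(0.0, 1.0) for _ in range(width)]
        trunk = [t + b for t, b in zip(trunk, block_out)]  # the (+) step
        variances.append(statistics.pvariance(trunk))
    return variances

variances = trunk_variances(num_blocks=4)
for k, v in enumerate(variances):
    print(f"after {k} blocks: variance ~ {v:.2f} (idealized: {k + 1})")
```

The measured values only approximate the idealized ones because they come from a finite sample; in a real net the block outputs are also not exactly uncorrelated with the trunk.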


Nightbringers commented on August 15, 2024

What if I calculate the real variance of the input? What will happen?


lightvector commented on August 15, 2024

I updated https://github.com/lightvector/KataGo/blob/master/docs/KataGoMethods.md#fixed-variance-initialization-and-one-batch-norm with an additional diagram to make this more clear.

The point of this is to choose an initialization scale so that the entire operation the net is performing is variance-preserving at initialization. If you use the real variance of the input to the net instead of assuming the variance is 1, then the rule for each K would probably be to make the variance of the output of the normalized layer equal to whatever that real variance is, so that the variance scale is constant through the whole net from input to output.

If you still idealize the properties of all the layers and sums, then it makes no difference, because that will simply scale the variances of all activations in the neural net proportionally, and the K for each normalization will be the same.
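The proportional-scaling point can be seen with a few lines of arithmetic. This is a sketch under stated assumptions: `idealized_scales` is a hypothetical helper, and the per-block scale it returns is an illustrative stand-in for the constant K discussed above, not KataGo's actual formula. It models the trunk entering block i as the sum of i + 1 uncorrelated terms that each have the input's variance.

```python
import math

def idealized_scales(input_variance, num_blocks):
    """Per-block scale that brings the trunk back to the input variance.

    Hypothetical illustration: the trunk entering block i is treated as
    the sum of (i + 1) uncorrelated terms, each with `input_variance`.
    """
    scales = []
    for i in range(num_blocks):
        trunk_variance = input_variance * (i + 1)
        # Multiplying activations by s multiplies their variance by s**2.
        scales.append(math.sqrt(input_variance / trunk_variance))
    return scales

unit = idealized_scales(1.0, num_blocks=4)  # input variance assumed to be 1
real = idealized_scales(7.3, num_blocks=4)  # some measured input variance
for i, (a, b) in enumerate(zip(unit, real)):
    print(f"block {i}: scale {a:.4f} vs {b:.4f}")
```

The input variance cancels in the ratio, so both calls give the same scales: under the idealization, only the block index matters.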

If you use the actual empirical variance of every layer in the net instead of idealizing it, and normalize by the empirical value (taking into account the effect of doing so on subsequent layers properly), and continually update it throughout training, then you basically have batch normalization, or something very similar to it, depending on your implementation details.


Nightbringers commented on August 15, 2024

NestedBottleneckResBlock also differs from the description. Because use_repvgg_linear=True, NormActConv adds one more conv1x1.

This is the code:
    if self.conv1x1 is not None:
        out = self.conv(out) + self.conv1x1(out)
    else:
        out = self.conv(out)


lightvector commented on August 15, 2024

Yes, there are some details like that, good catch. In this case, though, it is still equivalent at inference time to not having the 1x1 convolution at all, and in fact the C++ code doesn't have any 1x1 convolution there for the net that gets exported for self-play. You can add the 1x1 convolution weight directly into the center cell of the 3x3 convolution weights, and the result is exactly equivalent to performing just the 3x3 convolution. During training, the extra 1x1 branch effectively gives the center cell of the 3x3 convolution a higher learning rate.

Edit: So mostly, you can consider this a pretty unimportant detail. The training of the net is almost the same if you set use_repvgg_linear=False, so it's not really worth mentioning in a section that discusses nested bottleneck residual blocks - whether we have the extra 1x1 or not is pretty orthogonal to what the overall architecture with the bottleneck blocks is accomplishing.
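The inference-time equivalence described above can be sketched in a few lines. This is a simplified, single-channel illustration in plain Python, not KataGo's export code: the helpers `conv3x3`, `conv1x1`, and `merge_1x1_into_3x3` are hypothetical names, and zero padding with stride 1 is assumed.

```python
def conv3x3(image, kernel):
    """Single-channel 3x3 cross-correlation with zero padding, stride 1."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        total += kernel[dy + 1][dx + 1] * image[yy][xx]
            out[y][x] = total
    return out

def conv1x1(image, weight):
    """Single-channel 1x1 convolution: a per-pixel scalar multiply."""
    return [[weight * v for v in row] for row in image]

def merge_1x1_into_3x3(kernel3, weight1):
    """Copy the 3x3 kernel and add the 1x1 weight to its center cell."""
    merged = [row[:] for row in kernel3]
    merged[1][1] += weight1
    return merged

image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
kernel3 = [[0.5, -1.0, 0.25], [2.0, 1.5, -0.5], [0.0, 1.0, -2.0]]
weight1 = 0.75

# Training-time form: self.conv(out) + self.conv1x1(out)
two_branch = [
    [a + b for a, b in zip(row3, row1)]
    for row3, row1 in zip(conv3x3(image, kernel3), conv1x1(image, weight1))
]
# Inference-time form: one 3x3 convolution with the merged kernel
single_conv = conv3x3(image, merge_1x1_into_3x3(kernel3, weight1))
```

With multiple channels the same idea applies per input/output channel pair: the 1x1 weight matrix is added to the center spatial position of the 3x3 weight tensor.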
