In the original implementation, there is:
# tensorflow
weight = tf.stop_gradient(tf.math.cumprod(tf.concat([tf.ones_like(disc[:1]), disc[:-1]], 0), 0))
and in your code:
# pytorch
discount_arr = torch.cat([torch.ones_like(discount_arr[:1]), discount_arr[1:]])
discount = torch.cumprod(discount_arr[:-1], 0)
I've tested that they are different when using the pcon predictor. For example:
# tensorflow
import numpy as np
import tensorflow as tf
x = np.arange(9).reshape(3,3)*0.1
y = tf.convert_to_tensor(x)
z = tf.math.cumprod(tf.concat([tf.ones_like(y[:1]), y[:-1]], 0), 0)
>>> z:
<tf.Tensor: shape=(3, 3), dtype=float64, numpy=
array([[1.  , 1.  , 1.  ],
       [0.  , 0.1 , 0.2 ],
       [0.  , 0.04, 0.1 ]])>
# pytorch
import torch
x = np.arange(9).reshape(3,3)*0.1
y = torch.as_tensor(x)
z = torch.cumprod(torch.cat([torch.ones_like(y[:1]), y[1:]]),0)
>>> z:
tensor([[1.0000, 1.0000, 1.0000],
[0.3000, 0.4000, 0.5000],
[0.1800, 0.2800, 0.4000]], dtype=torch.float64)
So why is the calculation different?
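For reference, here is a PyTorch line that mirrors the TensorFlow indexing exactly on the same test array (a sketch; `.detach()` plays the role of `tf.stop_gradient`):
# pytorch, mirroring the tensorflow indexing
import numpy as np
import torch
y = torch.as_tensor(np.arange(9).reshape(3,3)*0.1)
# prepend ones, drop the LAST row (y[:-1]) as in the tensorflow version
z = torch.cumprod(torch.cat([torch.ones_like(y[:1]), y[:-1]]), 0).detach()
>>> z:
tensor([[1.0000, 1.0000, 1.0000],
        [0.0000, 0.1000, 0.2000],
        [0.0000, 0.0400, 0.1000]], dtype=torch.float64)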
My guess at the reason is this: because the pcon predictor is a Bernoulli distribution, the samples are always either 0 or 1, so these two different ways of calculating the discount weight will always produce the same output. Is that right?
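One way to check that guess is to compare the two shifts on hand-picked 0/1 values (hypothetical samples standing in for actual pcon outputs):
# pytorch, hand-picked binary "samples"
import torch
disc = torch.tensor([1., 1., 0., 1.])
w_tf = torch.cumprod(torch.cat([torch.ones_like(disc[:1]), disc[:-1]]), 0)  # tensorflow-style shift
w_pt = torch.cumprod(torch.cat([torch.ones_like(disc[:1]), disc[1:]]), 0)   # shift from the pytorch test above
>>> w_tf:
tensor([1., 1., 1., 0.])
>>> w_pt:
tensor([1., 1., 0., 0.])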
But what if we want the pcon predictor to output a "soft" label? Which one is right then?