Batch size and penalty terms (kge issue, 6 comments, closed)

uma-pi1 commented on August 14, 2024

Batch size and penalty terms

from kge.

Comments (6)

samuelbroscheit commented on August 14, 2024

Gradients are not averaged. The losses are, however, if the loss is instantiated that way.

Relevant part is here:

https://github.com/uma-pi1/kge/blob/master/kge/job/train.py#L270

            penalty_values = self.model.penalty(
                epoch=self.epoch, batch_index=batch_index, num_batches=len(self.loader)
            )
            for pv_index, pv_value in enumerate(penalty_values):
                penalty_value = penalty_value + pv_value
                if len(sum_penalties) > pv_index:
                    sum_penalties[pv_index] += pv_value.item()
                else:
                    sum_penalties.append(pv_value.item())
            sum_penalty += penalty_value.item()
            batch_forward_time += time.time()
            forward_time += batch_forward_time

            # backward pass
            batch_backward_time = -time.time()
            cost_value = loss_value + penalty_value
            cost_value.backward()

penalty_value is not averaged.

rgemulla commented on August 14, 2024

Averaging the loss implies averaging the gradient. The relevant piece of code is here:

            loss_value, batch_size, batch_prepare_time, batch_forward_time = self._compute_batch_loss(
                batch_index, batch
            )
            sum_loss += loss_value.item() * batch_size

_compute_batch_loss is thus supposed to return the average loss over the batch, which is why the result is multiplied by the batch size afterwards.
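To illustrate the bookkeeping, here is a minimal sketch in plain Python (not the kge code). The per-example loss values and the helper name epoch_loss_sum are made up for illustration. It assumes the batch loss is a mean over the batch, so multiplying by the batch size recovers the sum over examples, and the accumulated epoch total does not depend on the batch size.

```python
# Hypothetical per-example losses; the values are made up for illustration.
example_losses = [0.5, 1.5, 2.0, 4.0, 1.0, 3.0]


def epoch_loss_sum(losses, batch_size):
    """Accumulate (mean batch loss * batch size) over one epoch."""
    sum_loss = 0.0
    for start in range(0, len(losses), batch_size):
        batch = losses[start:start + batch_size]
        mean_loss = sum(batch) / len(batch)  # what an averaging loss would return
        sum_loss += mean_loss * len(batch)   # undo the averaging for bookkeeping
    return sum_loss


# The epoch total matches the plain sum of all example losses
# (up to floating-point rounding) for every batch size tried here.
totals = {bs: epoch_loss_sum(example_losses, bs) for bs in (1, 2, 3, 6)}
print(totals)
```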

7aa82ce should thus be reverted. If anything, the norms should be averaged over the number of training examples. But since this number is constant (and can thus be viewed as part of lambda) and is not meaningful for all penalties, we shouldn't do this.

samuelbroscheit commented on August 14, 2024

So a smaller batch size (bs) means more penalty per epoch? With your suggestion we apply a penalty of N / bs * || T || per epoch, and with mine || T ||, correct?
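The arithmetic behind this question can be sketched as follows. The values of N and || T || and the function penalty_per_epoch are hypothetical. The second variant simply divides the penalty by the number of batches so that the epoch total is || T ||. This is an assumption made for illustration, not necessarily what the patch does.

```python
import math

N = 1000       # hypothetical number of training examples
norm_T = 2.0   # hypothetical value of || T ||


def penalty_per_epoch(batch_size, scale_by_num_batches):
    """Total penalty added to the cost over one epoch."""
    num_batches = math.ceil(N / batch_size)
    # Unscaled: every step adds the full || T ||.
    # Scaled (assumed variant): every step adds || T || / num_batches.
    per_step = norm_T / num_batches if scale_by_num_batches else norm_T
    return num_batches * per_step


for bs in (10, 100, 1000):
    unscaled = penalty_per_epoch(bs, scale_by_num_batches=False)
    scaled = penalty_per_epoch(bs, scale_by_num_batches=True)
    print(f"bs={bs}: unscaled={unscaled}, scaled={scaled}")
```

With the unscaled variant the epoch total grows as N / bs * || T ||, while the scaled variant stays at || T || for every batch size.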

Another question on this topic: why didn't we implement the weighted norm as in the CP paper?

rgemulla commented on August 14, 2024

Perhaps _compute_batch_loss should be renamed to _compute_batch_loss_avg to make it easier to interpret.

rgemulla commented on August 14, 2024

Yes, more penalty per epoch. But also more loss per epoch, since the gradient of every example now has more impact in every step. Again, without the patch:

E[gradient] = E[gradient of a random example] + gradient of penalty term

This is what we want: the expected gradient is independent of the batch size.
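A toy check of this claim, as a sketch under stated assumptions: a single scalar parameter w, a per-example loss 0.5 * (w - x)^2, a penalty lam * w^2, and equally sized batches that partition the data, so that averaging over the batches stands in for the expectation. All names and values are hypothetical. Exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction

xs = [Fraction(v) for v in (1, 2, 4, 5, 6, 12)]  # hypothetical training data
w = Fraction(3)          # current parameter value
lam = Fraction(1, 10)    # hypothetical penalty weight


def example_grad(w, x):
    # Gradient of the per-example loss 0.5 * (w - x)**2 with respect to w.
    return w - x


def penalty_grad(w):
    # Gradient of the penalty lam * w**2 with respect to w.
    return 2 * lam * w


def expected_step_gradient(batch_size):
    """Average the per-step gradient over all batches of one epoch."""
    step_grads = []
    for start in range(0, len(xs), batch_size):
        batch = xs[start:start + batch_size]
        mean_grad = sum(example_grad(w, x) for x in batch) / len(batch)
        # Averaged loss gradient plus the penalty gradient, which is not averaged.
        step_grads.append(mean_grad + penalty_grad(w))
    return sum(step_grads) / len(step_grads)


results = {bs: expected_step_gradient(bs) for bs in (1, 2, 3, 6)}
print(results)
```

In this setup every batch size yields the same expected step gradient, namely the mean example gradient plus the penalty gradient.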

rgemulla commented on August 14, 2024

As for the weighted norm: that's a separate issue. If you mean that frequency-based weighting is not implemented, see #20.
