
Comments (10)

Sachin19 commented on July 21, 2024

Thanks for pointing this out. You are right. I'll update the lambda values.


Sachin19 commented on July 21, 2024

Hi,

The loss term requires computing the modified Bessel function of the first kind (not the exponentially scaled one), which is simply scipy.special.iv, but that function is not numerically stable and can lead to issues during training. That is why we instead use scipy.special.ive() together with a -k term, which should give the same value as scipy.special.iv (since ive(v, k) = iv(v, k) * exp(-k)).
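
For context, a minimal sketch (not the repository's exact logcmk code) of how the -k term arises: the vMF normalizer is C_m(kappa) = kappa^(m/2-1) / ((2*pi)^(m/2) * I_{m/2-1}(kappa)), and since ive(v, k) = iv(v, k) * exp(-k), we have log I_v(k) = log ive(v, k) + k, so a -k appears when the Bessel term is subtracted.

```python
# Sketch only: stable log C_m(kappa) via the exponentially scaled Bessel function.
import numpy as np
from scipy.special import iv, ive

def log_cmk(m, kappa):
    """log C_m(kappa) = (m/2 - 1)*log(kappa) - (m/2)*log(2*pi) - log I_{m/2-1}(kappa),
    with -log I_v(kappa) rewritten as -log(ive(v, kappa)) - kappa."""
    v = m / 2.0 - 1.0
    return (v * np.log(kappa)
            - (m / 2.0) * np.log(2 * np.pi)
            - np.log(ive(v, kappa)) - kappa)

# For a kappa where iv itself is still representable, the two routes agree:
m, kappa = 300, 100.0
v = m / 2.0 - 1.0
direct = v * np.log(kappa) - (m / 2.0) * np.log(2 * np.pi) - np.log(iv(v, kappa))
print(np.isclose(log_cmk(m, kappa), direct))  # True
```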


amanjitsk commented on July 21, 2024

Noted! Cool, thanks for the quick reply!


amanjitsk commented on July 21, 2024

I was also trying to reproduce the loss function from the paper. I see the losses implemented in loss.py, but I am unable to figure out what's going on here - in particular, where is the

torch.log(1 + kappa) * (0.2-(out_vec_norm_t*tar_vec_norm_t).sum(dim=-1))

coming from? And why is the output embedding unit-normalized? Does that not defeat the purpose of the norm regularization version? Could you shed some light on what's going on here?

Thanks,
Amanjit


Sachin19 commented on July 21, 2024

This is just another regularization we were experimenting with, which gives slightly better results. Line 41 (commented out) gives the exact loss we used in the paper. kappa is the norm of the output vector, the log of which is multiplied with the dot product of the normalized vectors in the modified loss, so it is playing a role in the loss computation. I'm not sure what you mean by "defeat the purpose of the norm regularization version".
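
To make the comparison concrete, here is a hedged PyTorch sketch of the two variants being discussed. It assumes the paper's regularized loss has the form -log C_m(||e_hat||) + lambda_1*||e_hat|| - lambda_2*(e_hat . e(w)), and that the modified variant replaces the regularizers with the log(1 + kappa)-weighted cosine term quoted above; it is an illustration, not the repository's loss.py.

```python
# Illustrative sketch of the two loss variants (assumed forms, not the repo's loss.py).
import numpy as np
import torch
from scipy.special import ive

def log_cmk(m, kappa):
    """Placeholder log C_m(kappa); a real implementation needs a custom
    torch.autograd.Function so gradients flow through the Bessel term."""
    v = m / 2.0 - 1.0
    k = kappa.detach().cpu().numpy()
    out = v * np.log(k) - (m / 2.0) * np.log(2 * np.pi) - np.log(ive(v, k)) - k
    return torch.as_tensor(out, dtype=kappa.dtype)

def vmf_loss_paper(e_hat, e_w, m, lambda1, lambda2):
    """Assumed paper form: -log C_m(kappa) + lambda1*kappa - lambda2*(e_hat . e_w)."""
    kappa = e_hat.norm(dim=-1)            # predicted vector length = concentration
    dot = (e_hat * e_w).sum(dim=-1)       # e_w assumed unit-normalized
    return (-log_cmk(m, kappa) + lambda1 * kappa - lambda2 * dot).mean()

def vmf_loss_modified(e_hat, e_w, m):
    """Modified regularizer quoted above: log(1 + kappa) * (0.2 - cosine)."""
    kappa = e_hat.norm(dim=-1)
    cos = torch.nn.functional.cosine_similarity(e_hat, e_w, dim=-1)
    return (-log_cmk(m, kappa) + torch.log(1 + kappa) * (0.2 - cos)).mean()
```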


amanjitsk commented on July 21, 2024

Right, thanks for the reply. I was just trying to figure out the motivation for that loss objective. The reason I said the last statement was that I was not sure whether the model outputs unit-norm vectors by construction (e.g. by taking the output of the network and unit-normalizing it), because the first loss term (logcmk) would be constant if ||e_hat|| = 1. So I guess my question is: do you unit-normalize the output of the network before doing the nearest neighbour search at evaluation/test time, or do you expect the model to output "approximately unit-normalized" vectors and take them as given by the network?


Sachin19 commented on July 21, 2024

The purpose of the regularization we mention in the paper is to control the length of the output vector. By taking its log, as in line 42, we aim to reduce the effect the length of the vector has on the loss, as we did with lambda_2. It just empirically works better.

The model doesn't enforce any constraint on the output vectors, so ||e_hat|| is not 1. When using the vMF loss, we do nearest neighbor search using vMF probabilities as the metric, where the norm of the output vector plays a role. There is no requirement for the output vectors to be normalized. If you look at line 41, the loss is just written in a decomposed form as the norm multiplied by a unit-length vector, which is the same as the actual vector itself.
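
Roughly, "nearest neighbor search using vMF probabilities" could look like the sketch below (an illustration, assuming the target embeddings are unit-normalized and reusing a log C_m helper like the ones above). Since kappa * mu . e(w) = e_hat . e(w), the per-word score decomposes into a dot product plus a log C_m(||e_hat||) term that depends on the predicted norm.

```python
# Illustration only: scoring vocabulary embeddings by vMF log-density under e_hat.
import numpy as np
import torch
from scipy.special import ive

def log_cmk(m, kappa):
    # Stable log of the vMF normalizing constant (see the earlier sketch).
    v = m / 2.0 - 1.0
    return v * np.log(kappa) - (m / 2.0) * np.log(2 * np.pi) - np.log(ive(v, kappa)) - kappa

def vmf_log_density(e_hat, emb, m):
    """log vMF(e(w); mu=e_hat/||e_hat||, kappa=||e_hat||) for each row e(w) of emb.
    Rows of emb are assumed unit-normalized."""
    kappa = float(e_hat.norm())
    return float(log_cmk(m, kappa)) + emb @ e_hat    # (vocab,) log-densities

# Example: 5-word vocabulary of unit vectors in 300-d (random, for illustration).
emb = torch.nn.functional.normalize(torch.randn(5, 300), dim=-1)
e_hat = torch.randn(300)
print(vmf_log_density(e_hat, emb, m=300).argmax())   # index of highest-scoring word
```

For a single prediction the argmax over words matches a plain dot-product nearest neighbour; the log C_m(kappa) offset matters once scores from different predictions are compared, e.g. inside beam search.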


amanjitsk commented on July 21, 2024

Right, that makes sense. I was completely ignoring the vMF probability distribution and assuming a naive nearest-neighbour search; thanks for the clarification. So if I understand correctly, only the ground-truth word embeddings need to be unit-normalized, as required by the vMF density. Also, perhaps a minor question: is there a reason for not vectorizing the code for the loss functions in loss.py, or was it simply for more fine-grained control?


xxchauncey commented on July 21, 2024

Hi,

Previously you said that scipy.special.iv is numerically unstable; did you mean it can cause underflow in Python?

I applied the vMF loss to my model, and it seems that both scipy.special.iv and scipy.special.ive return 0.0, so the logarithm becomes infinite. The value of kappa produced by my model is only around 0.5, and the value of m I used is 300.

Does that suggest I should approximate the logarithm of C_m?
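
For what it's worth, the underflow is easy to reproduce: with m = 300 the Bessel order is m/2 - 1 = 149, and at kappa around 0.5 the true value of I_v(kappa) is on the order of 1e-350, below the smallest representable double, so both iv and ive return exactly 0.0. One possible workaround (an assumption on my part, not something from this repo) is to stay in log space, e.g. via the small-argument expansion I_v(k) ≈ (k/2)^v / Γ(v+1).

```python
# Reproducing the underflow and a possible log-space workaround (illustration only).
import numpy as np
from scipy.special import iv, ive, gammaln

m, kappa = 300, 0.5
v = m / 2 - 1                        # Bessel order: 149

print(iv(v, kappa), ive(v, kappa))   # both print 0.0: the true value underflows float64

# Small-argument expansion: I_v(k) ~ (k/2)^v / Gamma(v+1) as k -> 0, so
# log I_v(k) ~ v*log(k/2) - lgamma(v+1), which stays finite in log space.
log_iv_approx = v * np.log(kappa / 2) - gammaln(v + 1)
print(log_iv_approx)                 # about -807, i.e. I_v(kappa) ~ exp(-807)
```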


VictorSanh commented on July 21, 2024

> This is just another regularization we were experimenting with, which gives slightly better results. Line 41 (commented out) gives the exact loss we used in the paper. kappa is the norm of the output vector, the log of which is multiplied with the dot product of the normalized vectors in the modified loss, so it is playing a role in the loss computation. I'm not sure what you mean by "defeat the purpose of the norm regularization version".

It's just a small question about hyper-parameters: in line 41, it seems to me that lambda_1 and lambda_2 are swapped. From my understanding of the paper, 0.1 should be the multiplicative factor in front of the cosine similarity, not 0.01. Am I missing something?


