
Comments

Sachin19 commented on July 21, 2024

Hi,

  1. That is not too different from the values I obtained in the initial phases of training.
  2. Thanks for pointing this out; this is in fact a typo, and I will correct it.

Thanks,

xxchauncey commented on July 21, 2024

That's pretty strange. When I tried to reproduce your result, the approximation of log(C_m(k)) returns a negative value (~ -600) while the exact calculation gives a positive value (~ 400). That seems to indicate the exact calculation is correct and there's something wrong with my approximation, which is also pretty similar to the issue reported in #2.

By the way, in lines 34 and 35, do we really need to normalize the output and target embeddings?

In the given formula, I think mu is equal to the normalized output, so that kappa*mu becomes e_hat, which is exactly the raw output (where kappa is the norm of e_hat). Therefore, I guess e_hat doesn't need to be normalized.

As for the target embedding, I noticed that in the earlier initialization step you already normalize the embeddings when loading them into the weight matrix. Is it also unnecessary to normalize the target embedding?

Thanks!

Here is my code segment:

    # These are methods of my loss module (an nn.Module); self.m is the embedding
    # dimension and self.embeddings holds the pre-normalized target embeddings.
    # Imports used: torch, numpy as np, scipy.special, torch.nn.functional as F.
    def log_Cm_k(self, kappa):
        k = kappa.detach().double()  # note: detach() means no gradient flows back through this term
        # exact value via scipy (commented out):
        # result = (self.m / 2 - 1) * torch.log(k) - (self.m / 2) * np.log(2 * np.pi) - torch.log(scipy.special.ive(self.m / 2 - 1, k)) - k
        v = self.m / 2 - 1
        # approximation from the appendix of the paper
        result = torch.sqrt((v + 1) ** 2 + k ** 2) - (v - 1) * torch.log(v - 1 + torch.sqrt((v + 1) ** 2 + k ** 2))
        return result.float()

    def forward(self, output, target):
        tar_emb = self.embeddings(target)

        # normalization currently disabled on both sides:
        # tar_emb_norm = F.normalize(tar_emb, p=2, dim=-1)  # e(w)
        # out_emb_norm = F.normalize(output, p=2, dim=-1)  # e^
        tar_emb_norm = tar_emb
        out_emb_norm = output

        print('\nFirst Output Embedding:', out_emb_norm[0])
        kappa = output.norm(p=2, dim=-1)  # ||e^||
        print('\nValue of Kappa (norm):', kappa[0])

        print('\nApproximation of LogCmk:', self.log_Cm_k(kappa[0]))

        # unregularized version (commented out):
        # loss = - self.log_Cm_k(kappa) - (out_emb_norm * tar_emb_norm).sum(dim=-1)
        loss = - self.log_Cm_k(kappa) + torch.log(1 + kappa) * (0.2 - (out_emb_norm * tar_emb_norm).sum(dim=-1))

        loss = loss.float().mean()

        return loss

and here is the output (taking the first output embedding as an example) when the embedding size is 300:

First Output Embedding: tensor([ 0.0054,  0.0112, -0.0089, -0.0022,  0.0018, -0.0107,  0.0149,  0.0017,
        -0.0041, -0.0037,  0.0115, -0.0065,  0.0211,  0.0437, -0.0045, -0.0017,
         0.0193,  0.0230,  0.0047,  0.0083, -0.0063, -0.0044,  0.0019, -0.0077,
        -0.0005, -0.0019,  0.0156,  0.0005, -0.0103, -0.0071,  0.0112, -0.0012,
         0.0118,  0.0006,  0.0029, -0.0146, -0.0144,  0.0094,  0.0075,  0.0261,
         0.0155, -0.0112,  0.0016,  0.0037,  0.0061, -0.0244,  0.0139, -0.0180,
        -0.0118, -0.0108,  0.0023,  0.0239, -0.0300, -0.0007, -0.0110, -0.0019,
        -0.0232, -0.0009,  0.0025,  0.0000, -0.0020,  0.0144, -0.0230, -0.0124,
         0.0047, -0.0049, -0.0122, -0.0259,  0.0063, -0.0044,  0.0117,  0.0013,
         0.0528,  0.0033,  0.0016, -0.0105,  0.0220,  0.0155,  0.0352,  0.0366,
         0.0044,  0.0022, -0.0207,  0.0072, -0.0074, -0.0215,  0.0159, -0.0220,
         0.0059,  0.0040, -0.0095, -0.0071, -0.0115, -0.0039, -0.0318, -0.0081,
        -0.0166, -0.0238, -0.0221,  0.0056,  0.0475, -0.0023,  0.0038, -0.0003,
        -0.0026,  0.0037, -0.0016, -0.0038,  0.0056, -0.0059, -0.0056,  0.0054,
         0.0189, -0.0062,  0.0115, -0.0151,  0.0125,  0.0305, -0.0019, -0.0181,
         0.0005,  0.0085,  0.0059,  0.0044, -0.0246,  0.0100,  0.0117, -0.0191,
        -0.0129,  0.0283, -0.0055, -0.0271,  0.0059, -0.0104,  0.0113,  0.0046,
        -0.0077,  0.0140,  0.0069, -0.0024,  0.0130,  0.0007, -0.0137, -0.0045,
         0.0033, -0.0268, -0.0022, -0.0226, -0.0059, -0.0156, -0.0058, -0.0086,
         0.0090,  0.0083, -0.0233, -0.0027, -0.0062, -0.0191,  0.0092, -0.0222,
         0.0079, -0.0119,  0.0272,  0.0227, -0.0063, -0.0040,  0.0211, -0.0008,
        -0.0051, -0.0034, -0.0021,  0.0100, -0.0298,  0.0210, -0.0255, -0.0060,
        -0.0205, -0.0011,  0.0122, -0.0406,  0.0020, -0.0119,  0.0171,  0.0262,
         0.0078, -0.0363,  0.0032,  0.0111, -0.0044, -0.0177,  0.0308, -0.0180,
        -0.0131,  0.0055,  0.0063, -0.0040, -0.0334, -0.0175,  0.0297,  0.0012,
        -0.0046, -0.0065,  0.0193,  0.0146, -0.0044, -0.0064, -0.0022, -0.0195,
        -0.0102, -0.0085, -0.0025, -0.0119,  0.0108,  0.0152,  0.0356,  0.0067,
        -0.0092,  0.0034, -0.0174, -0.0261,  0.0229, -0.0317,  0.0095,  0.0163,
        -0.0203, -0.0007,  0.0015, -0.0238, -0.0053,  0.0286, -0.0073,  0.0061,
         0.0168,  0.0085,  0.0114,  0.0034,  0.0131,  0.0058,  0.0020, -0.0104,
         0.0177,  0.0067, -0.0195, -0.0280, -0.0075, -0.0002,  0.0053, -0.0091,
        -0.0157,  0.0006,  0.0087, -0.0035,  0.0174, -0.0112, -0.0036,  0.0018,
         0.0127,  0.0031, -0.0080,  0.0132, -0.0142,  0.0318,  0.0048, -0.0215,
        -0.0073,  0.0089,  0.0024, -0.0103, -0.0090, -0.0260, -0.0082,  0.0028,
         0.0106,  0.0158, -0.0020,  0.0097,  0.0013, -0.0027,  0.0096,  0.0033,
         0.0073,  0.0070,  0.0087, -0.0065,  0.0101,  0.0247, -0.0131, -0.0033,
        -0.0131,  0.0387,  0.0141, -0.0124, -0.0241,  0.0086,  0.0123,  0.0268,
        -0.0127,  0.0035,  0.0256,  0.0316], grad_fn=<SelectBackward>)

Value of Kappa (norm): tensor(0.2635, grad_fn=<SelectBackward>)

Approximation of LogCmk: tensor(-693.1697)

xxchauncey commented on July 21, 2024

Hi,

  1. That is not too different from the values I obtained in the initial phases of training.
  2. Thanks for pointing this out; this is in fact a typo, and I will correct it.

Thanks,

Hi,

I checked both your paper and the code again, and I think it's not a typo in the code.

I differentiated the approximation of log(C_m(k)) provided in the appendix of the paper and got a result with the opposite sign of the approximated derivative. I suspect the typo is in the paper rather than in the code.

Sachin19 commented on July 21, 2024

Hi Chauncey,
I wrote my previous answer in haste, here are some clarifications:

  1. The notation is slightly confusing. The approximation provided in the paper/code is not an approximation of log(C_m(k)) itself. It is just a function whose gradient serves as an approximation of the gradient of log(C_m(k)). The integral is provided simply as a convenient pseudo-loss so that gradients can be computed by PyTorch's autograd. That is why the loss values don't exactly match.

  2. As you pointed out, my original code without the negative sign was correct. The pseudo-loss that is provided has an (approximate) gradient which is the negative of the gradient of log(C_m(k)); since the negative sign is already incorporated, we don't need to negate it again (a quick numerical check is sketched at the end of this comment). Your original comment threw me off. There is a mistake in the paper in that I referred to the integral as log(C_m(k)) in the appendix, which I will correct; thanks for pointing it out.

  3. Yes, you need to normalize the target embedding. The way kappa and mu are computed from the output vector is exactly the way you wrote it in your code.

  4. If you notice that the loss is not working for you, it is likely that you need to tune the hyperparameter '\lambda'. I suggest you first try training with cosine loss to see how it performs and later switch to vMF. I have noticed in some of my experiments that with reasonably sized vocabularies (~30000), cosine works just fine.

Hope this addresses your concerns.
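
If it helps, here is a minimal check of what I mean in point 2 (my own sketch, not code from the repo; m = 300 and kappa = 100 are just example values chosen so that scipy's ive does not underflow):

    import torch
    import scipy.special

    m = 300              # embedding dimension (an assumption for this check)
    v = m / 2 - 1

    # the pseudo-loss from the appendix / the code; only its gradient matters
    kappa = torch.tensor(100.0, dtype=torch.float64, requires_grad=True)
    pseudo = torch.sqrt((v + 1) ** 2 + kappa ** 2) - (v - 1) * torch.log(v - 1 + torch.sqrt((v + 1) ** 2 + kappa ** 2))
    pseudo.backward()

    # exact gradient of -log(C_m(kappa)) is the Bessel ratio I_{m/2}(kappa) / I_{m/2-1}(kappa);
    # the exp(-kappa) scaling inside ive cancels in the ratio
    exact = scipy.special.ive(m / 2, kappa.item()) / scipy.special.ive(m / 2 - 1, kappa.item())

    print('autograd gradient of the pseudo-loss:', kappa.grad.item())
    print('gradient of -log(C_m(kappa)) (Bessel ratio):', exact)

The two printed gradients should roughly agree, which is why the pseudo-loss enters the loss with a positive sign and needs no extra negation.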

xxchauncey commented on July 21, 2024

May I ask how your loss changes: starting from ~ -400, to what value does it converge?

I'm just curious because my loss keeps decreasing to ever larger negative values.

Sachin19 commented on July 21, 2024

For most of my experiments the loss value in the beginning is around -400, which reduces to about -430 when it converges. In your case, the loss is likely decreasing to large negative values because the regularization, as described in the paper, is not enough. The loss has two terms: -log(C_m(kappa)) and -e_hat . e_target. If proper regularization isn't used, the model can just keep increasing kappa (the length of e_hat), which will keep decreasing the loss value. That is why in one of the experiments I use log(kappa), which makes the growth of kappa sub-linear (in the second loss component). You might want to try decreasing the value of \lambda_2.
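
To spell that out, here is a rough sketch of how I think of the two terms and the role of \lambda_2 (the function name, the placeholder default value, and `pseudo_log_Cm_k` are mine, not the repo's exact code):

    import torch
    import torch.nn.functional as F

    def nllvmf_regularized(e_hat, e_target, pseudo_log_Cm_k, lambda2=0.1):
        # e_hat: raw decoder output; e_target: target embedding (unit-normalized below)
        kappa = e_hat.norm(p=2, dim=-1)                                   # ||e_hat||
        dot = (e_hat * F.normalize(e_target, p=2, dim=-1)).sum(dim=-1)   # e_hat . e(w)
        # first term: pseudo-loss whose gradient approximates that of -log(C_m(kappa))
        term1 = pseudo_log_Cm_k(kappa)
        # second term, scaled down by lambda2 so the model cannot keep lowering the
        # loss simply by inflating kappa = ||e_hat||; one variant replaces the linear
        # dependence on kappa here with log(kappa) to make it grow sub-linearly
        term2 = -lambda2 * dot
        return (term1 + term2).mean()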

xxchauncey commented on July 21, 2024

Thank you very much for your replies; they do indeed help a lot.

If you don't mind, I also wonder what the final value of your cosine loss was when training with the NLLvMF loss.

Sachin19 commented on July 21, 2024

Happy to help! The final cosine loss was in the range 0.17-0.20 for different datasets.

xxchauncey commented on July 21, 2024

Could you please provide some suggestions on why the cosine loss is so high? Both of my models, trained separately with the NLLvMF loss and with the cosine loss, converge to around 0.5.

Sachin19 commented on July 21, 2024

In my case, the cosine loss wasn't high. I defined cosine loss as 1-cosine_similarity(e_out, e_target), and a range of 0.17-0.20 is quite reasonable.
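
In code, that definition is just the following (a one-line restatement for clarity, not the repo's exact implementation):

    import torch
    import torch.nn.functional as F

    def cosine_loss(e_out, e_target):
        # 1 - cos(e_out, e_target), averaged over the batch
        return (1 - F.cosine_similarity(e_out, e_target, dim=-1)).mean()

    # e.g. cosine_loss(torch.randn(8, 300), torch.randn(8, 300))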

If you are doing it for language modeling, 0.5 doesn't seem that bad. If you compare it with softmax based perplexity, for machine translation datasets, after training it lies ~5-10 and for language modeling it's usually much higher ~40-100.

I would be able to help more if you could tell me more details about your experiment. 

xxchauncey commented on July 21, 2024

My experiment is basically to see how well the NLLvMF loss works on the language modelling task. The neural model I'm using is the AWD-LSTM. In the experiment, I dropped the final linear layer that produces logits for the softmax and replaced the cross-entropy loss with the NLLvMF loss.
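
Concretely, the change I made is along these lines (an illustrative sketch with placeholder names and sizes, not the actual AWD-LSTM code):

    import torch
    import torch.nn as nn

    hidden_dim = 1150   # LSTM hidden size (placeholder value)
    emb_dim = 300       # dimension of the pretrained target embeddings (placeholder value)

    # softmax baseline:   hidden -> nn.Linear(hidden_dim, vocab_size) -> cross-entropy
    # continuous output:  hidden -> nn.Linear(hidden_dim, emb_dim)    -> NLLvMF against e(w)
    to_embedding = nn.Linear(hidden_dim, emb_dim)

    def predict_embeddings(hidden_states):
        # hidden_states: (batch, seq_len, hidden_dim) from the LSTM
        return to_embedding(hidden_states)  # e_hat, fed to the NLLvMF (or cosine) loss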

As we know, the commonly used evaluation metric for language modelling is perplexity, which requires a probability for each word in the vocabulary. It seems that at test time the score (i.e. the log-likelihood) produced by the vMF pdf cannot be treated as a proper probability for computing perplexity; I still haven't gotten any good results after tuning the value of lambda. I guess the score can only be used for ranking?

This question extends to machine translation when doing beam search rather than greedy search. The standard way to do beam search is to multiply the probabilities of the words in each sub-sequence and compare the resulting products, so per-word probabilities are needed. I'm not sure whether it is reliable to do beam search by summing up the scores (log-likelihoods of the density) and comparing the accumulated score.
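
To be concrete, what I have in mind is something like this sketch (my own illustration; `log_Cm_k` stands for whatever approximation of log(C_m(kappa)) is used):

    import torch

    def step_log_densities(e_hat, vocab_emb_norm, log_Cm_k):
        # e_hat: (d,) raw output at one decoding step
        # vocab_emb_norm: (V, d) unit-norm target embeddings
        kappa = e_hat.norm(p=2)            # concentration = ||e_hat||
        scores = vocab_emb_norm @ e_hat    # kappa * mu . e(w) for every word w
        return log_Cm_k(kappa) + scores    # (V,) vMF log-densities

    # For ranking candidates at a single step, log_Cm_k(kappa) is the same for every
    # word, so the ranking reduces to the dot product with the unit-norm embeddings;
    # for beam search one would accumulate these per-step log-densities along each
    # hypothesis, the way log-probabilities are accumulated in the softmax case.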

Sachin19 commented on July 21, 2024

Hi, it seems I missed your message. Yes, the vMF gives a probability density, not a probability, so the scores are not directly comparable to probabilities or perplexity. They can be used for ranking, though, I guess.

Theoretically, there's nothing wrong with using the densities for doing beam search. But in our experiments we didn't get any improvements with them.
