Hi, I am facing a strange problem that I have been struggling with for ages.

Cannot match validation loss from training when calculating after training about nolearn HOT 17 CLOSED

dnouri commented on July 2, 2024

Cannot match validation loss from training when calculating after training

from nolearn.

Comments (17)

dnouri commented on July 2, 2024

Hmm, maybe it'd help to see some code. Number 3 isn't entirely clear to me. What does your EarlyStopping implementation look like?

run2 commented on July 2, 2024

Here it is. I do a few extra things with more_params, but the rest is almost the same as your code.

import numpy as np

class EarlyStopping(object):

    def __init__(self, patience=100):
        self.patience = patience
        self.best_valid = np.inf
        self.best_valid_epoch = 0
        self.best_weights = None

    def __call__(self, nn, train_history):
        if(bool(nn.more_params) and 'reset' in nn.more_params and nn.more_params['reset'] == 1):
            self.best_valid = np.inf
            self.best_valid_epoch = 0
            self.best_weights = None
            nn.more_params['reset'] = 0
            #print 'Patience is set at ' + str(self.patience)
            #print 'Max epochs is ' + str(nn.max_epochs)

        current_valid = train_history[-1]['valid_loss']
        current_epoch = train_history[-1]['epoch']
        #print str(current_epoch)
        #if(current_epoch%100==0):
        #    print("Saving state.")
        #    print("Best valid loss was {:.6f} at epoch {}.".format(
        #        self.best_valid, self.best_valid_epoch))
        #    nn.load_weights_from(self.best_weights)
        #    with open('models/' + current_epoch + '.model', 'wb') as f:
        #        pickle.dump(nn, f, -1)

        if current_valid < self.best_valid:
            print('Ressing best')
            self.best_valid = current_valid
            self.best_valid_epoch = current_epoch
            self.best_weights = [w.get_value() for w in nn.get_all_params()]
            nn.more_params['best_valid_fold_' + str(nn.more_params['fold'])] = self.best_valid
        if (self.best_valid_epoch + self.patience < current_epoch):
            print("Early stopping.")
            print("Best valid loss was {:.6f} at epoch {}.".format(
                self.best_valid, self.best_valid_epoch))
            nn.load_weights_from(self.best_weights)
            nn.more_params['best_valid_fold_' + str(nn.more_params['fold'])] = self.best_valid
            raise StopIteration()
        elif (current_epoch == nn.max_epochs):
            print("Loading best weights")
            nn.load_weights_from(self.best_weights)
            nn.more_params['best_valid_fold_' + str(nn.more_params['fold'])] = self.best_valid

run2 commented on July 2, 2024

So, I can see "Loading best weights" being printed only when the valid loss decreases, and also at the max epoch (if the loss decreased smoothly until the max epoch). You will also notice that I have put the resetting bit (sorry, "Ressing" is a typo there) in its own if, so that loading the weights is independent of that if (which I believe is right).

The point is that the weights are loaded correctly from the best weights, whether training ends at the max epoch or through the patience override. But when I use that net after the .fit() call returns, it gives me a different loss on the same validation set.

Here is how I am calculating the loss:

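import numpy as np

def get_log_loss(y_actual, y_pred):
    # Flatten the labels to a 1D array of shape [nsamples]
    y_actual = y_actual.reshape(y_actual.shape[0])

    # One-hot encode the labels to match the shape of y_pred
    vec_actual = np.zeros(y_pred.shape)
    sizeOfSet = vec_actual.shape[0]
    vec_actual[np.arange(sizeOfSet), y_actual.astype(int)] = 1

    # Mean negative log-probability assigned to the true class
    loss_sum = np.sum(vec_actual * np.log(y_pred))
    loss = -1.0 / sizeOfSet * loss_sum
    return loss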

dnouri commented on July 2, 2024

If you're not sure that you're calculating the loss right, maybe you should try and call your numpy version and the Theano version used by the net with the same values, and verify that they produce the same output.

Here's an implementation that I have lying around:

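import numpy as np

def logloss(y_true, y_pred):
    # Clip predictions away from 0 and 1 so both logs stay finite.
    # (1 - 1e-18 rounds to exactly 1.0 in float64, so use 1e-15.)
    epsilon = 1e-15
    y_pred = np.maximum(epsilon, y_pred)
    y_pred = np.minimum(1 - epsilon, y_pred)
    # The built-in sum adds along the first axis only, so 2D inputs
    # produce one value per column rather than a single scalar.
    ll = (sum(y_true * np.log(y_pred) +
              np.subtract(1, y_true) *
              np.log(np.subtract(1, y_pred)))
          )
    ll = ll * -1.0 / len(y_true)
    return ll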

run2 commented on July 2, 2024

Daniel, there is a problem somewhere. It would be great if you could check the best validation loss from training against the same loss computed after training, on any non-regression net you have at hand. If you get the same value, then I have definitely mucked something up. If not, then something is not quite right in the library. I am working on it too (for a few days now :()

dnouri commented on July 2, 2024

@run2: Yes, I'm doing this on a classification net and it's giving me consistent results. Have you checked that your get_log_loss function is right?

run2 commented on July 2, 2024

Daniel, the default log loss for non-regression problems is
return -T.mean(T.log(output)[T.arange(prediction.shape[0]), prediction])
Does that not equate to the get_log_loss code I pasted above?

y_pred is [nsamples,nclasses] (2D array) from predict_proba
y_actual is [nsamples,] 1D array of actual class labels
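
One way to answer that question without Theano is to transcribe both formulas into NumPy and feed them the same values. The sketch below is only an illustration: `one_hot_log_loss` mirrors the one-hot approach of get_log_loss, `indexed_log_loss` is a NumPy transcription of the indexing expression quoted above, and the random probabilities and labels are made up for the comparison.

```python
import numpy as np

def one_hot_log_loss(y_actual, y_pred):
    """One-hot formulation, following the get_log_loss approach."""
    y_actual = np.asarray(y_actual).reshape(-1).astype(int)
    n_samples = y_pred.shape[0]
    one_hot = np.zeros(y_pred.shape)
    one_hot[np.arange(n_samples), y_actual] = 1
    return -np.sum(one_hot * np.log(y_pred)) / n_samples

def indexed_log_loss(y_actual, y_pred):
    """NumPy transcription of -mean(log(output)[arange(n), labels])."""
    y_actual = np.asarray(y_actual).reshape(-1).astype(int)
    n_samples = y_pred.shape[0]
    return -np.mean(np.log(y_pred)[np.arange(n_samples), y_actual])

# Made-up data: 50 samples, 4 classes, rows normalised to sum to 1.
rng = np.random.RandomState(0)
logits = rng.randn(50, 4)
y_pred = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
y_actual = rng.randint(0, 4, size=50)

difference = abs(one_hot_log_loss(y_actual, y_pred)
                 - indexed_log_loss(y_actual, y_pred))
print(difference)
```

Both functions pick out the log-probability of the true class for each row and average it, so on identical inputs they should agree up to floating-point rounding. This only holds when the labels are integers in the range 0 to nclasses - 1.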

dnouri commented on July 2, 2024

I just tried to reproduce this issue. There's a test called test_lasagne_functional_mnist, and I added this bit of code right after the line assert accuracy_score...:

    # assert accuracy_score(y_pred, y_test) > 0.85 ...

    from nolearn.lasagne import negative_log_likelihood
    X_train, X_valid, y_train, y_valid = nn.train_test_split(
        X_train, y_train, nn.eval_size)
    y_pred = nn.predict_proba(X_valid)
    loss = negative_log_likelihood(y_pred, y_valid).eval()
    assert abs(nn.train_history_[-1]['valid_loss'] - loss) < 0.01

So looks like it's matching up for this small example. Any more ideas?

run2 commented on July 2, 2024

let me try that.

run2 commented on July 2, 2024

So this is getting trickier.

I printed the size of my validation set from within the train_test_split method, and it printed 5509.
I printed the size of y_pred after calling predict_proba (as in your code) and got shape[0] as 5500, so it had skipped 9 examples. Note that my batch size is 20. That by itself can cause a difference, but I am sure it is not the only reason.
I could not get any further than that, as your method failed with raise TypeError('index must be integers') in File "/home/debanjan/pythonrepos/Theano/theano/tensor/subtensor.py", line 1980, in as_index_variable, even though I checked the y_pred and y_valid variables and they seemed to be fine. It is not because the sizes do not match; I checked that.
Maybe a label encoding problem is lurking somewhere. I printed y_valid from within train_test_split when it ran your code above, and the labels were one more than those from the same print statement fired during training. Remember, I am using a label encoder, and my labels are integers starting from 1 (there is no 0). That's the reason they are one more when calling predict_proba directly.
It is also possible that not all labels are present in both the train and valid sets. Hmm. I am just thinking out loud.

Does that make any sense?
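
To illustrate the label-encoding concern, here is a minimal NumPy-only sketch. The label values, the `encode` helper, and the probability matrix are all hypothetical; the thread does not show run2's actual encoder. It assumes every validation label also occurs in the training labels.

```python
import numpy as np

# Hypothetical raw labels that start at 1, as described above.
y_train_raw = np.array([1, 3, 2, 1, 3, 2])
y_valid_raw = np.array([2, 1, 3])

# Fit the mapping once, on the training labels only.
classes = np.unique(y_train_raw)  # sorted unique labels: [1, 2, 3]

def encode(labels):
    """Map raw labels to 0-based indices using the fitted classes."""
    return np.searchsorted(classes, labels)

y_valid_encoded = encode(y_valid_raw)

# Made-up class probabilities for the three validation samples.
proba = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1],
                  [0.2, 0.2, 0.6]])
rows = np.arange(len(y_valid_raw))

# With encoded labels, each row selects its true-class probability.
loss_encoded = -np.mean(np.log(proba[rows, y_valid_encoded]))

# With raw labels, every column index is shifted by one and the
# largest label falls outside the matrix.
try:
    proba[rows, y_valid_raw]
    raw_labels_ok = True
except IndexError:
    raw_labels_ok = False

print(y_valid_encoded, loss_encoded, raw_labels_ok)
```

If the largest raw label happened to fit inside the matrix, the lookup would not fail; it would silently read the wrong column, which is harder to notice than an exception.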

run2 commented on July 2, 2024

Ah!! Finally it matches, Daniel! God, I spent days on this.
So I think there are two reasons (I am still running some more tests):

1) The validation loss during training skips examples to fit the batch size. I would have thought it should pad the set (to n+p), then take the first n results and dismiss the last p when calculating the loss. With a large batch size, this might cause quite a difference.
2) When using a label encoder, you need to make sure that the labels are encoded too when predicting on a test or validation set. This is something I had already done, so that was not the problem.
3) The method I have written does not give the same result as the negative_log_likelihood result from Theano. The problem mentioned above was sorted by type casting y_pred to np.int32, so I got a result from your code. But it was 0.006 off from the result from my code. I am not too happy to see that difference. I am running further tests to see how bad that difference can be.
4) For some reason, when I execute your code for log loss, it gives me back an array, not a value. I will check again tomorrow after I catch some sleep.
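
The effect of dropping the incomplete last batch can be sketched with plain NumPy. This is not nolearn's batch iterator, which the thread does not show; `batch_slices` and `mean_loss` are hypothetical helpers, and the per-sample losses are random numbers. Only the set size (5509) and the batch size (20) come from the discussion above.

```python
import numpy as np

def batch_slices(n_samples, batch_size, drop_remainder):
    """Yield slices over the data, optionally skipping the short last batch."""
    n_full = n_samples // batch_size
    for i in range(n_full):
        yield slice(i * batch_size, (i + 1) * batch_size)
    if not drop_remainder and n_samples % batch_size:
        yield slice(n_full * batch_size, n_samples)

def mean_loss(per_sample_loss, batch_size, drop_remainder):
    """Average the loss over every sample that the batches actually cover."""
    total, count = 0.0, 0
    n_samples = len(per_sample_loss)
    for sl in batch_slices(n_samples, batch_size, drop_remainder):
        chunk = per_sample_loss[sl]
        total += chunk.sum()
        count += len(chunk)
    return total / count, count

# Made-up per-sample losses for a validation set of 5509 examples.
rng = np.random.RandomState(0)
per_sample_loss = rng.exponential(size=5509)

loss_dropped, n_dropped = mean_loss(per_sample_loss, 20, drop_remainder=True)
loss_full, n_full = mean_loss(per_sample_loss, 20, drop_remainder=False)
print(n_dropped, n_full)
print(loss_dropped, loss_full)
```

With a batch size of 20, only 275 full batches fit into 5509 examples, which accounts for the 5500 rows seen earlier. Weighting the last short batch by its actual length, as `mean_loss` does, recovers the mean over the whole set without any padding.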

run2 commented on July 2, 2024

OK, I was wrong. It matches, but with the last validation loss, not the best validation loss.
I am still clueless as to why it does not match the best validation loss, even though the right weights (those from the best validation loss) are being copied.

run2 commented on July 2, 2024

OK, Daniel. I have just solved this issue, and it is a bug.

You need to check whether _output_layer is None before you initialize the layers in load_weights_from.
I am still figuring out why, but if you have the code as below, it thinks it has loaded the weights when it has not.

def load_weights_from(self, source):
    self._output_layer = self.initialize_layers()

If I change it to

if self._output_layer is None:
    self._output_layer = self.initialize_layers()

Then it works fine, and I get the same validation error outside training as I get inside, for the best validation loss.

Please try it out on your side and confirm.
Point 1) from my post before last is another reason for the difference.
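
The thread does not establish why the unguarded version fails. One plausible mechanism, offered here purely as an assumption, is that re-initializing the layers creates new parameter objects while an already compiled prediction function keeps using the old ones. The toy class below has nothing to do with nolearn's real internals; `ToyNet` and all of its methods are hypothetical.

```python
class ToyNet(object):
    """Toy stand-in for a net whose predict function captures its parameters."""

    def __init__(self):
        self._params = None
        self._predict = None

    def initialize_layers(self):
        # Every call builds a brand new parameter container.
        return {'w': 0.0}

    def fit_one_step(self, w):
        if self._params is None:
            self._params = self.initialize_layers()
        params = self._params
        # The "compiled" function is bound to this particular container.
        self._predict = lambda x: params['w'] * x
        params['w'] = w

    def load_weights_unguarded(self, w):
        # Replaces the container; the compiled function still sees the old one.
        self._params = self.initialize_layers()
        self._params['w'] = w

    def load_weights_guarded(self, w):
        # Reuses the existing container, so the compiled function sees the update.
        if self._params is None:
            self._params = self.initialize_layers()
        self._params['w'] = w

    def predict(self, x):
        return self._predict(x)

net = ToyNet()
net.fit_one_step(5.0)            # weights after the last epoch
net.load_weights_unguarded(2.0)  # attempt to restore the best weights
stale = net.predict(1.0)

net = ToyNet()
net.fit_one_step(5.0)
net.load_weights_guarded(2.0)
fresh = net.predict(1.0)
print(stale, fresh)
```

In this toy, the unguarded load leaves predictions on the last-epoch weights, which would be consistent with the loss matching the last validation loss rather than the best one. Whether the same thing happened inside nolearn is not confirmed anywhere in this discussion.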

dnouri commented on July 2, 2024

@run2 Could you maybe help with reproducing this issue? I've added a test, but I'm not able to make it fail: a0769e0

run2 commented on July 2, 2024

Daniel, you need two things to reproduce this issue:

1) Have a batch size which does not divide evenly into your validation set size.
2) Train a net (with early stopping) such that it improves for a while (storing the weights at every improvement), and then the validation loss does not improve for n epochs, where n is your patience, and the net exits by loading the stored weights. Make sure the last epoch is NOT an improvement epoch. Then assert the validation loss.

Let me know if you cannot reproduce it; then I will have to create some dummy data, which will take some time.

dnouri commented on July 2, 2024

I think the bug in load_weights_from that you describe in this comment might have been fixed since your report.

If I understand your report right, this means that only point 1) remains. If that's so, would you kindly distill a description of bug 1) and put it into a separate issue and then we'll have a look.

dnouri commented on July 2, 2024

Closing due to lack of feedback.
