
Comments (5)

garethjns commented on May 30, 2024

Hi Thomas,

Glad you got the code running. Out of interest, what were the specs of the computer you ran it on? I'm interested because this version of the code has the "manual" parallel-processing aspects removed, although some of the MATLAB functions (FFT, model training, etc.) are inherently multithreaded. Training on my machine using already-extracted features takes around 2700 s (it should be much faster if you run the training again, as the features are saved and don't need extracting again – I'm assuming this is the stage that took up most of the time).

To answer your questions:

The output.csv is formatted as required by the Kaggle competition. Although the prediction column is called 'Class', you're correct: it should actually be the prediction probability rather than the class label (which seems to be fairly common in Kaggle classification competitions). Generally, this means that if you apply a threshold yourself and submit class labels it'll work, but it's likely to increase the loss and harm your score.

In this case, the scoring metric is AUC, and the Kaggle scoring script normalises the probabilities first (I think by zscore). Regardless of the normalisation, I don't think the absolute range of the values matters for calculating AUC – it's the relative ordering of the values that matters, rather than the absolute values.
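As a quick sanity check (a hypothetical toy example using MATLAB's perfcurve; the labels and scores here are made up), any monotonic rescaling of the scores leaves the AUC unchanged:

% Toy example: AUC depends only on the ordering of the scores
labels = [0; 0; 0; 1; 0; 1; 1; 1];
scores = [0.1; 0.3; 0.2; 0.8; 0.4; 0.9; 0.6; 0.7];
[~, ~, ~, auc1] = perfcurve(labels, scores, 1);
[~, ~, ~, auc2] = perfcurve(labels, zscore(scores), 1); % rescaled scores
% auc1 and auc2 are identical, as zscore preserves the ordering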

The AUC scores printed to the command window by train.m are the scores for the two individual models on the training data. The values are the average of the AUCs across the 6 CV folds – the individual scores are in RBTg.AUCs{k} and SVMg.AUCs{k}, where k is the fold.
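For example, to recompute that average yourself (a minimal sketch, assuming each cell of AUCs holds one scalar per fold, as described above):

% Sketch: average the per-fold training AUCs (assumes 6-fold CV as above)
rbtFoldAUCs = cell2mat(RBTg.AUCs); % one AUC per CV fold
meanRBTAUC = mean(rbtFoldAUCs)     % the value train.m prints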

You can also get the training AUC and plot for each model with the commands:

[X,Y,~, RBTAUC] = perfcurve(featuresTrain.labels, RBTg.predict(featuresTrain.dataSet), 1); plot(X,Y)
[X,Y,~, SVMAUC] = perfcurve(featuresTrain.labels, SVMg.predict(featuresTrain.dataSet), 1); plot(X,Y)

The score in the readme is for the combined predictions of both models on the test data (as in the 'Class' column of the submission file), i.e.

YPred = mean(zscore(RBTPred), zscore(SVMPred))

The combined predictions score significantly better than the individual model predictions, probably because the models both overfit slightly, but to different aspects of the data.

Predict.m won't give a score itself, as the labels for the test set are unknown, so the submission file needs to be submitted on Kaggle to get the final score – this may not score correctly for you, though, if you only have the first test set.


ThomasDang93 commented on May 30, 2024

My hardware is:

Processor: Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz (4 CPUs), ~2.2GHz
8 GB DDR3 Ram
Intel(R) HD Graphics 5500

Also, I decided to run the commands below:

[X,Y,~, RBTAUC] = perfcurve(featuresTrain.labels, RBTg.predict(featuresTrain.dataSet), 1); plot(X,Y)
[X,Y,~, SVMAUC] = perfcurve(featuresTrain.labels, SVMg.predict(featuresTrain.dataSet), 1); plot(X,Y)

Those two lines of code did give me a plot and two AUC scores. The SVM scored 0.9512 and the RBT scored 0.8353. So I am curious: why is it that these AUC scores are higher than the ones generated by train.m? Is it because they don't use cross-validation?

I also tried to run:
YPred = mean(zscore2(RBTPred), zscore2(SVMPred))
But the command window kept saying Undefined function or variable 'RBTPred'.

I looked at zscore2.m and I am confused as to what arguments I should use on zscore2, since the function is written as:

function [z,mu,sigma] = zscore2(x,flag,dim)

I have not fully understood your entire code since I am still learning the basics of MATLAB, so I wouldn't be surprised if there is just something minor that I am missing. I would be really thankful if you could help me generate this combined score of the SVM and RBT. And thanks for the help you have already given me so far.


garethjns commented on May 30, 2024

I'm not entirely sure about the differences in the AUC values. The models both use k-fold cross-validation, and the fit for each fold has its own AUC score. The value printed to the command line is the average of these scores. The score from the code above is the AUC calculated after averaging the predictions from each fold. These values will be different, but it's not immediately clear why they're so different...

When you run

YPred = mean(zscore2(RBTPred), zscore2(SVMPred))

get the predictions from each model and save them in the variables RBTPred and SVMPred first, i.e.:

RBTPred = RBTg.predict(featuresTrain.dataSet)
SVMPred = SVMg.predict(featuresTrain.dataSet)
YPred = mean(zscore2(RBTPred), zscore2(SVMPred))

The zscore2 function is the same as MATLAB's zscore function, but it handles NaNs by using nanmean and nanstd (rather than mean and std). It doesn't need the last two inputs (flag and dim) in this case, so don't worry about those.
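In other words, for a single vector input it's essentially doing this (a minimal sketch of the behaviour described above, not the exact file contents):

% Sketch of zscore2(x) for a vector input: standardise, ignoring NaNs
z = (x - nanmean(x)) ./ nanstd(x);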

Once you have the combined predictions (in YPred) you can then get the AUC in a similar way as with the individual models:

[X,Y,~, overallAUC] = perfcurve(featuresTrain.labels, YPred, 1); plot(X,Y)

Exactly what this value will be for the training data, I'm not sure!


ThomasDang93 commented on May 30, 2024

Okay, I will send you an email now. Also, I tried running this series of commands that you showed me:

RBTPred = RBTg.predict(featuresTrain.dataSet)
SVMPred = SVMg.predict(featuresTrain.dataSet)
YPred = mean(zscore2(RBTPred), zscore2(SVMPred))

But I got an error on YPred. I did successfully save SVMPred and RBTPred, but the mean function keeps throwing this error:

Error using sum
Dimension argument must be a positive integer scalar within indexing range.

Error in mean (line 116)
        y = sum(x, dim, flag)/size(x,dim);

So it looks like the sum function does not accept negative values. I noticed that the second argument of sum is dim, which seems to explain why it doesn't accept negative values, since a dimension cannot be negative. So should I try to combine RBTPred and SVMPred together? If so, how would I do it in a way that reflects an accurate score?


garethjns commented on May 30, 2024

Sorry for the confusion, that's actually the wrong command – it's missing the concatenation with []. It should be:

YPred = nanmean([zscore2(RBTPred), zscore2(SVMPred)], 2)
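The earlier form failed because MATLAB's mean (and nanmean) treats a second numeric argument as the dimension to average over, so zscore2(SVMPred) was being passed as dim – hence the "Dimension argument must be a positive integer" error. Concatenating the two column vectors into an nEpochs-by-2 matrix and averaging along dimension 2 gives one combined prediction per epoch:

% Combine the standardised predictions from the two models
RBTz = zscore2(RBTPred);          % nEpochs-by-1 column vector
SVMz = zscore2(SVMPred);          % nEpochs-by-1 column vector
YPred = nanmean([RBTz, SVMz], 2); % row-wise mean across the two models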

When it comes to predicting the test set, there's an additional step as well. The sets are subdivided into short windows (epochs), but the Kaggle submission only needs one prediction per 10-minute file. The predictions for each 10-minute file are averaged in this part of predict.m:

% Predict for each epoch
% Using seizureModel.predict()
preds.Epochs.RBTg = RBTg.predict(featuresTest.dataSet);
preds.Epochs.SVMg = SVMg.predict(featuresTest.dataSet);

% Compress predictions nEpochs -> nFiles (nSegs)
% Take predictions for all epochs and reduce them down to the length of fileList
% Total number of epochs
nEps = height(featuresTest.dataSet);
% Number of epochs per subSeg
eps = featuresTest.SSL.Of(1);

% Convert SubSegID to 1:height(fileList)
accArray = reshape(repmat((1:nEps/eps),eps,1), 1, nEps)';

% Use to accumulate values and average
fns = fieldnames(preds.Epochs);
for f = 1:numel(fns)
    fn = fns{f};
    preds.Segs.(fn) = accumarray(accArray, preds.Epochs.(fn))/eps;
end
clear accArray
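As a toy illustration of what the accumulation step does (hypothetical numbers: 2 files with 3 epochs each):

% Toy example: average 6 epoch predictions down to 2 file predictions
epochPreds = [0.1; 0.2; 0.3; 0.8; 0.9; 1.0];   % 3 epochs per file
accArray = reshape(repmat(1:2, 3, 1), 1, 6)';  % file IDs: [1;1;1;2;2;2]
filePreds = accumarray(accArray, epochPreds)/3 % gives [0.2; 0.9]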

Then the final step (including the across-model normalisation and averaging bit):

% Combined sub: SVMg and RBTg
saveSub([note,'SVMgRBTg'], featuresTest.fileLists, ...
    nanmean([zscore2(preds.Segs.RBTg),zscore2(preds.Segs.SVMg)],2), ...
    params)

