Comments (5)
Hi Thomas,
Glad you got the code running. Out of interest, what were the specs of the computer you ran it on? I'm interested because this version of the code has the "manual" parallel-processing aspects removed, although some of the MATLAB functions (FFT, model training, etc.) are inherently multithreaded. Training on my machine using already-extracted features takes around 2700 s (it should be much faster if you run the training again, as the features are saved so don't need extracting again – I'm assuming feature extraction is the stage that took up most of the time).
To answer your questions:
The output.csv is formatted as required by the Kaggle competition. Although the prediction column is called 'Class', you're correct: it should actually contain the prediction probability rather than the class label (which seems to be fairly common in Kaggle classification competitions). Generally, this means that if you apply a threshold yourself and submit class labels it'll still work, but it's likely to increase the loss and harm your score.
In this case the metric is AUC, and the Kaggle scoring script normalises the probabilities first (I think by z-score). Regardless of the normalisation, I don't think the absolute range of the values matters for calculating AUC – AUC depends only on how the predictions rank the positive class against the negative class, not on the absolute values.
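As a quick toy illustration of that last point (made-up numbers, not from the repository): perfcurve returns the same AUC after any monotonic rescaling of the scores:

```matlab
% Toy example: AUC only depends on the ranking of the scores
labels = [0 0 0 1 1 1]';
scores = [0.10 0.40 0.35 0.80 0.70 0.90]';

[~, ~, ~, auc1] = perfcurve(labels, scores, 1);            % raw scores
[~, ~, ~, auc2] = perfcurve(labels, zscore(scores), 1);    % standardised
[~, ~, ~, auc3] = perfcurve(labels, 100*scores - 7, 1);    % shifted/scaled

% auc1, auc2 and auc3 are all identical (1.0 for this toy data)
```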
The AUC scores printed to the command window by train.m are the scores for the two individual models on the training data. The values are the average of the AUCs across the 6 CV folds – the individual scores are in RBTg.AUCs{k} and SVMg.AUCs{k}, where k is the fold index.
You can also get the training AUC and plot for each model with the commands:
[X,Y,~, RBTAUC] = perfcurve(featuresTrain.labels, RBTg.predict(featuresTrain.dataSet), 1); plot(X,Y)
[X,Y,~, SVMAUC] = perfcurve(featuresTrain.labels, SVMg.predict(featuresTrain.dataSet), 1); plot(X,Y)
The score in the readme is from the combined predictions of both models on the test data (as in the ‘Class’ column of the submission file), i.e.
YPred = mean(zscore(RBTPred), zscore(SVMPred))
The combined predictions score significantly better than the individual model predictions, probably because the models both overfit slightly, but to different aspects of the data.
Predict.m won’t give a score itself as the labels for the test set are unknown, so the submission file needs to be submitted on Kaggle to get the final score – this may not score correctly for you, though, if you only have the first test set.
from kaggle-eeg.
My hardware is:
Processor: Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz (4 CPUs), ~2.2GHz
8 GB DDR3 RAM
Intel(R) HD Graphics 5500
Also, I decided to run the commands below:
[X,Y,~, RBTAUC] = perfcurve(featuresTrain.labels, RBTg.predict(featuresTrain.dataSet), 1); plot(X,Y)
[X,Y,~, SVMAUC] = perfcurve(featuresTrain.labels, SVMg.predict(featuresTrain.dataSet), 1); plot(X,Y)
Those two lines of code did give me a plot and two AUC scores: the SVM scored 0.9512 and the RBT scored 0.8353. So I am curious: why are these AUC scores higher than the ones generated by train.m? Is it because they don't use cross-validation?
I also tried to run:
YPred = mean(zscore2(RBTPred), zscore2(SVMPred))
But the command window kept saying Undefined function or variable 'RBTPred'.
I looked at zscore2.m and I am confused as to what arguments I should use, since the function is written as:
function [z,mu,sigma] = zscore2(x,flag,dim)
I have not fully understood your entire code since I am still learning the basics of MATLAB, so I wouldn't be surprised if there is just something minor that I am missing. I would be really thankful if you could help me generate this combined score of the SVM and RBT. And thanks for the help you have already given me so far.
I'm not entirely sure about the differences in the AUC values. The models both use k-fold cross-validation, and the fit for each fold has its own AUC score. The value printed to the command line is the average of these per-fold scores. The score from the code above is the AUC calculated after averaging the predictions across folds. These values will be different, but it's not immediately clear why they're so different...
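One mechanism that can make the two numbers diverge, shown on made-up values: if each fold ranks its own data perfectly but the folds' scores sit on different scales, the per-fold AUCs are perfect while the AUC over the pooled predictions is not:

```matlab
% Hypothetical per-fold predictions on different scales
labels = [0 0 1 1   0 0 1 1]';
preds  = [0.10 0.15 0.20 0.25   0.80 0.85 0.90 0.95]';
folds  = [1 1 1 1   2 2 2 2]';

aucs = zeros(2, 1);
for k = 1:2
    idx = folds == k;
    [~, ~, ~, aucs(k)] = perfcurve(labels(idx), preds(idx), 1);
end
meanFoldAUC = mean(aucs);                            % 1.0: each fold ranks perfectly

[~, ~, ~, pooledAUC] = perfcurve(labels, preds, 1);  % 0.75: fold 2's negatives
                                                     % outrank fold 1's positives
```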
Before you run
YPred = mean(zscore2(RBTPred), zscore2(SVMPred))
you need to get the predictions from each model and save them in the variables RBTPred and SVMPred, i.e.:
RBTPred = RBTg.predict(featuresTrain.dataSet)
SVMPred = SVMg.predict(featuresTrain.dataSet)
YPred = mean(zscore2(RBTPred), zscore2(SVMPred))
The zscore2 function is the same as MATLAB's zscore function, except that it handles NaNs by using the nanmean and nanstd functions (rather than mean and std). It doesn't need the last two inputs (flag and dim) in this case, so don't worry about those.
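For reference, a minimal sketch of what zscore2 does with a single input (a hypothetical reimplementation for illustration only; the repository's version also returns mu and sigma and accepts the flag and dim arguments):

```matlab
function z = zscore2Sketch(x)
% Standardise each column of x while ignoring NaNs,
% i.e. like zscore but built on nanmean/nanstd
mu    = nanmean(x, 1);
sigma = nanstd(x, 0, 1);
sigma(sigma == 0) = 1;   % avoid dividing by zero for constant columns
z = bsxfun(@rdivide, bsxfun(@minus, x, mu), sigma);
end
```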
Once you have the combined predictions (in YPred) you can then get the AUC in a similar way as with the individual models:
[X,Y,~, overallAUC] = perfcurve(featuresTrain.labels, YPred, 1); plot(X,Y)
Exactly what this value will be for the training data, I'm not sure!
Okay, I will send you an email now. Also, I tried running this series of commands that you showed me:
RBTPred = RBTg.predict(featuresTrain.dataSet)
SVMPred = SVMg.predict(featuresTrain.dataSet)
YPred = mean(zscore2(RBTPred), zscore2(SVMPred))
But I got an error on YPred. I did successfully save SVMPred and RBTPred, but the mean function keeps throwing this error:
Error using sum
Dimension argument must be a positive integer scalar within indexing range.
Error in mean (line 116)
y = sum(x, dim, flag)/size(x,dim);
So it looks like the sum function does not accept negative values. I noticed that the second argument of sum is dim, which seems to explain why it does not accept negative values, since a dimension cannot be negative. So should I try to combine RBTPred and SVMPred together? If so, how would I do it in a way that would give an accurate score?
Sorry for the confusion, that's actually the wrong command: it's missing the concatenation with [], so mean treats the second vector as a dimension argument (which is why sum complained about the dimension). It should be:
YPred = nanmean([zscore2(RBTPred), zscore2(SVMPred)], 2)
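On toy column vectors this normalise-then-average step looks like the following (made-up values, and MATLAB's built-in zscore in place of zscore2, assuming no NaNs):

```matlab
RBTPredToy = [0.2; 0.8; 0.5];   % predictions on a 0-1 scale
SVMPredToy = [10; 30; 20];      % same ordering, very different scale

% [] concatenates into a 3x2 matrix; zscore puts the models on a
% common scale first, then nanmean(..., 2) averages across the models
combined = nanmean([zscore(RBTPredToy), zscore(SVMPredToy)], 2);
% combined = [-1; 1; 0]
```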
When it comes to predicting the test set, there's an additional step as well. The sets are subdivided into short windows, but the Kaggle submission only needs one prediction per 10-minute file. The predictions for each 10-minute file are averaged in this part of predict.m:
% Predict for each epoch
% Using seizureModel.predict()
preds.Epochs.RBTg = RBTg.predict(featuresTest.dataSet);
preds.Epochs.SVMg = SVMg.predict(featuresTest.dataSet);
% Compress predictions nEpochs -> nFiles (nSegs)
% Take predictions for all epochs, reduces these down to length of fileList
% Total number of epochs
nEps = height(featuresTest.dataSet);
% Number of epochs per subSeg
eps = featuresTest.SSL.Of(1);
% Convert SubSegID to 1:height(fileList)
accArray = reshape(repmat((1:nEps/eps),eps,1), 1, nEps)';
% Use to accumulate values and average
fns = fieldnames(preds.Epochs);
for f = 1:numel(fns)
    fn = fns{f};
    preds.Segs.(fn) = accumarray(accArray, preds.Epochs.(fn))/eps;
end
clear accArray
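To see what the accArray/accumarray part is doing, here is a toy run with 3 files of 2 epochs each (made-up predictions; the real window count comes from featuresTest.SSL.Of(1)):

```matlab
nEps = 6;   % total epochs
eps  = 2;   % epochs per file (shadows the builtin eps, as in the source)
epochPreds = [0.1; 0.3; 0.8; 0.6; 0.2; 0.4];

% Same construction as in predict.m: maps each epoch to its file index
accArray = reshape(repmat((1:nEps/eps), eps, 1), 1, nEps)';   % [1 1 2 2 3 3]'

% accumarray sums the epoch predictions per file; dividing by eps averages
filePreds = accumarray(accArray, epochPreds) / eps;
% filePreds = [0.2; 0.7; 0.3]
```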
Then the final step (including the across-model normalisation and averaging bit):
% Combined sub: SVMg and RBTg
saveSub([note,'SVMgRBTg'], featuresTest.fileLists, ...
nanmean([zscore2(preds.Segs.RBTg),zscore2(preds.Segs.SVMg)],2), ...
params)