
ukaeagroupproject's People

Contributors

dependabot[bot], lorenzozanisi, thandi1908, zcapjdb

ukaeagroupproject's Issues

Refactoring needed?

The pipeline could do with some refactoring, which will be useful in a CL setting.

I propose to have a CLTaskHandler class that loads the data for all tasks, and is given the sequence in which tasks should be learned.

class CLTaskHandler:
    def __init__(self, tasks, task_sequence):
        # ...initialise the data and put it in a dictionary that numbers each task
        # initialise self.classifier = None and self.regressor = None
        # initialise losses?
        ...

    def AL(self, acquisition_function, task_number):
        # calls another class that runs the AL pipeline for the given task number and
        # acquisition function; modifies the classifier and regressor attributes
        ...

    def shrink_perturb(self):
        # shrink-perturb the weights between tasks
        ...

    def orthogonalise_gradients(self):
        # orthogonalise the gradients before a new task
        ...

    def elastic_weight_consolidation(self):
        # implements EWC
        ...


The AL pipeline is a class which takes the train, valid and test set as input in the init (from here), has a method to perform the initial training (from here), and has a method run_pipeline which does everything that follows (from here). The latter can be broken down further, but that's not essential at the moment.
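
For concreteness, a rough skeleton of that class (purely illustrative - ALPipeline and initial_training are placeholder names; run_pipeline is the method mentioned above):

class ALPipeline:
    def __init__(self, train_dataset, valid_dataset, test_dataset):
        # Hold the three splits; models are attached after the initial training.
        self.train_dataset = train_dataset
        self.valid_dataset = valid_dataset
        self.test_dataset = test_dataset
        self.classifier = None
        self.regressor = None

    def initial_training(self):
        # Train the classifier and regressor on the initial labelled pool.
        ...

    def run_pipeline(self, n_iterations, acquisition_function):
        # Acquire candidates, label them, retrain and log losses -
        # i.e. everything that follows the initial training.
        for _ in range(n_iterations):
            ...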

select_unstable_data

I don't understand what this line does

I think the current implementation would retrain the classifier on all the misclassified points (correct me if I'm wrong). This is not what should happen: in a real-world situation we wouldn't already have the labels for all the candidates. Instead, we should trust the classifier at the start; it returns ~20% of the candidates as unstable. We get the labels for those 20% only, and some of those will be misclassified. In our situation, where the candidate pool is ~10,000, that's about 200 misclassified points, which is too few to retrain the classifier on. So we should build a buffer that fills at each iteration, and every time it reaches, say, 500, retrain the classifier. That way the classifier is trained for fewer epochs than the regressor in total. Does that make sense?
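
A rough sketch of the proposed buffer logic (illustrative only - handle_candidates, get_labels and retrain_classifier are placeholder names):

misclassified_buffer = []
BUFFER_RETRAIN_SIZE = 500  # retrain threshold suggested above

def handle_candidates(candidates, classifier, get_labels, retrain_classifier):
    # Trust the classifier: keep only the points it flags as unstable (~20%).
    predicted_unstable = [x for x in candidates if classifier.predict(x) == 1]
    # Get labels for those points only, and collect the misclassified ones.
    for x in predicted_unstable:
        if get_labels(x) != 1:
            misclassified_buffer.append(x)
    # Only retrain the classifier once enough misclassified points accumulate,
    # so in total it is trained for fewer epochs than the regressor.
    if len(misclassified_buffer) >= BUFFER_RETRAIN_SIZE:
        retrain_classifier(misclassified_buffer)
        misclassified_buffer.clear()
    return predicted_unstable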

Memory Issues in train_models fn

Currently, using the train_models function to train the models from scratch on the full dataset leads to memory issues for any batch size larger than 64; the issue seems to arise from the DataLoader object when attempting to iterate through it. This should probably be fixed if we have time.

consistency

There are consistency issues with how dataloaders and datasets are used. For example, Models.train_model() takes datasets as input, while pt.retrain_regressor() takes dataloaders. The predict method of the regressor also takes dataloaders, while regressor_uncertainty takes datasets.

It would be good to decide whether it's all dataloaders in main, or all datasets, with the dataloaders then defined inside each function - a small helper like the one sketched below would make the latter easy.
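
A sketch of such a helper (assuming PyTorch datasets; the name as_dataloader is illustrative):

from torch.utils.data import DataLoader, Dataset

def as_dataloader(dataset: Dataset, batch_size: int = 64, shuffle: bool = False) -> DataLoader:
    # Build a DataLoader on demand, so the main script only passes datasets around.
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)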

maybe it's the data on my end?

I've re-run the AL pipeline from main, and I am getting the same smaller losses (of order 0.08 instead of 0.12). Here's the config:
pipeline_config_fixscaler.txt. @thandi1908 @zcapjdb do you want to try running from main with this config and see what you get?

BTW I had re-run the data generation with a new table that Aaron gave me (it has other columns), so maybe the data has also changed. I think it makes sense that I get ~0.025 with AL on a small dataset, given that the benchmark below, from an MSc thesis on QLKNN, is for a 90% random reduction of the data:
[figure: benchmark losses from an MSc thesis on QLKNN]
(Note that the unscaled loss I'm getting is still around 10 - so not as bad as the 100s in @zcapjdb's case - though it should be more like 5-7.)

This does make me think we had an older dataset (for some reason out of my control)... What is the best place to upload the dataset so you can transfer it to your machines as quickly as possible?

Continual learning and active learning

@thandi1908 @zcapjdb please read before tomorrow

Suppose we train on dataset S1, then we enrich it with more data S2, where S1 and S2 come from the same distribution. The training performance will of course change, and the model will no longer behave on S1 exactly as the one trained on S1 alone did. While on average there might be some drift in the first model's performance on S1, this is expected - NNs overfit easily! So it's actually a good thing that the performance on S1 deteriorates.

The point is, you can't assess model performance based on the training set! The holdout set is the one that should always be looked at, if S1 and S2 are from the same distribution. This is probably why the shrink-perturb method doesn't matter too much for us.

BUT, if S1 and S2 are not from the same distribution, we would like our model to remember task 1 while training on task 2 - this is continual learning. This applies, for example, if you want to predict turbulence quantities from data for different tokamaks that share the same input variables. It entails two validation sets, one for task 1 and one for task 2. The model is trained on task 1, and then the task-1 validation loss is monitored while the model is trained on task 2. Sure, you can have two models, one trained on S1 and another fine-tuned on S2, but having just one would be better. Also, with AL you can pick only the most relevant data from S2, which means you'd need to run QuaLiKiz only a few times. This is what I'd like to do for the broader paper.

So we can start with just the AL pipeline on the full dataset, no shrink-perturb or similar (but we need a better acquisition function than MSE alone). Then we can move to using AL for continual learning from one task to another: train on S1 with AL, then train only on S2, still with AL, but using shrink-perturb or other techniques like elastic weight consolidation (https://aiperspectives.springeropen.com/articles/10.1186/s42467-021-00009-8). This paper is a good way to start, see pages 7 and 8 (they also detect a change in the probability distribution of the data, but we won't need that as we have labels for it).
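
For reference, a minimal sketch of the EWC penalty in PyTorch (assuming the task-1 parameters and a diagonal Fisher estimate have been stored; all names are illustrative):

import torch

def ewc_penalty(model, old_params, fisher_diag, lam=1.0):
    # old_params / fisher_diag: dicts keyed by parameter name, saved after task 1.
    penalty = torch.tensor(0.0)
    for name, param in model.named_parameters():
        if name in fisher_diag:
            penalty = penalty + (fisher_diag[name] * (param - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# Task-2 training would then minimise: loss = task_loss + ewc_penalty(model, old_params, fisher_diag, lam)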

Particle Flow Variables for Second Regressor

I believe we discussed using one of the particle flow variables for the second regressor, e.g. pfiitg_gb_div_efiitg_gb, but looking back at our results from when we initially created regressors for all the outputs, this was one of the variables that our regressors failed on.

I remember we discussed that there was some issue with the dataset we were provided. Do you know if this was one of those problematic variables, @lorenzozanisi?

code testing

Guys, please make sure the code that you push actually works, and when accepting a pull request check whether it would cause problems. For example, here you have an if clause to declare regressor_var, but a few lines below it is ignored.

scaling... again

Hi @zcapjdb @thandi1908

I was trying to reproduce your results because some CL results seemed odd, but I'm getting a strange thing with the scaling (again!).

Looking at main (which I assume is where the paper results come from), it seems that the line where the regressor's scaler is updated is commented out.

Un-commenting that line has quite a drastic effect on the results of the paper:
[figure: paper results with and without the regressor scaler update]
The left plot is similar to what is in the paper, but it's obtained with the scaler line commented out, which I thought we had settled on keeping uncommented (this is how I have been doing the CL all this time) - but maybe I don't recall correctly? Could you please confirm this when you can?

Output dictionary for two outputs

At the moment all losses for both regressors are appended to the same output dictionary. Wouldn't it be easier to have N dictionaries, one for each regressor, plus one dictionary just for the classifier? We could have
output_dict = {"flux1": dict_flux1, "flux2": dict_flux2, "class": dict_class}
@zcapjdb @thandi1908 wouldn't that make things easier both in the code and in the postprocessing?
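
For example (just a sketch of the shape, with placeholder loss keys):

output_dict = {
    "flux1": {"train_losses": [], "test_losses": []},  # first regressor
    "flux2": {"train_losses": [], "test_losses": []},  # second regressor
    "class": {"train_losses": [], "test_losses": []},  # classifier
}
# e.g. output_dict["flux1"]["test_losses"].append(test_loss)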

pkl file generator

Could you please push the script that generates the input pkl files? It used to be in QLKNNDataPreparation.ipynb in earlier commits, but "stable_label" is not set in there in the newest version. I assume it's just a matter of creating a "stable_label" column and making it 0 or 1 depending on whether the leading flux is 0 or positive?
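
If that guess is right, it would be something like the following (a sketch, assuming the data sits in a pandas DataFrame and "leading_flux" stands in for the actual leading-flux column; the 0/1 convention should be checked against the old notebook):

import pandas as pd

df = pd.read_pickle("inputs.pkl")  # placeholder path
# Assumption: 0 = stable (leading flux is 0), 1 = unstable (leading flux is positive).
df["stable_label"] = (df["leading_flux"] > 0).astype(int)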

scaler

At the moment the scaler is initialised at the start of the pipeline and never recomputed when new training data (with labels) is acquired. This is probably suboptimal: as we acquire more data, the original scaler is likely to become more and more "wrong". Changing this is perhaps a bit cumbersome right now, but we'll need to do it at some point.
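
When we do get to it, a simple option would be to refit the scaler on the enlarged labelled set at each iteration (a sketch, assuming an sklearn StandardScaler and numpy arrays):

import numpy as np
from sklearn.preprocessing import StandardScaler

def refit_scaler(labelled_inputs: np.ndarray) -> StandardScaler:
    # Refit on everything labelled so far, so the scaler tracks the growing training set.
    scaler = StandardScaler()
    scaler.fit(labelled_inputs)
    return scaler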

Found the damn error in CL

Found the damn error!

There was something wrong in the scaling. Basically, the correct approach is to normalise data from new tasks with the unit-Gaussian scaler of the previous task, and to update the normalisation as data from the new distribution comes in. Instead, I was normalising the incoming data to a unit Gaussian according to the distribution of the new task, then appending it to the data of the old task, which had been normalised with the old scaler. That's why I was getting all those spikes in the test curve - the data distribution was not changing smoothly at all!
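
In code, the intended behaviour is roughly the following (a sketch using sklearn's StandardScaler and its partial_fit for the running update, not the actual code in the commit):

import numpy as np
from sklearn.preprocessing import StandardScaler

def add_new_task_data(old_scaler: StandardScaler, old_scaled: np.ndarray, new_inputs: np.ndarray):
    # Normalise the incoming task-2 data with the scaler fitted on task 1...
    new_scaled = old_scaler.transform(new_inputs)
    # ...append it to the already-scaled task-1 data...
    combined = np.concatenate([old_scaled, new_scaled])
    # ...and update the running statistics so the normalisation drifts smoothly
    # towards the new distribution as more of its data comes in.
    old_scaler.partial_fit(new_inputs)
    return combined, old_scaler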

See diff from this commit (b12c3fe) to last push (c8ebc2b).

This is what I get now and I think it makes way more sense:
[figure: test-loss curves after the scaling fix]

Incidentally, yes, active learning is still better than random (I've used the version where the regressor scaler is updated). @zcapjdb @thandi1908 - thoughts?

The final scatter plot is this:
[figure: final scatter plot]

There are a few nasty outliers which probably skew the MSE a bit.

Just for full disclosure, the previous way the scaling was done also resulted in heavily negative predictions, as noted in #116:
[figure: predictions from the old scaling, showing heavily negative values]

scaler

Hi Jackson and Thandi,

I'm starting to play with your code and I'm not sure I understand a couple of things when you scale the data

  1. In the function prepare_model you don't seem to be using the same scaler fitted on the training data for validation and testing as well - is that right?
  2. You do have an "own_scaler" argument in the scale method of the QLKNNDataset class:
def scale(self, own_scaler: object = None):
    if own_scaler is not None:
        self.data = ScaleData(self.data, own_scaler)

    self.data, QLKNNDataset.scaler = ScaleData(self.data, self.scaler)

In your prepare_model function you need to pass the training scaler when scaling the validation and test data. There is also an else statement missing: the self.data assigned inside the if statement will be overridden by the last line. And why do you assign the scaler to QLKNNDataset.scaler? Shouldn't it be self.data, self.scaler = ScaleData(self.data, self.scaler)?
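
For clarity, this is the kind of fix being suggested (a sketch only, assuming ScaleData returns just the scaled data when given an existing scaler, and (data, scaler) when fitting a new one):

def scale(self, own_scaler: object = None):
    if own_scaler is not None:
        # Reuse a scaler fitted elsewhere (e.g. on the training set).
        self.data = ScaleData(self.data, own_scaler)
    else:
        # Otherwise fit this dataset's own scaler and keep it on the instance,
        # rather than on the QLKNNDataset class.
        self.data, self.scaler = ScaleData(self.data, self.scaler)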

hyperparameter tuning on random baseline

If we are to claim that the random baseline (i.e. no active learning) performs better than AL, then we must be sure that the random baseline is properly tuned. While a full architecture search is too much, we should at least try different values of dropout and weight decay.
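
Even a small grid would do, something like this (a sketch; train_and_evaluate is a placeholder for whatever routine trains the random baseline and returns its holdout loss):

import itertools

dropouts = [0.0, 0.1, 0.2]
weight_decays = [0.0, 1e-5, 1e-4]

results = {}
for dropout, weight_decay in itertools.product(dropouts, weight_decays):
    results[(dropout, weight_decay)] = train_and_evaluate(dropout=dropout, weight_decay=weight_decay)

best_dropout, best_weight_decay = min(results, key=results.get)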

regressor predict method

In the regressor, the predict method is a bit obscure (here). What is order_outputs? Why is it important to return idx_array?

Test loss confusion

There are currently three different parts of the code that append to the test loss of the output dictionary: the holdout loss gets added twice, here and here, and the test_loss variable is also added to it here.

This is confusing, and it should be clearer what each entry is referring to.

Dataset problems

Hi @lorenzozanisi

we are struggling to get a variable for the second regressor that works well. The variable we have been investigating is "efeitg_gb_div_efiitg_gb". For this variable we expect that the only NaN values it contains are in rows where efiitg is 0.

If we look only at the subset of the data where "efeitg_gb_div_efiitg_gb" is NaN, we expect all the corresponding "stable_label" values to be 0. However, this isn't the case. The situation is summarised in the screenshot attached. Do you know why some unstable points would lead to a NaN value for "efeitg_gb_div_efiitg_gb", or is this an issue in the dataset?
[screenshot: summary of "stable_label" values for rows where "efeitg_gb_div_efiitg_gb" is NaN]
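
A minimal way to reproduce the check described above (a sketch; the pkl path is a placeholder):

import pandas as pd

df = pd.read_pickle("dataset.pkl")  # placeholder path
nan_rows = df[df["efeitg_gb_div_efiitg_gb"].isna()]
# Per the expectation above, all of these should have stable_label == 0, but some do not:
print(nan_rows["stable_label"].value_counts())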

holdout set

I think you are using a validation set that changes at every iteration of the pipeline, is that right? See here: valid_loader gets modified every iteration.

You take a holdout set at the start of the pipeline here; perhaps that should be used as the validation set for the purposes of early stopping and evaluation?

distance penalty

The distance penalty shouldn't be based on outputs, but rather on inputs. At the moment it's cdist(pred.T, pred.T); instead it should be cdist(input, input).
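
i.e. something along these lines (a sketch with scipy's cdist; candidate_inputs stands in for the candidate input matrix, shape n_samples x n_features):

import numpy as np
from scipy.spatial.distance import cdist

def distance_penalty(candidate_inputs: np.ndarray) -> np.ndarray:
    # Pairwise distances between candidate inputs, rather than between predictions.
    return cdist(candidate_inputs, candidate_inputs)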

full batch training

At the line here, I think that implies you were retraining on a single batch the size of the training data - was that intended? That's quite bad for generalisation. I'll change it to minibatches.
