
Comments (6)

wymli commented on September 12, 2024

I ran your experiment and got results like this:

{"best_config": {"config": {"model": "ECC", "device": "cuda", "batch_size": 32, "learning_rate": 0.1, "classifier_epochs": 1000,
"optimizer": "SGD", "scheduler": {"class": "ECCLR", "args": {"gamma": 0.1, "step_size": 10}}, "loss": "MulticlassClassificationLoss",
"gradient_clipping": null, "early_stopper": {"class": "Patience", "args": {"patience": 500, "use_loss": false}}, "shuffle": true, "l2": 0.0,
"dropout": 0.05, "dropout_final": 0.1, "num_layers": 2, "dim_embedding": 32, "dataset": "bbbp"},
"TR_score": 0.9383519814222812, "VL_score": 0.8699779249448123}, "OUTER_TR": 0.8983603755755274, "OUTER_TS": 0.7846266237472319}

{"best_config": {"config": {"model": "ECC", "device": "cuda", "batch_size": 32, "learning_rate": 0.1, "classifier_epochs": 1000,
"optimizer": "SGD", "scheduler": {"class": "ECCLR", "args": {"gamma": 0.1, "step_size": 10}}, "loss": "MulticlassClassificationLoss",
"gradient_clipping": null, "early_stopper": {"class": "Patience", "args": {"patience": 500, "use_loss": true}}, "shuffle": true, "l2": 0.0,
"dropout": 0.05, "dropout_final": 0.1, "num_layers": 2, "dim_embedding": 64, "dataset": "bbbp"},
"TR_score": 0.9381171050338343, "VL_score": 0.840649692712906}, "OUTER_TR": 0.8686136180509775, "OUTER_TS": 0.8346322009403071}

{"best_config": {"config": {"model": "ECC", "device": "cuda", "batch_size": 32, "learning_rate": 0.1, "classifier_epochs": 1000,
"optimizer": "SGD", "scheduler": {"class": "ECCLR", "args": {"gamma": 0.1, "step_size": 10}}, "loss": "MulticlassClassificationLoss",
"gradient_clipping": null, "early_stopper": {"class": "Patience", "args": {"patience": 500, "use_loss": false}}, "shuffle": true, "l2": 0.0,
"dropout": 0.05, "dropout_final": 0.1, "num_layers": 2, "dim_embedding": 64, "dataset": "bbbp"},
"TR_score": 0.9463063212497737, "VL_score": 0.8487089925347578}, "OUTER_TR": 0.951178197317892, "OUTER_TS": 0.8494901713444759}

{"best_config": {"config": {"model": "ECC", "device": "cuda", "batch_size": 32, "learning_rate": 0.01, "classifier_epochs": 1000,
"optimizer": "SGD", "scheduler": {"class": "ECCLR", "args": {"gamma": 0.1, "step_size": 10}}, "loss": "MulticlassClassificationLoss",
"gradient_clipping": null, "early_stopper": {"class": "Patience", "args": {"patience": 500, "use_loss": false}}, "shuffle": true, "l2": 0.0,
"dropout": 0.05, "dropout_final": 0.1, "num_layers": 2, "dim_embedding": 64, "dataset": "bbbp"},
"TR_score": 0.8298400357665173, "VL_score": 0.8640853959716811}, "OUTER_TR": 0.8546555830876911, "OUTER_TS": 0.7710857477319286}

{"best_config": {"config": {"model": "ECC", "device": "cuda", "batch_size": 32, "learning_rate": 0.1, "classifier_epochs": 1000,
"optimizer": "SGD", "scheduler": {"class": "ECCLR", "args": {"gamma": 0.1, "step_size": 10}}, "loss": "MulticlassClassificationLoss",
"gradient_clipping": null, "early_stopper": {"class": "Patience", "args": {"patience": 500, "use_loss": false}}, "shuffle": true, "l2": 0.0,
"dropout": 0.05, "dropout_final": 0.1, "num_layers": 2, "dim_embedding": 64, "dataset": "bbbp"},
"TR_score": 0.8259295808005054, "VL_score": 0.8700949453926788}, "OUTER_TR": 0.899964474619089, "OUTER_TS": 0.7435839880284324}

I'm wondering how I can choose my real best config, given that I get a different best config in each of the k folds.
Should I choose the config with the best performance, or the one that occurs most often? Or some other choice?
Thanks!

diningphil commented on September 12, 2024

Hi,

This is an important point that we also discuss in our paper (see Sections 3-4 and the Appendix).

TL;DR: There is no overall "best configuration" when doing risk assessment.

When you use k-fold cross-validation to compute an estimate of the performance of your model (i.e., risk assessment), there exists no overall "real best config". You should average the k OUTER_TS results you obtained to get the score you need.
Instead, when we talk about "best config", we usually refer to the process of model selection on a validation set. However, we must not select a best configuration using the k test sets used for risk assessment. This is one of the things we have criticized in our paper.
In k-fold cross-validation for risk assessment, you end up doing model selection k times. Therefore, it is perfectly normal to have k different best configurations, and you should not choose one of them. If you select one of them based on the OUTER_TS (test) results, that amounts to "cheating". This is why it is much harder to get state-of-the-art results this way, but the resulting performance estimate will be robust :).
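
For example, here is a minimal sketch of the aggregation step (the file names are hypothetical; adapt them to wherever your per-fold assessment results are stored):

import json
import statistics

# Hypothetical paths: point them at your per-fold assessment result files.
result_files = [f"OUTER_FOLD_{k}/outer_results.json" for k in range(1, 6)]

outer_test_scores = []
for path in result_files:
    with open(path) as f:
        outer_test_scores.append(json.load(f)["OUTER_TS"])

# The performance estimate is the mean (and std) over the k outer test folds,
# not the score of any single "best" configuration.
mean_score = statistics.mean(outer_test_scores)
std_score = statistics.stdev(outer_test_scores)
print(f"Estimated test performance: {mean_score:.4f} ± {std_score:.4f}")

With the five OUTER_TS values you posted, this gives roughly 0.797 on bbbp.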

A picture is worth a thousand words (taken from the paper):

[Figure: "overall-process", the model selection and risk assessment schematic from the paper]

Note that if you have to deploy your solution in production, that is a different story. Once you have your performance estimate from the techniques above, one possibility is to re-train your model with a simple train/validation/test split to find the best configuration on a validation set. That gives you a single trained model that you can reuse for prediction.
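
As a minimal, generic sketch of that final step (using scikit-learn as a stand-in for brevity; this is not the repository's code):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy data standing in for your real dataset.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Model selection on the validation set only.
configs = [{"hidden_layer_sizes": (32,)}, {"hidden_layer_sizes": (64, 64)}]
best_cfg, best_val = None, -1.0
for cfg in configs:
    clf = MLPClassifier(max_iter=500, random_state=0, **cfg).fit(X_train, y_train)
    score = clf.score(X_val, y_val)
    if score > best_val:
        best_cfg, best_val = cfg, score

# Retrain once on train + validation: this single model is the one you deploy,
# while the reported performance estimate still comes from the averaged OUTER_TS scores.
final_model = MLPClassifier(max_iter=500, random_state=0, **best_cfg).fit(
    np.vstack([X_train, X_val]), np.concatenate([y_train, y_val])
)
print("held-out test score:", final_model.score(X_test, y_test))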

I hope this clarifies things.

pursueorigin commented on September 12, 2024

Hi,

In K_Fold_Assessment.py, the loop

for outer_k in range(self.outer_folds):

conducts model selection for each outer fold. For different folds, this procedure may therefore find different model configurations. That seems fine when selecting hyper-parameters such as the learning rate, but I do NOT think it is correct when selecting the number of layers or the hidden dimension, such as num_layers. For different folds, we should use the same network. With your method, different folds may have different neural network architectures. Am I right?

diningphil commented on September 12, 2024

Hi,
You are right, different folds may end up with different architectures, but that is the point of performing model selection. By treating the number of layers as a hyper-parameter, you are more flexible in finding the right architecture for each data fold. Why would you want to fix it in advance? It depends on the data, just like the learning rate does. Please let me know if this answers your question :).
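
As a generic illustration of this point (a scikit-learn stand-in, not the actual K_Fold_Assessment.py code), the architecture simply becomes part of the grid that each outer fold searches over:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, random_state=0)

# The architecture (number/size of layers) is just another hyper-parameter.
grid = {"hidden_layer_sizes": [(32,), (64,), (64, 64)],
        "learning_rate_init": [0.1, 0.01]}

outer_scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Inner model selection uses only the outer-fold training data.
    inner = GridSearchCV(MLPClassifier(max_iter=300, random_state=0), grid, cv=3)
    inner.fit(X[train_idx], y[train_idx])
    # The selected configuration, architecture included, can differ per fold.
    print("fold best config:", inner.best_params_)
    outer_scores.append(inner.score(X[test_idx], y[test_idx]))

print("risk assessment estimate:", np.mean(outer_scores))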

pursueorigin commented on September 12, 2024

Different architectures correspond to different models. If we use different architectures for different folds when evaluating the model performance, which model are we evaluating?

I haven't seen any previous work doing this. Could you please point me to literature supporting your method?

diningphil commented on September 12, 2024

Model assessment, by definition, evaluates the performance of the family of models associated with a specific method. There is no single architecture after model assessment, because the goal of this process is to get an estimate of the true error, not to produce an architecture.

The misconception that we are evaluating a single architecture leads to papers where authors find the best hyper-parameters on the test set rather than on the validation set (this was the reason we published our paper). There is no "best architecture" in general (see the "Understanding Machine Learning" book, in particular the no-free-lunch theorem).

However, do not take my word for it: See part 3 of Samy Bengio's lecture and the associated references on Statistical Learning Theory.
